Building and Measuring Privacy-Preserving Predictive Blacklists

by Luca Melis, et al.

Collaborative security initiatives are increasingly often advocated to improve timeliness and effectiveness of threat mitigation. Among these, collaborative predictive blacklisting (CPB) aims to forecast attack sources based on alerts contributed by multiple organizations that might be targeted in similar ways. Alas, CPB proposals thus far have only focused on improving hit counts, but overlooked the impact of collaboration on false positives and false negatives. Moreover, sharing threat intelligence often prompts important privacy, confidentiality, and liability issues. In this paper, we first provide a comprehensive measurement analysis of two state-of-the-art CPB systems: one that uses a trusted central party to collect alerts [Soldo et al., Infocom'10] and a peer-to-peer one relying on controlled data sharing [Freudiger et al., DIMVA'15], studying the impact of collaboration on both correct and incorrect predictions. Then, we present a novel privacy-friendly approach that significantly improves over previous work, achieving a better balance of true and false positive rates, while minimizing information disclosure. Finally, we present an extension that allows our system to scale to very large numbers of organizations.








1 Introduction

Filtering connections from/to hosts regarded as malicious is a common practice to reduce the number and the impact of attacks. As it is infeasible to perform expensive classification tasks on each connection, filtering is typically performed using periodically updated lists of suspicious hosts – or blacklists. These can be created locally or obtained from alert repositories, such as DShield.

Zhang et al. [36] are the first to introduce the concept of collaborative predictive blacklisting (CPB), whereby different entities send their logs to a trusted entity, which in turn provides customized blacklists based on relevance ranking. The intuition is that attacks are often correlated, mounted by the same sources against different networks. In fact, Katti et al. [22] show that attack correlation persists over time, suggesting that collaboration between victims can significantly improve malicious IP detection time. Soldo et al. [33] then improve on [36] relying on clustering and implicit recommendation. Overall, collaborative approaches to threat mitigation, besides CPB, are increasingly advocated, with more and more efforts promoting information sharing, e.g., by CERT [6], Facebook [2], and the White House [34].

In this paper, we focus on two main challenges concerning the impact of collaboration on (i) false positives/negatives and (ii) privacy. Prior CPB proposals [33, 36] only focus on improving “hit counts”, i.e., the number of successfully predicted attacks, but fail to account for incorrect predictions (i.e., false positives/negatives). Furthermore, real-world deployment of collaborative blacklisting is hindered by confidentiality, liability, trust, and competitiveness concerns, as sharing alerts could harm an organization’s reputation and lead to disclosure of sensitive information about customers and business practices [1, 3]. Freudiger et al. [16] are the first to investigate privacy-friendly approaches to CPB: their intuition is to let organizations interact in a pairwise manner to privately estimate the benefits of collaboration, and then have them share data with the entities that are likely to yield the most benefits. However, as discussed later in Section 4, their pairwise protocol only scales to a few collaborators.

1.1 Roadmap

First, we provide an experimental evaluation of existing collaborative predictive blacklisting (CPB) proposals (Section 4). We use alerts contributed to DShield by 70 organizations, reporting an average of 4,000 daily events over a 15-day time window. We re-implement and compare the centralized technique by Soldo et al. [33] vs. the peer-to-peer one based on controlled data sharing by Freudiger et al. [16]. We find that the former achieves high hit counts (almost doubling correct predictions compared to no collaboration) and relatively high recall, but its accuracy is ultimately quite poor (F1 around 14%) due to a significant increase in false positives (low precision). The latter, in contrast, only slightly improves hit counts compared to no collaboration, but also yields fewer incorrect predictions, thus resulting in better accuracy overall.

Then, we propose a novel approach aiming to capture the best of both worlds (Section 5). We use a hybrid model relying on a semi-trusted authority (STA), which acts as a coordinating entity to facilitate clustering without having access to the raw data. The STA clusters contributors based on the similarity of their logs (without seeing these logs), and helps organizations in the same cluster share relevant data. We experiment with a few clustering algorithms, using the number of common attacks – computed in a privacy-preserving way – as a measure of similarity, and with privacy-friendly within-cluster sharing strategies, i.e., only disclosing the details of common attacks and/or privately discovering correlated attacks. Our experimental results show that our hybrid model balances out the increase in hit counts and in false positives brought about by information sharing, achieving up to a 4x increase in hit counts compared to Freudiger et al. [16] and doubling overall accuracy compared to Soldo et al. [33]. Finally, we present a scalability analysis of our scheme (Section 6), introducing a simple variant that allows it to efficiently scale to very large numbers of organizations.

1.2 Contributions

In summary, our paper makes two main contributions. First, we present a measurement study of existing CPB approaches aiming to capture the overall effects of collaboration, highlighting important open problems. Second, we introduce a novel, privacy-friendly and scalable approach to CPB that achieves a better balance between hit counts and incorrect predictions. Our system minimizes the amount of information disclosed in the process and achieves scalability in the presence of large numbers of collaborating entities.

2 Related Work

Collaborative Intrusion Detection. Katti et al. [22] are among the first to measure correlated attacks, i.e., attacks mounted by the same sources against different networks, establishing that they are very common yet highly targeted. They show that attack correlation persists over time and suggest that collaboration between victims could significantly improve malicious IP detection time. In [36], Zhang et al. introduce highly predictive blacklisting, having different organizations contribute alerts to a central repository, such as DShield, which in turn provides them with daily personalized (predictive) blacklists. The prediction uses a relevance ranking scheme similar to PageRank, measuring the correlation of an attacker to a contributor based on their history as well as the attacker’s recent log production patterns. Then, Soldo et al. [33] improve on [36] using an implicit recommendation system to discover similar victims as well as clusters of correlated victims and attackers. In their model, the presence of attacks performed by the same source around the same time leads to stronger similarity among victims, and a neighborhood model (k-NN) is applied to cluster similar victims. Cross Association (CA) co-clustering [7] is then used to discover groups of correlated attackers and victims, and prediction within the cluster is done via a time-series algorithm – Exponentially Weighted Moving Average (EWMA) – capturing attacks’ temporal trends.

Beyond blacklisting, other work focuses on other collaborative security problems. Felegyhazi et al. [14] perform proactive prediction of malicious domain use: starting from a seed of confirmed bad domains, they predict clusters of related domains based on name server features (zone files containing sub-domains and authoritative name servers), and infer new bad domains. Liu et al. [26], based on externally observable properties of an organization’s network, aim to predict breaches without the organization’s cooperation. Woods et al. [35] apply data mining to identify subsets of shared information that are semantically related, while Garrido et al. [17] introduce game-theoretic models to analyze the effects of cyber-security information sharing among organizations. Sirivianos et al. [32] propose a collaborative system that enables hosts with no email classification functionality to check whether a host is a spammer or not. Each host then assesses the trustworthiness of spam reporters by auditing their reports and leveraging the social network of the reporters’ administrators.

Privacy In Collaborative Intrusion Detection. Porras and Shmatikov [30] discuss privacy risks prompted by sharing security-related data and propose anonymization and sanitization techniques to address them. However, follow-up work [9, 25] demonstrates that these techniques make data less useful and anyway prone to de-anonymization.

Burkhart et al. [5] introduce a few privacy-preserving protocols based on secure multiparty computation (MPC) for aggregation of network statistics. This is also explored in [4], where entities send encrypted data to a central repository that aggregates contributions. However, statistics only identify the most prolific attack sources and yield global models, which, as discussed in [36], miss a significant number of attacks and yield poor prediction performance. Nagaraja et al. [28] introduce an inference algorithm, BotGrep, to privately discover botnet hosts and links in network traffic, relying on Private Set Intersection [12]. Davidson et al. [10] propose a game-theoretic model for software vulnerability sharing between two competing parties. Their protocol relies on a private set operation (PSO) technique to limit the amount of information disclosed; however, it does not scale to more than two entities. Finally, Freudiger et al. [16] focus on collaborative predictive blacklisting based on a pairwise controlled data sharing approach. They focus on identifying which metrics (e.g., number of common attacks) can be used to privately estimate the benefits of collaboration between two organizations, rather than proposing a deployable system. In fact, as discussed later, their pairwise approach does not scale to many organizations.

3 Preliminaries

3.1 Background

Time Series Prediction. We use the Exponentially Weighted Moving Average (EWMA) to perform prediction. Given a signal r(t) over time, we denote with r̂(T+1) the predicted value of r at time T+1, given past observations up to time T. The predicted signal is computed as:

    r̂(T+1) = α · Σ_{t=T−w+1}^{T} (1−α)^{T−t} · r(t)

where α ∈ (0, 1] is a smoothing coefficient, w denotes the training window, and T+1 is the time slot to be predicted. For small values of α, EWMA aggregates past information almost uniformly across the training window, while, with a large α, the prediction algorithm focuses more on events taking place in the recent past.
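As a concrete illustration, the EWMA predictor above can be sketched in a few lines of Python (an illustrative sketch with our own variable names, not the paper’s implementation):

```python
def ewma_predict(r, alpha=0.9):
    """Predict the next value of a signal r (one observation per time
    slot in the training window) via an exponentially weighted moving
    average: the larger alpha, the more weight recent observations get."""
    # weight (1 - alpha)^(T - t) for each past observation r(t)
    weighted = sum(((1 - alpha) ** (len(r) - 1 - t)) * x
                   for t, x in enumerate(r))
    return alpha * weighted
```

For a constant signal the prediction converges to the signal’s value, since the geometric weights sum to (1 − (1−α)^w)/1 after multiplying by α.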


To evaluate the performance of the predictions, we use true and false positives and negatives, denoted, respectively, as TP, FP, TN, FN. We derive precision (PPV), recall (TPR), and F1-score, respectively, as TP/(TP+FP), TP/(TP+FN), and the harmonic mean of PPV and TPR. We also measure the average improvement/increase in TP, FP, and FN compared to when no collaboration occurs between organizations, using, respectively, ΔTP = (TP′ − TP)/TP, ΔFP = (FP′ − FP)/FP, and ΔFN = (FN′ − FN)/FN, where the primed notation denotes values after collaboration.
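For concreteness, these metrics can be computed as follows (a minimal helper sketch; function names are our own):

```python
def prf1(tp, fp, fn):
    """Return precision (PPV), recall (TPR), and F1 from raw counts."""
    ppv = tp / (tp + fp)                 # precision
    tpr = tp / (tp + fn)                 # recall
    f1 = 2 * ppv * tpr / (ppv + tpr)     # harmonic mean of PPV and TPR
    return ppv, tpr, f1

def delta(after, before):
    """Relative change vs. no collaboration, e.g. delta(TP', TP)."""
    return (after - before) / before
```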

Cryptographic protocols. In the rest of the paper, we use a number of cryptographic protocols for privacy-preserving computations. To ease presentation, we defer their description to Appendix A.

3.2 Dataset

Aiming to design a meaningful empirical analysis of CPB, we gather a dataset of blacklisted IP addresses from DShield, a collaborative firewall log correlation system to which various organizations volunteer daily alerts. Each entry in the logs consists of a pseudonymized Contributor ID (the target), the source IP address (the attacker), source and target port numbers, and a timestamp. With DShield’s permission, we have collected logs using a JavaScript web crawler, gathering, on average, 10 million logs every day. We exclude entries for invalid or non-routable IP addresses and discard port numbers; then, for each IP address, we extract its /24 subnet and use /24 addresses for all experiments, following experimental choices made in prior work [16, 33, 36].
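This preprocessing step – filtering non-routable addresses and mapping each IP to its /24 subnet – can be done, for instance, with Python’s ipaddress module (an illustrative sketch, not our crawler code):

```python
import ipaddress

def to_slash24(ip_str):
    """Map an IPv4 address to the network address of its /24 subnet.
    Returns None for invalid or non-routable (private/reserved) inputs."""
    try:
        ip = ipaddress.ip_address(ip_str)
    except ValueError:
        return None                      # malformed address
    if ip.version != 4 or not ip.is_global:
        return None                      # non-IPv4 or non-routable
    net = ipaddress.ip_network(f"{ip_str}/24", strict=False)
    return str(net.network_address)
```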

Training and Testing Sets. We use the DShield dataset both as a training set and a testing set (i.e., ground truth), considering a sliding window of 5 days for training and 1 day for testing, as done in [16, 33]. We select a 15-day period (i.e., 10 sliding windows) and restrict our evaluations to a reasonably-sized sample of regularly contributing organizations. We select the top-100 contributors, based on the number of unique IPs reported, provided that they report logs every day. Most contributors (around 60%) submit less than 100K logs, while 20% submit between 100K and 500K, and only a handful of organizations contribute very large amounts of logs (above 1M). We then pick 70 organizations, for each time window, leaving out the top-10 and the bottom-20 contributors. We do so, like in previous work [16, 33], to minimize bias. Our final sample dataset includes 30 million attacks, contributed by 118 unique organizations (as the 70 contributors selected in each time window vary) over 15 days, each reporting a daily average of 600 suspicious (unique) IPs and 4,000 attack events.

Note that we have also repeated our experiments on a larger number of organizations (150) and on two more sets of DShield logs using 15-day periods from other time intervals, but have not found any significant difference in the results. Moreover, we remark that the way we count FP is an upper bound since we do not have ground truth as to whether the absence of an IP from the testing set occurs when the IP is not suspicious or if it simply does not generate requests. Nonetheless, it does not really matter for our evaluations since our main goal is to compare different approaches with each other.

Notation. We use O = {O₁, …, Oₙ} to denote a group of organizations, where each Oᵢ holds a dataset Dᵢ of alerts, i.e., suspicious IP addresses along with the related timestamps. We aim to predict the IP addresses generating attacks against each Oᵢ in the next day, using, as the training set, both its local dataset Dᵢ and the set of suspicious IP addresses obtained by collaborating with other organizations. As discussed above, for each time window, we consider n = 70 organizations using alerts collected from DShield.

4 Existing Collaborative Predictive Blacklisting Approaches

4.1 Soldo et al. [33]’s Implicit Recommendation (No Privacy)

We first evaluate Soldo et al. [33]’s CPB approach based on implicit recommendation. We do so with a twofold goal: (1) to evaluate false positives and false negatives, which were not taken into consideration in [33], and (2) to compare against the privacy-friendly approaches presented later. Essentially, Soldo et al.’s work builds on [36], which relies on a relevance ranking scheme similar to PageRank, measuring the correlation of an attacker to a contributor based on their history as well as the attacker’s recent log production patterns. Soldo et al. improve on [36] by using an implicit recommendation system to discover similar victims as well as groups of correlated victims and attackers. The presence of attacks performed by the same source around the same time leads to stronger victim similarity, and a neighborhood model (k-NN) is applied to cluster similar victims. Cross Association (CA) co-clustering [7] is then used to discover groups of correlated attackers and victims, and prediction within the cluster is done via EWMA to capture attacks’ temporal trends. We have re-implemented their system in Python, used Chakrabarti’s implementation of Cross Association (CA) as per [7], and run the experiments on the dataset introduced in Section 3.2.

Figure 1: Soldo et al. [33]: (a) TP improvement, (b) FP increase (y-axis in log scale), (c) FN increase, (d) TPR, (e) Precision, and (f) F1 measure.

We start by measuring the basic predictor, which only relies on a local EWMA time series algorithm (TS), using the smoothing coefficient α that yields the best results; we then apply the co-clustering techniques (TS-CA) and, finally, implement their full scheme by combining k-NN to cluster victims based on their similarity with CA and TS (TS-CA-k-NN). Fig. 1 illustrates the improvement/increase in TP, FP, FN (compared to the TS baseline) as well as TPR, PPV, and F1, with various values of k (ranging from 1 to 35) used by the k-NN algorithm to discover similar organizations. Obviously, the k-NN parameter does not affect TS-CA and TS.

Fig. 1(a) shows that, with TS-CA-k-NN, ΔTP increases significantly with k, almost doubling the “hit counts” compared to the TS baseline, whereas TS-CA improves less. On the other hand, however, FP increases too, 5- to 50-fold, as clusters become bigger (Fig. 1(b)), and, naturally, this stark increase in FP leads to low precision, as shown in Fig. 1(e). FNs also always increase compared to TS (Fig. 1(c)): specifically, they double with TS-CA, and increase somewhat less for larger k values with TS-CA-k-NN. Moreover, k affects TPR (Fig. 1(d)). The large ΔTP does not correspond to a comparable increase in TPR, due to the poor FN performance: TS-CA-k-NN achieves the highest ΔTP but only a modest TPR gain over the baseline TS. Overall, Soldo et al.’s techniques do not perform well in practice, as they yield lower F1 measures (both with TS-CA and with TS-CA-k-NN) than a simple local time-series prediction – see Fig. 1(f).

Figure 2: Freudiger et al. [16] (a) TP improvement, (b) FP increase, (c) FN increase, (d) TPR, (e) Precision and (f) F1 measure, with increasing percentage of global pairs (approach (A)).
Figure 3: Freudiger et al. [16] (a) TP improvement, (b) FP increase, (c) FN increase, (d) TPR, (e) Precision and (f) F1 measure, with increasing number of local pairs (approach (B)).

4.2 Freudiger et al. [16]’s Peer-to-Peer Controlled Data Sharing

Next, we evaluate the system by Freudiger et al. [16], whereby organizations interact pairwise, aiming to privately estimate the benefits of collaboration, and then share data with the entities that are likely to yield the most benefits. The authors also use DShield data and perform prediction using EWMA. They find that: (1) the number of common attacks is the best predictor of benefits, and it can be estimated privately using Private Set Intersection Cardinality (PSI-CA) [11]; and (2) sharing only the intersection of attacks – which can be done privately using Private Set Intersection (PSI) [12] – is almost as beneficial as sharing everything. (See Appendix A for background on these cryptographic protocols.) Their goal is really to assess benefit estimation/sharing strategies, rather than to focus on deployment. They assume a network of 100 organizations, select the “top 50” among all possible 4,950 pairs (in terms of estimated benefits), and only experiment on those. Naturally, without a coordinating entity, it is impossible to rank the pairs, so they suggest that collaboration should take place either between organizations whose estimated benefits are above a threshold (although it is not stated how to set this threshold), or by each organization selecting the k entities with the biggest estimated benefits (but they do not experiment with k or discuss how it impacts overhead or true/false positives). Thus, we replicate both approaches, specifically, by experimenting: (A) with the top 1% to 5% of global pairs, and (B) having each organization pick the k most similar organizations.
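Approach (B) can be sketched as follows; in the real protocol the intersection cardinality would be computed obliviously via PSI-CA, so the plaintext sets below are only a stand-in for illustration (names are ours):

```python
def top_k_partners(my_ips, candidates, k):
    """Rank candidate partners by the number of common attacker subnets
    and return the identifiers of the k most similar ones.
    candidates: dict mapping org id -> set of attacker subnets."""
    ranked = sorted(candidates,
                    key=lambda org: len(my_ips & candidates[org]),
                    reverse=True)
    return ranked[:k]
```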

Fig. 2 shows the improvement/increase in TP, FP, FN (compared to a baseline with no sharing) as well as TPR, PPV and F1 with increasing percentage of global pairs (A). Similarly, Fig. 3 presents the results for approach (B).

Looking at ΔTP, (A) yields a modest increase when the top global pairs are selected, whereas for (B), i.e., picking local pairs, ΔTP increases along with the number of local pairs selected: the more of its most similar organizations an organization collaborates with, the larger the improvement (see Fig. 3(a)). Figs. 2(b) and 3(b) show that (A) has a rather small ΔFP, while (B)’s ΔFP is driven by the number of pairs that each organization picks for collaboration: collaborating with 30 others yields a noticeably larger ΔFP, on average, than collaborating with 5. Moreover, Figs. 2(c) and 3(c) illustrate that both approaches achieve a decrease in false negatives, with the second approach achieving bigger decreases as the number of collaborators increases. Finally, in (A), the F1 measure remains fairly stable as the percentage of selected pairs grows, while, in (B), F1 is only slightly affected by the number of pairs that each organization chooses for collaboration, with the best F1 score obtained for larger numbers of collaborators. Overall, both approaches improve precision and recall of the system, resulting in an enhanced F1 score compared to a local approach. Although the increase in TP is not as high as with the non-private approach of [33], a more balanced increase in false positives and a decrease in false negatives seem possible. However, the system is limited in scalability, due to its peer-to-peer design. Moreover, when organizations share only the intersection of their datasets, they cannot obtain events about new attackers that they have never witnessed before in their own datasets.

5 A Novel Approach to Privacy-Friendly CPB

In this section, we introduce a novel privacy-friendly system which relies on a semi-trusted authority, or STA, acting as a coordinating entity to facilitate clustering without having access to the raw data. In other words, the STA clusters contributors based on the similarity of their logs (without accessing these logs), and helps organizations in the same cluster to share relevant logs.

5.1 Overview

Our system involves four steps: (1) First, organizations interact in a pairwise manner to privately compute a similarity measure over their logs, based on the number of common attacks. Then, (2) the STA collects the similarity measures from each organization into a matrix, denoted O2O, and performs clustering on it. Next, (3) the STA reports to each organization the identifiers of the other organizations in the same cluster (if any), so that they can collaboratively, yet privately, share logs to boost the accuracy of their predictions, by sharing either common attacks (intersection), correlated attacks (IP2IP), or both. (For comparison, we also consider baseline approaches, i.e., sharing nothing (local) or sharing everything (global).) Finally, (4) each organization performs EWMA prediction based on its own logs, augmented with those shared by entities in the same cluster.

This approach is hybrid in that interaction is privacy-friendly, using a central party (STA) which is not trusted with data in-the-clear, but only with similarity measures. Moreover, we follow a data minimization approach as organizations only share information about common and correlated attacks: specifically, in (1), the number of common attacks is computed privately using PSI-CA [11], while, in (3), sharing of common and correlated attacks is also privacy-preserving, as we rely on PSI [12] and private aggregation [27].
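Steps (1) and (2) can be illustrated as follows: each pair’s number of common attacks (obtained via PSI-CA in the actual protocol; plaintext sets stand in here for illustration) is assembled by the STA into the O2O matrix, on which it then runs a clustering algorithm (a hypothetical sketch, not our implementation):

```python
def build_o2o(org_logs):
    """org_logs: list of attacker-subnet sets, one per organization.
    Returns the symmetric O2O matrix of pairwise common-attack counts,
    which the STA can feed to a clustering algorithm (e.g. agglomerative
    clustering, as in our experiments)."""
    n = len(org_logs)
    return [[len(org_logs[i] & org_logs[j]) for j in range(n)]
            for i in range(n)]
```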

Settings. We use the datasets and settings from Section 3.2. We cluster organizations (Step (2)) using various algorithms, i.e., Agglomerative Clustering, k-means, and k-NN. Moreover, for the IP2IP method, we only consider the top-1000 attackers (i.e., the top-1000 heavy hitters) in each cluster, for each 5-day training-set window – rather than looking for correlations over the whole /24 IP space – and for each IP we extract 50 correlated ones. Finally, as in previous experiments, we use the same smoothing coefficient α for EWMA.
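The heavy-hitter pre-selection for IP2IP can be sketched with a simple frequency count (illustrative only; the `top` parameter corresponds to the top-1000 cutoff above):

```python
from collections import Counter

def heavy_hitters(cluster_reports, top=1000):
    """cluster_reports: iterable of attacker subnets reported by cluster
    members over the training window (with repetitions). Returns the
    `top` most frequently reported subnets."""
    return [ip for ip, _ in Counter(cluster_reports).most_common(top)]
```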

Figure 4: Agglomerative Clustering: (a) TP improvement, (b) FP increase (y-axis in log scale), (c) FN increase, (d) TPR, (e) Precision, (f) F1 measure.
Figure 5: k-means: (a) TP improvement, (b) FP increase (y-axis in log scale), (c) FN increase, (d) TPR, (e) Precision, (f) F1 measure.
Figure 6: k-NN: (a) TP improvement, (b) FP increase (y-axis in log scale), (c) FN increase, (d) TPR, (e) Precision, (f) F1 measure.

5.2 Results

We now present the results of our extensive experimental evaluation. As we use different clustering algorithms, we refer the reader to Appendix B for a brief overview of them. Our experiments are written in Python, using the scikit-learn machine learning suite, and will be made available with the final version of the paper.

Agglomerative Clustering. We consider different numbers of desired clusters k, ranging from 1 to 35, setting the affinity and linkage parameters to cosine and average, respectively, to indicate which distance measures to use between sets of observations. In Figs. 4(a)–4(c), we plot the average ΔTP, ΔFP, and ΔFN with increasing numbers of clusters. Unsurprisingly, IP2IP achieves a smaller ΔTP than global, which however incurs a higher ΔFP and, above all, ΔFN. Fig. 4(d) shows that recall (TPR) always improves when sharing, with intersection reaching the highest values. When combining intersection and IP2IP, TPR slowly degrades with smaller clusters, while, with IP2IP alone, it increases. Because of the increase in FN, global performs worse in terms of recall, although it obtains the best ΔTP. From Fig. 4(e), we observe that local yields the best precision, followed by intersection (slightly growing for larger k), while IP2IP and IP2IP+intersection slowly increase; global performs poorly overall due to its high FP. Finally, Fig. 4(f) plots the F1 measure: intersection achieves slightly better scores than local, while its combination with IP2IP, or IP2IP alone, is slightly worse.

k-means. Next, we use k-means for clustering and obtain results similar to agglomerative clustering. Thus, we decide to restrict to stronger correlations, by only taking into account organizations closer to the cluster’s centroid and excluding the rest as outliers. We set a distance threshold and experiment with it empirically, finding that the optimal setting is the 40th percentile, i.e., the cluster distance value below which 40% of the organizations can be found. Figs. 5(a)–5(c) plot the average improvement in TP and increase in FP and FN, respectively. ΔTP is almost constant with IP2IP, independent of the cluster sizes, while with the other methods it decreases faster due to the distance thresholds. IP2IP also shows steady ΔFN values (i.e., a decrease) compared to the other methods, which leads to a better TPR performance for smaller clusters, as shown in Fig. 5(d). Furthermore, intersection yields the best performance in ΔFN. Fig. 5(f) shows that the best F1 measure is reached with intersection, due to a peak in both PPV and TPR. IP2IP performs slightly worse than local, while the poor F1 values for global are due to its bad PPV – see Fig. 5(e).

k-NN. Recall that k indicates the number of nearest neighbors that each entity considers as its most similar ones; thus, organizations can end up in more than one neighborhood. Since the algorithm builds a neighborhood for each organization, not all clusters have the same strength, so we only consider strong clusters in terms of their members’ similarity: as done with k-means, after tuning the parameters, we set a distance threshold at the 40th percentile to leave possible outliers out of the clusters. From Fig. 6(a), we observe that IP2IP+intersection yields the second best ΔTP performance, with global peaking highest. In terms of ΔFP, IP2IP doubles the false positives for large k, while intersection achieves the lowest value. As with the previous clustering algorithms, we notice that intersection yields the best decrease in FN. Intersection also achieves the highest TPR with larger cluster sizes, while its combination with IP2IP reduces it – see Fig. 6(d). Fig. 6(e) shows that intersection has the best PPV, similar to local, while IP2IP performs worse due to its higher ΔFP (almost doubling the FP for large k). Finally, from Fig. 6(f), note that intersection yields the highest F1.

Setting: Max F1 [Sharing Intersection]
Clustering        k    Avg. Size   #Coll.   TPR    PPV    ΔFN     F1
Agglomerative    15    4.6         700      0.72   0.16   -0.42   0.27
k-means           5    5.8         280      0.73   0.19   -0.32   0.30
k-NN             15    6           240      0.74   0.19   -0.37   0.30
Table 1: Best Cases of our Experiments for F1.
Setting: Max TPR [Sharing Intersection]
Clustering        k    Avg. Size   #Coll.   TPR    PPV    ΔFN     F1
Agglomerative     1    70          700      0.76   0.15   -0.53   0.25
k-means           5    5.8         280      0.73   0.19   -0.32   0.30
k-NN             35    14          320      0.77   0.17   -0.49   0.28
Table 2: Best Cases of our Experiments for TPR.
Setting: Max PPV [Sharing Intersection]
Clustering        k    Avg. Size   #Coll.   TPR    PPV    ΔFN     F1
Agglomerative    25    2.8         700      0.69   0.16   -0.35   0.26
k-means           5    5.8         280      0.73   0.19   -0.32   0.30
k-NN             15    6           240      0.74   0.19   -0.37   0.30
Table 3: Best Cases of our Experiments for PPV.
Setting: Max TP Improvement [Sharing IP2IP+Intersection]
Clustering        k    Avg. Size   #Coll.   TPR    PPV    ΔFN     F1
Agglomerative     1    70          700      0.67   0.11   -0.08   0.19
k-means           1    28          270      0.64   0.11   -0.17   0.18
k-NN             35    14          320      0.71   0.14   -0.19   0.23
Table 4: Best Cases of our Experiments for ΔTP.

5.3 Discussion

We summarize the best results for each clustering algorithm, in terms of highest F1, recall, precision, and ΔTP, in Tables 1–4. We note that intersection is the sharing mechanism that maximizes all metrics except ΔTP, which is instead maximized with IP2IP+intersection. Both k-means and k-NN peak at 0.30 in F1, including, respectively, 280 and 240 collaborators over all time windows. Agglomerative clustering involves all contributors and achieves 0.27. k-NN (k = 35) yields the best results for TPR (0.77), while both k-NN (k = 15) and k-means (k = 5) achieve 0.19 in PPV. In terms of ΔTP, k-means reaches its maximum with k = 1 and clusters of size 28 on average, selecting 270 collaborators overall. Slightly lower improvements are achieved with the other clustering algorithms, but with more collaborators benefiting from sharing, as well as fewer FP.

Main take-aways. Data sharing always helps organizations forecast attacks, compared to performing predictions locally. Predicting based on all data from collaborators (global) yields the highest improvement in ΔTP – especially for bigger clusters – but with a dramatic increase in ΔFP. When organizations share correlated attacks (IP2IP), we observe a steady ΔTP, while sharing common attacks (intersection) outperforms the former when bigger clusters are formed. However, intersection introduces a lower ΔFP, ultimately leading to better precision and F1 measures. IP2IP+intersection always outperforms the two separate methods in terms of ΔTP; thus, it is the recommended strategy if one only wants to maximize the number of predicted attacks.

Impact of cluster size. With agglomerative clustering, each organization is assigned to exactly one cluster and thus participates in, and benefits from, collaboration. We observe higher TPR for bigger clusters and, generally, a stable improvement in TP on average. Similar results are obtained with k-means when all organizations are assigned to clusters. However, when we set a distance threshold, creating more consistent clusters, we observe fluctuations in TPR: as clusters get smaller much faster (relative to the threshold value), IP2IP starts outperforming intersection. This indicates that correlated attacks can improve organizations' knowledge and enhance their local predictions, especially in smaller clusters. With k-NN, a different behavior is observed: for smaller clusters, IP2IP achieves higher TPR but, as clusters get bigger, intersection yields the best results. Overall, collaborating in big clusters leads to high TPR, but at the same time it introduces a significant number of FP.

Increase/Improvement in TP/FP/FN. We also find that, for all clustering algorithms, maximizing the TP improvement always leads to higher FP, from k-NN at the low end up to agglomerative at the high end. The settings that maximize the F1 measure, TPR, and PPV (when sharing intersection) also minimize FN. In general, we observe that (privacy-friendly) collaboration does yield a remarkable increase in TP but also in FP, which results in a limited improvement in F1 score compared to predicting using local logs only. However, as discussed earlier, note that we count FP in a conservative way, and that our main goal is really to measure the effect of different collaboration strategies on the prediction (as well as to compare to state-of-the-art CPB techniques [16, 33]), seeking to improve TP while keeping the increase in FP as low as possible.
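For reference, the metrics discussed above can be computed from raw counts as follows. This is a minimal helper of our own (not part of the system); the counts in the usage example are made up to show how collaboration can raise TP while still hurting precision.

```python
def prediction_metrics(tp: int, fp: int, fn: int) -> dict:
    """Compute recall (TPR), precision (PPV), and F1 from raw prediction counts."""
    tpr = tp / (tp + fn) if tp + fn else 0.0   # recall: correctly predicted / actual attacks
    ppv = tp / (tp + fp) if tp + fp else 0.0   # precision: correctly predicted / all predicted
    f1 = 2 * tpr * ppv / (tpr + ppv) if tpr + ppv else 0.0
    return {"TPR": round(tpr, 2), "PPV": round(ppv, 2), "F1": round(f1, 2)}

# Illustrative counts: collaboration nearly doubles TP but triples FP,
# so recall rises while precision drops.
local = prediction_metrics(tp=100, fp=200, fn=300)
collab = prediction_metrics(tp=180, fp=600, fn=220)
```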

5.4 Comparison to Soldo et al. and Freudiger et al.

We observe that [33] achieves a higher maximum TP improvement than our hybrid approach. However, our privacy-preserving techniques outperform [33] both in terms of recall (TPR) and precision (PPV). For instance, with k-NN, our system yields a TPR of 0.77. Similarly, with k-means, our system's PPV reaches 0.19, whereas [33]'s best precision is appreciably lower. As a result, our hybrid model yields larger (up to 2x) F1 scores than [33].

Moreover, our novel hybrid approach yields better results than [16]-(A) in terms of TP improvement and TPR. For example, the TP improvement of [16]-(A) reaches only up to 0.13 (top 3% of global pairs), while in our system it reaches 0.61 with k-means (i.e., up to 4x improved hit counts). Likewise, we achieve a TPR of 0.77 (e.g., with k-NN), while the best TPR of [16]-(A) is 0.66 (top 1% of global pairs). Although the F1 scores achieved in both cases are similar (0.30 for our hybrid system vs. 0.29 for [16]-(A)), with our model more organizations benefit from collaboration. Finally, we observe that our hybrid approach yields fairly similar results to [16]-(B) in terms of TPR, PPV, and F1. Nevertheless, our system achieves a better TP improvement than [16]-(B) (0.61 for k-means vs. 0.45 with 35 local pairs), since not only common but also correlated attacks are shared within the clusters.

6 Implementing At Scale

As discussed above, our system involves four steps: (1) secure computation of pairwise similarity, (2) clustering, (3) secure data sharing within the clusters, and (4) time-series prediction. To assess its scalability, we need to evaluate the computation and communication complexities incurred by each step. Naturally, (1) and (3) dominate, as they require running a number of cryptographic protocols that depends on the number of organizations involved. In fact, clustering incurs a negligible overhead: on commodity hardware, clustering 1,000 organizations completes quickly for k-means, agglomerative, and k-NN alike. Also, time-series EWMA prediction is extremely fast, scaling linearly in the number of IPs.
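For step (4), a minimal EWMA forecaster over per-IP attack counts might look as follows. This is our own sketch: the smoothing factor, threshold, and IP prefixes are illustrative choices, not the paper's parameters.

```python
def ewma_forecast(counts, alpha=0.9):
    """Exponentially weighted moving average over a time series of attack counts.

    Returns the smoothed value after the last observation, used as the
    prediction score for the next time window.
    """
    score = 0.0
    for c in counts:
        score = alpha * c + (1 - alpha) * score
    return score

def predict_attackers(history, threshold=0.5):
    """history: {ip_prefix: [per-window attack counts]}.

    Returns the prefixes forecast to attack in the next window.
    """
    return {ip for ip, counts in history.items()
            if ewma_forecast(counts) > threshold}

# Recent activity dominates: a source active in the last windows is predicted,
# one that went quiet is not.
blacklist = predict_attackers({"1.2.3.0/24": [0, 3, 5], "9.9.9.0/24": [1, 0, 0]})
```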

As we compute pairwise similarity based on the number of common attacks between two organizations, and support its secure computation via PSI-CA [11], step (1) requires a number of protocol runs quadratic in the number of organizations. In our experiments (see details below), a single protocol execution, using 2048-bit moduli with sets of size 4,000 (the average number of attacks observed by each organization), already incurs a noticeable running time and MB-scale bandwidth. As for (3), i.e., secure within-cluster sharing of events related to common attacks (intersection), we rely on PSI-DT [12], which incurs comparable costs for a single execution with the same settings. Thus, complexities may quickly become prohibitive when more organizations are involved or more alerts are used.

Server-aided Secure Computation. Aiming to improve scalability, we introduce a variant supporting secure computation of pairwise similarity as well as secure log sharing without a quadratic number of public-key operations/quadratic communication overhead. Recall that we rely on a semi-trusted authority, STA, for clustering and coordination, which is assumed to follow protocol specifications and not to collude with other organizations, thus, we can actually use it to also help with secure computations. Inspired by Kamara et al.’s server-aided PSI [21], we extend our framework by replacing public-key cryptography operations with pseudo-random permutations (PRP), which we instantiate using AES. Specifically, we minimize interactions among pairs of organizations so that the complexity incurred by each of them is constant, while only imposing a minimal, linear communication overhead on STA.

Our extension involves four phases: (1) setup, where, as in [21], one organization generates a random key and sends it to the other organizations; (2) encryption (Algorithm 1), where each organization evaluates the PRP on each entry in its set and encrypts the associated timestamps; (3) O2O computation (Algorithm 2), where STA computes the magnitude of common attacks between each pair of organizations in order to perform clustering; and (4) log sharing (Algorithm 3), where organizations in the same cluster receive information about common attacks (encrypted timestamps).

  for each organization O_i do
     L_i ← ∅, Buff_i ← ∅
     for each s_j ∈ S_i do
        l_j ← PRP_K(s_j); L_i ← L_i ∪ {l_j}
        for z ← 1 to |T_j| do
           Buff_i ← Buff_i ∪ {(l_j, Enc_K(t_z))}
     Send (L_i, Buff_i) to STA and store K
Algorithm 1 Encryption [All Organizations]
  for each organization O_i do
     for each organization O_j, j ≠ i do
        O2O[i][j] ← |L_i ∩ L_j|
  Perform Clustering on O2O
  Send relevant Buff entries to organizations in the same cluster
Algorithm 2 O2O Computation [STA]
  for each organization O_i ∈ C do
     for each organization O_j ∈ C, j ≠ i do
        for each (l, e) ∈ Buff_j with l ∈ L_i do
           t ← Dec_K(e)  (timestamp of a common attack)
Algorithm 3 Log Sharing [Organizations in C]

Note that building the O2O matrix is optimized using hash tables (i.e., dense_hash_set and dense_hash_map from Sparsehash [19]). Also, since sets in our system are multi-sets, we concatenate counters to the IP addresses, so that the STA cannot tell which, and how many, IPs appear more than once.
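To make the encryption and O2O phases concrete, the sketch below mimics the label-based computation in plain Python. All names are ours, and HMAC-SHA256 stands in for the paper's AES-based PRP; note how a per-IP counter is concatenated before evaluating the keyed function, so that repeated IPs yield unlinkable labels.

```python
import hashlib
import hmac
from collections import Counter
from itertools import combinations

def encrypt_set(key: bytes, ips):
    """Encryption phase: map a multiset of attacker IPs to opaque labels.

    A counter is appended to each IP before keying, so the STA cannot tell
    which (or how many) IPs appear more than once.
    """
    labels = set()
    seen = Counter()
    for ip in ips:
        seen[ip] += 1
        msg = f"{ip}|{seen[ip]}".encode()
        labels.add(hmac.new(key, msg, hashlib.sha256).digest())
    return labels

def o2o_matrix(label_sets):
    """O2O phase, run by the STA: pairwise magnitude of common attacks."""
    o2o = {}
    for (i, a), (j, b) in combinations(label_sets.items(), 2):
        o2o[(i, j)] = len(a & b)
    return o2o

key = b"shared-prp-key"   # distributed once during setup
sets = {"orgA": encrypt_set(key, ["1.1.1.1", "2.2.2.2", "2.2.2.2"]),
        "orgB": encrypt_set(key, ["2.2.2.2", "3.3.3.3"])}
common = o2o_matrix(sets)   # {("orgA", "orgB"): 1}
```

The STA only ever sees keyed labels, so it can count matches (and cluster on them) without learning the underlying IPs, matching the intent of the server-aided design.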

Figure 7: Computation (a) and communication (b) overhead at each organization for PSI-CA, PSI-DT, and PRP-based scheme, and communication overhead at the STA in PRP scheme (c).

Experimental Evaluation. To fully grasp the scalability of the server-aided extension, and to compare it to "traditional" PSI-CA and PSI-DT, we report execution times for an increasing number of participating organizations. We benchmark the performance of PSI-CA [11] and PSI-DT [12] using 2048-bit moduli, modifying the OpenSSL/GMP-based C implementation of [13], as well as the PRP-based scheme presented above, inspired by Kamara et al.'s work [21]. Experiments are run using two 2.3GHz Intel Core i5 CPUs with 8GB of RAM, connected via an Ethernet link.

Figures 7(a) and 7(b) plot the computation and communication complexities incurred by an individual organization vis-à-vis the total number of organizations involved in the system, while Fig. 7(c) reports the communication overhead introduced on the STA side for the PRP scheme. Observe that the complexities of the PSI-CA/PSI-DT protocols at each organization grow linearly in the number of organizations (hence, quadratically overall). For instance, if 1,000 organizations are involved, it would take about 16 minutes per organization, each transmitting 1GB. By contrast, the PRP-based scheme incurs constant complexities at each organization (KB-scale traffic) and an appreciably low communication overhead at the STA (about 100MB) for 1,000 organizations.

IP2IP. We also evaluate the IP2IP method, whereby organizations interact with the STA in order to discover cluster-wide correlated attacks. Assuming clusters of 100 organizations and an IP2IP matrix covering the whole /24 IP space, we measure a modest running time and KB-scale bandwidth per organization, as well as MB-scale bandwidth at the STA. Recall that we use the private Count-Min-sketch-based implementation by Melis et al. [27], which results in the private aggregation of 10,336 elements. Note that, even if clusters are bigger than 100, as detailed in [27], one can still perform private aggregation on multiple subgroups (e.g., of size 100) without endangering organizations' privacy.

Security. Our system does not leak any information about each organization's logs to the STA, with or without the server-aided variant. Clustering is performed over similarity measures computed obliviously to the STA, and so is within-cluster data sharing. Privacy-preserving computation relies on existing secure protocols: PSI-CA/PSI-DT by De Cristofaro et al. [11, 12], server-aided PSI by Kamara et al. [21], and private recommendation via succinct sketches by Melis et al. [27]. Therefore, we do not provide additional proofs in the paper, as the security of our techniques follows directly from that of these protocols.

7 Conclusion

In this paper, we first presented the results of a comprehensive measurement study of collaborative predictive blacklisting (CPB) techniques, specifically, one relying on a trusted central party, by Soldo et al. [33], and another using privacy-preserving data sharing, by Freudiger et al. [16]. Then, we introduced and evaluated a novel, hybrid approach that improves on both. Our experiments evaluated correct and incorrect predictions, as well as the real-world impact of collaboration (e.g., the improvement in true positives and the increase in false positives/negatives), using a real-world dataset of alerts. We found that, overall, having access to more (attack) logs does not necessarily result in better predictions: the approach proposed by Soldo et al. [33], although considered the state of the art in CPB, achieves high hit counts (almost doubling the number of correct predictions) but also suffers from very poor precision due to very high FP. On the other hand, the privacy-friendly decentralized system proposed by Freudiger et al. [16] achieves better accuracy than [33], but with a much smaller improvement in TP. Moreover, their system does not scale due to its peer-to-peer nature.

Our analysis also shows that our novel hybrid approach manages to outperform [33] in terms of accuracy (up to 2x) and [16] in terms of hit counts (up to 4x), while maintaining an acceptable level of privacy and achieving high scalability. As part of future work, we plan to conduct a longitudinal measurement study to fully grasp the effectiveness of privacy-enhanced CPB in the wild, as well as to study other collaborative security problems, e.g., in the context of spam, malware samples, and DNS poisoning.


  • [1] U.S. Anti-Bot Code of Conduct for Internet service providers: Barriers and Metrics Considerations, 2013.
  • [2] Facebook ThreatExchange, 2015.
  • [3] S. Ackerman. Privacy experts question Obama’s plan for new agency to counter cyber threats. The Guardian, 2015.
  • [4] B. Applebaum, H. Ringberg, M. Freedman, M. Caesar, and J. Rexford. Collaborative, privacy-preserving data aggregation at scale. In PETS, 2010.
  • [5] M. Burkhart, M. Strasser, D. Many, and X. Dimitropoulos. SEPIA: Privacy-Preserving Aggregation of Multi-Domain Network Events and Statistics. In USENIX Security, 2010.
  • [6] CERT UK. Cyber-security Information Sharing Partnership (CiSP), 2015.
  • [7] D. Chakrabarti, S. Papadimitriou, D. S. Modha, and C. Faloutsos. Fully automatic cross-associations. In ACM KDD, 2004.
  • [8] G. Cormode and S. Muthukrishnan. An Improved Data Stream Summary: The Count-Min Sketch and Its Applications. Journal of Algorithms, 2005.
  • [9] S. E. Coull, C. V. Wright, F. Monrose, M. P. Collins, M. K. Reiter, et al. Playing Devil’s Advocate: Inferring Sensitive Information from Anonymized Network Traces. In NDSS, 2007.
  • [10] A. Davidson, G. Fenn, and C. Cid. A model for secure and mutually beneficial software vulnerability sharing. In ACM Workshop on Information Sharing and Collaborative Security, 2016.
  • [11] E. De Cristofaro, P. Gasti, and G. Tsudik. Fast and Private Computation of Cardinality of Set Intersection and Union. In CANS, 2012.
  • [12] E. De Cristofaro and G. Tsudik. Practical private set intersection protocols with linear complexity. In Financial Cryptography and Data Security, 2010.
  • [13] E. De Cristofaro and G. Tsudik. Experimenting with fast private set intersection. In TRUST, 2012.
  • [14] M. Felegyhazi, C. Kreibich, and V. Paxson. On the potential of proactive domain blacklisting. In LEET, 2015.
  • [15] M. Freedman, K. Nissim, and B. Pinkas. Efficient private matching and set intersection. In Eurocrypt, 2004.
  • [16] J. Freudiger, E. De Cristofaro, and A. Brito. Controlled Data Sharing for Collaborative Predictive Blacklisting. In DIMVA, 2015.
  • [17] R. Garrido-Pelaz, L. González-Manzano, and S. Pastrana. Shall we collaborate?: A model to analyse the benefits of information sharing. In ACM Workshop on Information Sharing and Collaborative Security, 2016.
  • [18] O. Goldreich. Foundations of Cryptography, chapter 7.2.2. Cambridge Univ Press, 2004.
  • [19] D. Hide. Sparsehash, 2013.
  • [20] Y. Ishai, J. Kilian, K. Nissim, and E. Petrank. Extending oblivious transfers efficiently. In CRYPTO, 2003.
  • [21] S. Kamara, P. Mohassel, M. Raykova, and S. Sadeghian. Scaling private set intersection to billion-element sets. In Financial Cryptography and Data Security. 2014.
  • [22] S. Katti, B. Krishnamurthy, and D. Katabi. Collaborating against common enemies. In ACM IMC, 2005.
  • [23] L. Kissner and D. Song. Privacy-Preserving Set Operations. In CRYPTO, 2005.
  • [24] K. Kursawe, G. Danezis, and M. Kohlweiss. Privacy-friendly Aggregation for the Smart-grid. In Privacy Enhancing Technologies, 2011.
  • [25] K. Lakkaraju and A. Slagell. Evaluating the utility of anonymized network traces for intrusion detection. In SecureComm, 2008.
  • [26] Y. Liu, A. Sarabi, J. Zhang, P. Naghizadeh, M. Karir, M. Bailey, and M. Liu. Cloudy with a Chance of Breach: Forecasting Cyber Security Incidents. In USENIX Security, 2015.
  • [27] L. Melis, G. Danezis, and E. De Cristofaro. Efficient Private Statistics with Succinct Sketches. In NDSS, 2016.
  • [28] S. Nagaraja, P. Mittal, C.-Y. Hong, M. Caesar, and N. Borisov. Botgrep: Finding p2p bots with structured graph analysis. In USENIX Security, 2010.
  • [29] B. Pinkas, T. Schneider, and M. Zohner. Faster Private Set Intersection based on OT Extension. In USENIX Security, 2014.
  • [30] P. Porras and V. Shmatikov. Large-scale collection and sanitization of network security data: risks and challenges. In NSPW, 2006.
  • [31] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based Collaborative Filtering Recommendation Algorithms. In WWW, 2001.
  • [32] M. Sirivianos, K. Kim, and X. Yang. Socialfilter: Introducing social trust to collaborative spam mitigation. In INFOCOM, 2011.
  • [33] F. Soldo, A. Le, and A. Markopoulou. Predictive blacklisting as an implicit recommendation system. In INFOCOM, 2010.
  • [34] The White House. Executive order promoting private sector cybersecurity information sharing, 2015.
  • [35] B. Woods, S. J. Perl, and B. Lindauer. Data mining for efficient collaborative information discovery. In ACM Workshop on Information Sharing and Collaborative Security, 2015.
  • [36] J. Zhang, P. A. Porras, and J. Ullrich. Highly predictive blacklisting. In USENIX Security, 2008.

Appendix A Cryptography Background

Adversarial Model. We use standard security models for secure two-party computation and consider semi-honest adversaries. In the rest of this paper, the term adversary refers to insiders, i.e., protocol participants. Outside adversaries are not considered, since their actions can be mitigated via standard network security techniques. Following the definitions in [18], protocols secure in the presence of semi-honest adversaries assume that parties faithfully follow all protocol specifications and do not misrepresent any information related to their inputs, e.g., size and content. However, during or after protocol execution, any party might (passively) attempt to infer additional information about the other party’s input. This model is formalized by considering an ideal implementation where a trusted third party (TTP) receives the inputs of both parties and outputs the result of the defined function. Security in the presence of semi-honest adversaries requires that, in the real implementation of the protocol (without a TTP), each party does not learn more information than in the ideal implementation. Finally, we assume that parties do not collude with each other to recover other participants’ inputs.

Private Set Intersection (PSI): a cryptographic protocol between two parties, a server and a client, on input sets S and C, respectively. At the end, the client learns S ∩ C. There are several PSI instantiations, with different complexities and cryptographic assumptions, ranging from those based on Oblivious Polynomial Evaluation (OPE) [15], to linear-complexity protocols based on Oblivious PseudoRandom Functions (OPRFs) [12], as well as optimized garbled circuits [29] leveraging Oblivious Transfer Extension [20]. Naturally, the PSI definition above implies that only one party (the client) learns the set intersection; however, in the semi-honest model, PSI can trivially be turned into a “mutual PSI” [23] (i.e., both parties learn the intersection) by executing PSI twice with inverted roles.

Private Set Intersection Cardinality (PSI-CA): a cryptographic protocol between two parties, a server and a client, on input sets S and C, respectively. At the end, the client learns |S ∩ C|. That is, PSI-CA is a more stringent version of PSI, as the client only learns how many items are in the intersection. While it is possible to modify garbled-circuit-based PSI constructions to support PSI-CA [29], to the best of our knowledge there is no available description of the corresponding circuit or ready-to-use implementation; therefore, we use the special-purpose PSI-CA protocol from [11]. This protocol is secure in the Random Oracle Model, under the One-More Diffie-Hellman assumption, in the presence of semi-honest adversaries. It incurs communication and computational complexities linear in the size of the sets: the parties exchange a linear number of group elements and compute a linear number of modular exponentiations with short exponents. Similar to PSI, in the semi-honest model, two executions of PSI-CA with inverted roles yield a mutual PSI-CA where both parties learn the cardinality of the set intersection.

PSI with Data Transfer (PSI-DT): a cryptographic protocol between a server and a client, on input a set S (with data associated to each item) and a set C, respectively. At the end, the client obtains the items in S ∩ C along with their associated data records. In other words, the client not only learns which items are in the intersection, but also gets the related data. Special-purpose protocols for PSI-DT have been proposed [15, 12], but we do not know of any available garbled-circuit-based instantiation. Hence, we use the PSI-DT protocol described in [12], secure in the Random Oracle Model, under the One-More RSA assumption, in the presence of semi-honest adversaries. It incurs communication and computational complexities linear in the size of the sets: the parties exchange a linear number of group elements and compute a linear number of RSA-CRT exponentiations and modular multiplications if one picks a small RSA public exponent (e.g., 3 or 17). Once again, in the semi-honest model, two executions of PSI-DT with inverted roles trivially yield a mutual PSI-DT where both parties learn the intersection.

Server-aided PSI [21]. In [21], Kamara et al. propose a server-aided PSI relying on a semi-honest server: during a setup phase, the parties jointly generate a secret key K for a pseudorandom permutation (PRP). Each party then randomly permutes its set by evaluating the PRP, under K, on each of its elements, and sends the resulting labels to the server. The server then computes the intersection of the labels and returns it to all the parties. Finally, each party recovers the actual items by inverting the PRP over the labels in the intersection. The protocol is secure in the presence of a semi-honest server and honest parties, or an honest server and any collusion of malicious parties, provided that the PRP is secure.
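A minimal sketch of this label-based flow is below. It is our own illustration, not Kamara et al.'s implementation: HMAC-SHA256 stands in for the keyed PRP, and each party keeps a local label-to-item map rather than inverting the permutation algebraically (functionally equivalent for this sketch).

```python
import hashlib
import hmac

def labels(key: bytes, items):
    """Each party permutes its set under the shared key; keep label -> item locally."""
    return {hmac.new(key, x.encode(), hashlib.sha256).digest(): x for x in items}

def server_aided_psi(key: bytes, set_a, set_b):
    la, lb = labels(key, set_a), labels(key, set_b)
    common = la.keys() & lb.keys()   # computed by the semi-honest server on labels only
    return {la[l] for l in common}   # parties map labels back to items locally

shared = server_aided_psi(b"setup-key",
                          {"198.51.100.1", "203.0.113.9"},
                          {"203.0.113.9", "192.0.2.7"})
# shared == {"203.0.113.9"}
```

The server sees only opaque labels, so (absent the key) it learns the intersection's size and pattern but not the underlying items.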

Efficient Private Recommendation via Succinct Sketches [27]. A privacy-friendly recommender system based on Item-KNN [31] has been introduced by Melis et al. [27]. Their construction involves a “tally” server (the BBC in their application example) and a set of users (visitors of BBC’s broadcasting site iPlayer). The main goal of their system is to train the recommender system using only aggregate statistics. Specifically, they build a global matrix of co-views (i.e., pairs of programs watched by the same user) in a privacy-preserving way, relying on (i) private data aggregation based on secret sharing (inspired by [24]), and (ii) Count-Min sketches [8] to reduce the computation/communication overhead from linear to logarithmic in the size of the matrix, trading off an upper-bounded error for increased efficiency.
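The secret-sharing-based aggregation of (i) can be illustrated with a minimal additive scheme, our own sketch: each private value is split into random shares that individually look uniform, and the tally server only learns the sum once all shares are combined ([24] additionally handles share distribution and blinding across users).

```python
import random

def share(value: int, n: int, modulus: int = 2**32):
    """Split `value` into n additive shares mod `modulus`.

    Any n-1 shares are uniformly random; only the full sum reveals the value.
    """
    shares = [random.randrange(modulus) for _ in range(n - 1)]
    shares.append((value - sum(shares)) % modulus)
    return shares

def aggregate(all_shares, modulus: int = 2**32):
    """Sum shares (e.g., collected across users) and reduce mod `modulus`."""
    return sum(all_shares) % modulus

# Each user splits its private co-view count; the tally server only learns
# the aggregate once every share has been combined.
counts = [3, 7, 2]                                       # private per-user counts
shared = [share(c, n=3) for c in counts]
total = aggregate(x for user in shared for x in user)    # == sum(counts) == 12
```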

If N denotes the number of items, the compact representation of the IP2IP matrix through the Count-Min sketch has size logarithmic, rather than linear, in N. More precisely, given parameters (ε, δ), the Count-Min sketch is a matrix of size d × w, where w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉. Melis et al. [27] choose these parameters so that the resulting sketch remains compact for the matrix sizes used in their experiments.

These parameters give an upper-bounded error for the estimated counters: each estimate x̂_i satisfies x̂_i ≤ x_i + ε · ‖x‖₁ with probability at least 1 − δ, where x_i is the true count. As demonstrated empirically by Melis et al. [27], this error ultimately has a negligible impact on the accuracy of the aggregation as well as of the recommendation. Finally, the computational overhead introduced by the cryptographic operations for private aggregation, as demonstrated experimentally in [27], is in the order of seconds even with thousands of items.
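To make the parameters concrete, a minimal (non-private) Count-Min sketch can be implemented as follows. This is our own sketch: salted SHA-256 stands in for the pairwise-independent hash families analyzed in [8].

```python
import hashlib
import math

class CountMinSketch:
    def __init__(self, epsilon: float, delta: float):
        self.w = math.ceil(math.e / epsilon)     # width: controls additive error
        self.d = math.ceil(math.log(1 / delta))  # depth: controls failure probability
        self.table = [[0] * self.w for _ in range(self.d)]

    def _index(self, row: int, item: str) -> int:
        h = hashlib.sha256(f"{row}|{item}".encode()).digest()
        return int.from_bytes(h[:8], "big") % self.w

    def add(self, item: str, count: int = 1):
        for r in range(self.d):
            self.table[r][self._index(r, item)] += count

    def estimate(self, item: str) -> int:
        # Overestimates by at most epsilon * total_count, w.p. >= 1 - delta.
        return min(self.table[r][self._index(r, item)] for r in range(self.d))

cms = CountMinSketch(epsilon=0.01, delta=0.01)
cms.add("192.0.2.0/24", 5)
estimate = cms.estimate("192.0.2.0/24")   # == 5 here (no colliding items)
```

Collisions only ever inflate counters, which is why the minimum across rows is a one-sided estimate.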

Appendix B Clustering Algorithms

Agglomerative Clustering. Hierarchical clustering algorithms build nested clusters by merging or splitting them successively. The hierarchy is represented as a tree, with the root being the unique cluster that gathers all the samples, and the leaves being the clusters containing only one sample. Agglomerative clustering performs hierarchical clustering using a bottom-up approach: each observation starts in its own cluster, and clusters are successively merged together. Different linkage criteria determine the metric used for merging: e.g., average linkage minimizes the average of the distances between all observations of pairs of clusters, while complete linkage minimizes the maximum distance between the observations of pairs of clusters.


k-means Clustering. k-means separates samples into groups of equal variance, minimizing the inertia, i.e., the within-cluster sum of squares. The algorithm requires the number of clusters k to be specified, as it divides a set of n samples into k disjoint clusters, each described by the mean of the samples in the cluster. The means are commonly called the cluster “centroids”, and the algorithm chooses centroids that minimize the inertia. The algorithm includes three steps: (1) choosing the initial centroids, often by picking k samples from the dataset; (2) assigning each sample to its nearest centroid; and (3) creating new centroids by taking the mean value of all samples assigned to each previous centroid. The algorithm loops between (2) and (3) until the difference between the old and the new centroids is below a threshold.
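The three steps can be sketched in plain Python on 1-D data. This is our own illustrative example (Lloyd's algorithm with exact-convergence stopping), not the clustering configuration used in Section 5.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm on 1-D points: init, assign, update until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                      # step (1): initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                   # step (2): nearest centroid
            clusters[min(range(k), key=lambda i: abs(p - centroids[i]))].append(p)
        new = [sum(c) / len(c) if c else centroids[i]      # step (3): recompute means
               for i, c in enumerate(clusters)]
        if new == centroids:                               # stop once centroids are stable
            break
        centroids = new
    return sorted(centroids)

centers = kmeans([1.0, 1.1, 0.9, 10.0, 10.2, 9.8], k=2)   # two well-separated groups
```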

k-Nearest Neighbors (k-NN). k-NN is a simple machine learning algorithm that finds a predefined number of training samples closest in distance to a new sample. The number of neighbors k can be a user-defined constant, and the distance can be any metric: standard Euclidean distance is the most common choice. In Section 5, we employ unsupervised k-NN to identify, for each organization, the organizations most similar to it.
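With a precomputed pairwise-similarity map, unsupervised neighbor selection reduces to a one-liner. This is our own sketch with hypothetical similarity values, not the paper's data.

```python
def k_nearest(org, others, similarity, k=3):
    """Return the k organizations most similar to `org` (unsupervised k-NN).

    `similarity[a][b]` is a precomputed score, e.g., the (securely computed)
    magnitude of common attacks between a and b.
    """
    return sorted(others, key=lambda o: similarity[org][o], reverse=True)[:k]

sim = {"A": {"B": 0.9, "C": 0.2, "D": 0.7}}   # hypothetical similarity scores
neighbors = k_nearest("A", ["B", "C", "D"], sim, k=2)   # ["B", "D"]
```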