Repo for the Code of the paper titled "On Collaborative Predictive Blacklisting"
Collaborative predictive blacklisting (CPB) makes it possible to forecast future attack sources based on logs and alerts contributed by multiple organizations. Unfortunately, research on CPB has only focused on increasing the number of predicted attacks and has not considered the impact on false positives and false negatives. Moreover, sharing alerts is often hindered by confidentiality, trust, and liability issues, which motivates the need for privacy-preserving approaches to the problem. In this paper, we present a measurement study of state-of-the-art CPB techniques, aiming to shed light on the actual impact of collaboration. To this end, we reproduce and measure two systems: a non privacy-friendly one that uses a trusted coordinating party with access to all alerts (Soldo et al., 2010) and a peer-to-peer one using privacy-preserving data sharing (Freudiger et al., 2015). We show that, while collaboration boosts the number of predicted attacks, it also yields high false positives, ultimately leading to poor accuracy. This motivates us to present a hybrid approach, using a semi-trusted central entity, aiming to increase utility from collaboration while, at the same time, limiting information disclosure and false positives. This leads to a better trade-off of true and false positive rates, while at the same time addressing privacy concerns.
Filtering connections from/to malicious hosts is often used to reduce network attacks and their impact. Due to the impossibility of performing expensive computations in real-time on each connection, filtering is usually done via simple look-ups, using periodically updated lists of suspicious hosts, i.e., blacklists. These can be created locally and/or by obtaining the most prolific attack sources from alert repositories such as DShield.org or DeepSight.
Katti et al. study the prevalence of “correlated” attacks, i.e., attacks mounted by the same sources against different networks. They find them to be very common and highly targeted, suggesting that real-time collaboration between victims could improve malicious IP detection time. Zhang et al. are the first to introduce the concept of collaborative predictive blacklisting (CPB): different organizations send their logs to a central authority that, in turn, provides them with customized blacklists based on relevance ranking. In follow-up work, Soldo et al. improve on this approach by replacing ranking with an implicit recommender system. Overall, collaborative approaches to threat mitigation are increasingly advocated, with more and more efforts to promote information sharing, including those proposed by CERT, RedSky Alliance, Facebook’s ThreatExchange, or the White House.
In this work, we focus on two open problems that remain largely unaddressed, concerning the impact of collaboration on (1) false positives/negatives, and (2) privacy. Prior work on CPB [14, 16] only focuses on measuring “hit counts”, i.e., the number of true positives, but fails to account for incorrect predictions, i.e., false positives/negatives. Moreover, real-world deployment of collaborative blacklisting is hindered by confidentiality issues, as well as trust, liability, and competitiveness concerns, as sharing alerts could harm an organization’s reputation or disclose sensitive information about customers and business practices. To the best of our knowledge, the peer-to-peer model proposed by Freudiger et al. is the only privacy-friendly approach to the problem: organizations interact in a pairwise manner, aiming to privately estimate the benefits of collaboration, and then share data with “good” partners. However, as discussed later in this paper, it is not clear how to deploy their decentralized techniques in practice.
First, we reproduce, measure, and compare the centralized (non-private) system by Soldo et al. vs the peer-to-peer privacy-friendly one by Freudiger et al., using alerts obtained from DShield.org, involving 70 organizations which report an average of 4,000 daily events over a 15-day time window. We find that the former achieves high hit counts (almost doubling correct predictions compared to no collaboration), but its F1 accuracy is ultimately poor due to high false positives. The latter, in contrast, allows for better control over incorrect predictions, thus resulting in a better F1 score overall, but actually only slightly improves the hit counts over no collaboration since its peer-to-peer approach limits the amount of data that gets shared.
Our measurements lead to the intuition that, if one needs to control false positives, a controlled data sharing approach might kill two birds with one stone: (1) help organizations find a better trade-off between prediction improvement and increase in false positives, and (2) do so while actually minimizing exposure of possibly confidential data. Therefore, we introduce and analyze a novel hybrid model, relying on a semi-trusted authority, or STA, which acts as a coordinating entity to facilitate clustering without having access to the raw data. The STA clusters contributors based on the similarity of their logs (without seeing these logs), and helps organizations in the same cluster to share relevant data. Toward this goal, we perform a set of measurements to shed light on (i) how to cluster organizations, (ii) what should be shared among them, and (iii) how to measure the effect of collaboration on accuracy.
We experiment with a few clustering algorithms using the number of common attacks as a measure of similarity, which can be computed in a privacy-preserving way, and experiment with privacy-friendly within-cluster sharing strategies, namely, only disclosing the details of common/correlated attacks. Overall, we show that our new hybrid model outperforms the peer-to-peer approach of Freudiger et al. in terms of hit counts (4x), while achieving better accuracy than the centralized approach of Soldo et al. (2x).
We gather a dataset of blacklisted IP addresses from DShield.org, a collaborative firewall log correlation system to which various organizations volunteer daily alerts. Each entry in the logs includes a pseudonymized Contributor ID (the target), source IP address (the attacker), source and target port number, and a timestamp. An example of an entry log is illustrated in Table 1.
|Contributor ID||Source ID||Source Port||Target Port||Timestamp|
With DShield’s permission, we collect logs using a web crawler, from February to September 2015, gathering, on average, 10 million logs from 120,000 organizations every day. We exclude entries for invalid or non-routable IP addresses and discard port numbers. Then, for each IP address, we extract its /24 subnet and use /24 addresses for all experiments, following experimental choices made in prior work [9, 14, 16]. This does not necessarily mean that predictive blacklisting algorithms will blacklist entire /24 subnets, since blacklisting an address does not imply blocking all its traffic, but rather subjecting it to further scrutiny, e.g., enforcing rate limiting or only allowing outgoing packets. Nonetheless, recall that our main goal here is to compare the impact of different collaboration approaches on prediction.
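The preprocessing described above (dropping invalid or non-routable addresses and mapping each remaining address to its /24 subnet) can be illustrated with a minimal sketch. It uses Python's standard `ipaddress` module; the function name `to_slash24` is hypothetical, and treating "non-routable" as "not globally routable" is an assumption made here, not a detail given in the paper.

```python
import ipaddress


def to_slash24(ip_string):
    """Return the /24 prefix of a routable IPv4 address, or None to discard it.

    Assumption: "non-routable" is approximated by `not ip.is_global`.
    """
    try:
        ip = ipaddress.IPv4Address(ip_string)
    except ipaddress.AddressValueError:
        return None  # malformed entry, excluded from the dataset
    if not ip.is_global:
        return None  # private, loopback, reserved, etc.
    # strict=False lets us pass a host address and get its enclosing network
    network = ipaddress.ip_network(f"{ip}/24", strict=False)
    return str(network)


if __name__ == "__main__":
    for raw in ["8.8.8.8", "192.168.1.77", "not-an-ip"]:
        print(raw, "->", to_slash24(raw))
```

Under these assumptions, all addresses in the same /24 collapse to a single key, which is the unit used in the rest of the experiments.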
We select a 15-day period, May 17–31, 2015, and restrict our evaluations to a reasonably-sized sample of regularly contributing organizations. We select the top-100 contributors, based on the number of unique IPs reported, that also report logs every day during the 15 days, and notice that most contributors (around 60) submit fewer than 100K logs, while fewer (around 20) submit between 100K and 500K, and only a few organizations contribute large amounts of logs (above 1M). Then, we pick 70 organizations, for each time window, leaving out the top-10 and the bottom-20 contributors. We do so, as in previous work [9, 14], to minimize bias. More specifically, the top contributors contribute a huge number of IPs (orders of magnitude more than other contributors) which might be irrelevant to most organizations, whereas the bottom ones only report very few logs, thus adding little or nothing to the collaboration. Our final sample dataset includes 30 million attacks, contributed by 118 different organizations over 15 days, each reporting a daily average of 600 suspicious (unique) IPs and 4,000 attack events. This constitutes our “ground truth”: if an IP appears in the blacklist for an organization, it is considered to be malicious for that organization.
Note that we have also repeated our experiments on two more sets of DShield logs, using other 15-day periods (over Feb-Dec 2015), but have not found any significant difference in the results.
Notation. We use $O = \{O_1, \ldots, O_n\}$ to denote a group of $n$ organizations, where each $O_i$ holds a dataset $S_i$ of alerts, i.e., suspicious IP addresses along with the related timestamp. We aim to predict IP addresses generating attacks to each $O_i$ in the next day, using, as the training set, both its local dataset $S_i$, as well as the set $S_i^c$, with suspicious IP addresses obtained by collaborating with other organizations. As discussed above, we consider $n = 70$ organizations using alerts collected from DShield.
We use Exponentially Weighted Moving Average (EWMA) to perform prediction. Given a signal over time $r(t)$, we indicate with $\hat{r}(t+1)$ the predicted value of $r(t+1)$, given past observations $r(t')$ at time $t' \le t$. The predicted signal is computed as:
$$\hat{r}(t+1) = \sum_{t'=1}^{t} \alpha (1-\alpha)^{t-t'} \, r(t')$$
where $\alpha \in (0,1)$ is a smoothing coefficient, $t' = 1, \ldots, t$ denotes the training window, and $t+1$ is the time slot to be predicted. For small values of $\alpha$, EWMA aggregates past information uniformly across the training window, while, with a large $\alpha$, the prediction algorithm focuses more on events taking place in the recent past.
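As a minimal sketch of the predictor, assume the standard EWMA form in which the observation at time t' within a training window of length t receives weight α(1−α)^(t−t'). The function name `ewma_predict` and the list-based input are illustrative choices, not taken from the paper's code.

```python
def ewma_predict(history, alpha):
    """Predict the next value of a signal from past observations.

    history: observations r(1), ..., r(t), oldest first (e.g. 1 if an IP
             attacked on that day, 0 otherwise).
    alpha:   smoothing coefficient in (0, 1).
    Assumed weighting: alpha * (1 - alpha) ** (t - t') for observation t'.
    """
    t = len(history)
    score = 0.0
    for idx, value in enumerate(history):
        age = t - 1 - idx  # 0 for the most recent observation
        score += alpha * (1 - alpha) ** age * value
    return score


if __name__ == "__main__":
    recent = [0, 0, 0, 0, 1]  # attack seen only on the last training day
    old = [1, 0, 0, 0, 0]     # attack seen only on the first training day
    for alpha in (0.1, 0.9):
        print(alpha, ewma_predict(recent, alpha), ewma_predict(old, alpha))
```

With a large α the recent attack dominates the score, while with a small α the two histories score much closer to each other, matching the behavior described in the text.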
Throughout our evaluations, we use the following metrics to evaluate the performance of the predictions.
True and False Positives. For each time window and for each organization, we count True Positives (TP) as well as False Positives (FP). A TP occurs when the prediction algorithm includes an IP address in an organization’s predictive blacklist that does appear in its testing set, and an FP occurs when it does not.
False Negatives. For each time window/organization, we generate predictive whitelists, i.e., sets of IPs that are not likely to attack an organization the next day, and count a False Negative (FN) when a whitelisted IP address instead appears in the testing set.
TP Improvement and FP/FN Increase. We also measure the average improvement/increase in TP, FP, and FN when compared to a baseline local approach, i.e., when no collaboration occurs between organizations and each of them makes its predictions based only on its local dataset. The improvement in TP is calculated as $TP_{Impr} = (TP_c - TP)/TP$, where $TP_c$ is the number of true positives after collaboration and $TP$ the number without. Similarly, the increase in FP and FN is denoted, resp., as $FP_{Incr}$ and $FN_{Incr}$.
Precision, Recall and F1-Score. We calculate the True Positive Rate (TPR), aka recall, False Positive Rate (FPR), as well as Positive Predictive Value (PPV), aka precision, defined as: TPR=TP/(TP+FN), FPR=FP/(FP + TN), PPV=TP/(TP + FP), and derive the F1 measure, i.e., the harmonic mean of precision and recall: F1=2·PPV·TPR/(PPV + TPR).
Remarks on FP: The absence of an IP from our testing set can occur either when the IP is not considered suspicious or if it does not generate requests. While we cannot actually distinguish between the two cases, in the latter a FP is actually less “severe” than in the former, thus our FP count may be a bit more conservative. However, our main goal is really to measure and compare with each other the impact of different collaboration strategies on predictions so we use this method without loss of generality.
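The metrics defined in this section can be computed from three sets per organization and time window: the predictive blacklist, the predictive whitelist, and the testing set. The sketch below is illustrative only; the function names are hypothetical, and the relative-change helper assumes that improvement/increase is measured as (after − before)/before with respect to the local baseline.

```python
def confusion_counts(blacklist, whitelist, testing_set):
    """Count TP, FP and FN for one organization and one time window."""
    tp = len(blacklist & testing_set)   # blacklisted and seen attacking
    fp = len(blacklist - testing_set)   # blacklisted but absent from testing set
    fn = len(whitelist & testing_set)   # whitelisted but seen attacking
    return tp, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Return (PPV, TPR, F1), using 0.0 when a denominator is zero."""
    ppv = tp / (tp + fp) if tp + fp else 0.0
    tpr = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * ppv * tpr / (ppv + tpr) if ppv + tpr else 0.0
    return ppv, tpr, f1


def relative_change(after, before):
    """Relative improvement/increase over the local (no collaboration) baseline."""
    return (after - before) / before


if __name__ == "__main__":
    blacklist = {"a", "b", "c"}
    whitelist = {"d", "e"}
    testing = {"a", "b", "d"}
    tp, fp, fn = confusion_counts(blacklist, whitelist, testing)
    print(tp, fp, fn, precision_recall_f1(tp, fp, fn))
```

Note that, as remarked above, an FP in this accounting only means the address did not appear in the testing set, not that it is known to be benign.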
We first evaluate Soldo et al.’s CPB approach based on implicit recommendation. We do so aiming to: (1) evaluate false positives and false negatives, which were not taken into consideration in their work, and (2) compare against privacy-friendly approaches, presented later. Essentially, Soldo et al.’s work builds on that of Zhang et al., which is based on a relevance ranking scheme similar to PageRank, measuring the correlation of an attacker to a contributor relying on their history as well as the attacker’s recent log production patterns. Soldo et al. significantly improve on this, by using an implicit recommendation system to discover similar victims as well as groups of correlated victims and attackers. The presence of attacks performed by the same source around the same time leads to stronger victim similarity, and a neighborhood model (k-NN) is applied to cluster similar victims. Cross Association (CA) co-clustering is then used to discover groups of correlated attackers and victims, and prediction within the cluster is done via the EWMA time series algorithm (TS) to capture attacks’ temporal trends. In other words, the prediction score for each organization is a weighted ensemble of three methods (TS, k-NN and CA). We have re-implemented their system in Python, using Chakrabarti’s CA implementation.
We start by measuring the basic predictor, which only relies on a local EWMA time series algorithm (TS), using the smoothing coefficient that yields the best results. Then, we apply the co-clustering techniques (TS-CA), and, finally, implement their full scheme by combining k-NN to cluster victims based on their similarity with CA and TS (TS-CA-k-NN). Fig. 1 illustrates the improvement/increase in TP, FP, FN (compared to the TS baseline) as well as TPR, PPV, and F1, with various values of k (ranging from 1 to 35) used by the k-NN algorithm to discover similar organizations. Obviously, the k-NN parameter does not affect TS-CA and TS.
Fig. 1(a) shows that, with TS-CA-k-NN, the TP improvement increases significantly with k, almost doubling the “hit count” compared to the TS baseline, whereas TS-CA improves less. On the other hand, however, FP increases too, 5- to 50-fold, as clusters become bigger (Fig. 1(b)), and naturally, this stark increase in FP leads to low precision, as shown in Fig. 1(e). FNs also always increase compared to TS (Fig. 1(c)): they double with TS-CA, and also increase with TS-CA-k-NN, though less for larger values of k. The choice of k also affects TPR (Fig. 1(d)). However, the TP improvement does not correspond to a comparable increase in TPR, due to the poor FN performance. Overall, Soldo et al.’s techniques achieve poor F1 measures with both TS-CA and TS-CA-k-NN, actually lower than a simple local time-series prediction.
Next, we evaluate the privacy-friendly peer-to-peer approach to CPB by Freudiger et al. Organizations interact pairwise, aiming to privately estimate the benefits of collaboration, and then share data with entities that are likely to yield the most benefits. They also use DShield data and perform prediction using EWMA. They find that: (1) the number of common attacks is the best predictor of benefits, which can be estimated privately, using Private Set Intersection Cardinality (PSI-CA); and (2) sharing only the intersection of attacks, which can be done privately using Private Set Intersection (PSI), is almost as beneficial as sharing everything. Their goal is really to assess benefit estimation/sharing strategies, rather than to focus on deployment. They assume a network of 100 organizations, select the “top 50” among all possible 4950 pairs (in terms of estimated benefits), and only experiment on those. Naturally, without a coordinating entity, it is impossible to rank the pairs, so they suggest that one should collaborate either with organizations whose estimated benefits are above a threshold, although it is not stated how to set this threshold, or with the top organizations with the biggest estimated benefits, but they do not experiment with or discuss how this choice impacts overhead or true/false positives. We replicate both approaches: (A) with the top 1% to 5% of global pairs, and (B) having each organization pick its most similar organizations.
Fig. 2 shows the improvement/increase in TP, FP, FN (compared to a baseline with no sharing) as well as TPR, PPV and F1 with increasing percentage of global pairs (A). We omit plots for approach (B) since they are worse across the board, although we discuss them next. Looking at , (A) yields increase when of global pairs are selected whereas for (B), i.e. picking local pairs, increases along with the number of local pairs selected. (A) has a rather small ( increase when the top pairs are selected) compared to (B) which is affected by the number of pairs that each organizations picks for collaboration. When an organization collaborates with 5 others a is observed on average while when it collaborates with 30 others reaches . Moreover, we find both approaches achieve a decrease in false negatives with the second approach achieving bigger decreases as the number of collaborators increases.
Overall, both approaches improve precision and recall of the system, yielding higher F1 scores compared to a local approach. Although the increase in TP is not as high as with the non-private approach of Soldo et al., a more balanced increase of false positives and a decrease of the false negatives seem possible. However, the system is limited in the amount of new information organizations learn (e.g., only events about IPs they have already seen are shared) as well as in scalability, since both the computation of the metrics and the actual data sharing are conducted pair-wise (if there are $n$ collaborating entities, the complexity of the data sharing would be $O(n^2)$).
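The two partner-selection strategies replicated in this section, (A) keeping a top fraction of all pairs ranked by estimated benefit and (B) letting each organization pick its most similar partners, can be sketched as follows. This is a plaintext illustration only: the benefit estimate is taken to be the number of common attack sources, computed directly on sets, whereas the actual system estimates it privately with PSI-CA. All function names and the toy data are hypothetical.

```python
from itertools import combinations


def pair_benefits(datasets):
    """Map each unordered pair of organizations to its number of common attackers."""
    return {
        (a, b): len(datasets[a] & datasets[b])
        for a, b in combinations(sorted(datasets), 2)
    }


def top_global_pairs(datasets, fraction):
    """Approach (A): keep the top fraction of all pairs, ranked by common attacks."""
    scores = pair_benefits(datasets)
    keep = max(1, int(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep]


def top_local_partners(datasets, org, k):
    """Approach (B): one organization picks its k most similar partners."""
    others = [o for o in sorted(datasets) if o != org]
    others.sort(key=lambda o: len(datasets[org] & datasets[o]), reverse=True)
    return others[:k]


if __name__ == "__main__":
    logs = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {9}, "D": {1, 2, 3, 9}}
    print(top_global_pairs(logs, 0.2))
    print(top_local_partners(logs, "A", 2))
```

Approach (A) needs a global ranking over all pairs, which is exactly what is hard to obtain without a coordinating entity, while approach (B) only needs each organization's own pairwise estimates.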
Centralized state-of-the-art CPB techniques have only focused on improving “hit counts,” but, as shown above, they generate very high false positive rates. In practice, organizations might not adopt such solutions if they generate a large number of false alarms. Naturally, one could design better centralized approaches that yield better accuracy, e.g., by learning to discard the data that yield false positives. However, our intuition is that in this case a privacy-preserving approach might be best suited as it can (i) help organizations find a better trade-off between prediction improvement and increase in false positives, and (ii) do so while actually minimizing exposure of possibly confidential data.
Overview. To this end, we introduce a novel hybrid system which relies on a semi-trusted authority, or STA, acting as a coordinating entity to facilitate clustering without having access to the raw data. In other words, the STA clusters contributors based on the similarity of their logs (without accessing these logs), and helps organizations in the same cluster to share relevant logs.
The system involves four steps. (1) First, organizations interact in a pairwise manner to privately compute a similarity measure of their logs, based on the number of common attacks (similar to Freudiger et al.). Then, (2) the STA collects the similarity measures from each organization and performs clustering using one of three possible algorithms, i.e., Agglomerative Clustering, k-means, or k-NN. (Note that, to ease presentation, we do not plot results using Agglomerative Clustering because it yields the worst results.) Next, (3) the STA reports to each organization the identifiers of other organizations in the same cluster (if any), so that they collaboratively, yet privately, share logs to boost the accuracy of their prediction, by either sharing common attacks (intersection), correlated attacks (IP2IP), or both. For comparison, we also consider baseline approaches, i.e., sharing nothing (local) or sharing everything (global). Finally, (4) each organization performs EWMA prediction (again, with the same smoothing coefficient used in our earlier evaluation) based on its logs, plus those from entities in the same cluster. This approach is hybrid in that, while involving a central authority, data sharing is privacy-friendly: in (1) the number of common attacks can be computed using PSI-CA, while in (3) sharing of common attacks can occur using PSI and sharing of correlated attacks using private aggregation via succinct sketches (Melis et al.).
Settings. We once again use datasets and settings from Section 2. Also, for the IP2IP method, we only consider the top-1000 attackers (i.e., the top-1000 heavy hitters) in each cluster, for each 5-day training-set window, rather than looking for correlations over all the /24 IP space. We fix the value for the k-NN based recommendation to 50, as it provides the best results in our experiments.
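To make the flow of the hybrid system concrete, the following plaintext sketch walks through similarity computation, k-NN neighborhood construction, and intersection-based sharing. It is only an illustration under simplifying assumptions: similarities are computed in the clear (the system computes them privately), no distance threshold is applied, the IP2IP strategy is omitted, and each organization's log is modeled as a dictionary from /24 subnet to a list of days on which it was seen. All names are hypothetical.

```python
def common_attack_matrix(datasets):
    """Steps (1)-(2) input: number of common attack sources for every ordered pair."""
    orgs = sorted(datasets)
    return {
        a: {b: len(set(datasets[a]) & set(datasets[b])) for b in orgs if b != a}
        for a in orgs
    }


def knn_neighborhoods(similarity, k):
    """Step (2): for each organization, its k most similar organizations."""
    return {
        org: sorted(row, key=row.get, reverse=True)[:k]
        for org, row in similarity.items()
    }


def shared_events(datasets, org, partners):
    """Step (3), 'intersection' strategy: events that partners hold about
    sources the organization has already seen itself."""
    extra = {}
    for partner in partners:
        for ip in datasets[org].keys() & datasets[partner].keys():
            extra.setdefault(ip, []).extend(datasets[partner][ip])
    return extra


if __name__ == "__main__":
    logs = {
        "A": {"1.1.1.0/24": [1, 2], "2.2.2.0/24": [3]},
        "B": {"1.1.1.0/24": [2, 4], "3.3.3.0/24": [1]},
        "C": {"2.2.2.0/24": [5], "1.1.1.0/24": [5], "4.4.4.0/24": [1]},
    }
    hoods = knn_neighborhoods(common_attack_matrix(logs), k=1)
    print(hoods)
    print(shared_events(logs, "A", hoods["A"]))
```

Step (4) would then run the EWMA predictor on the union of an organization's local events and the events returned by the sharing step.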
k-means. Next, we use k-means for clustering and decide to restrict to stronger correlations, by only taking into account organizations closer to the cluster’s centroid, and excluding the rest of them as outliers. We set a distance threshold and choose the value that yields the best result, i.e., the cluster distance value below which 40% of the organizations can be found. Fig. 2(a)–2(c) plot the average improvement in TP and increase in FP and FN. is almost constant with IP2IP () independent of the cluster sizes, while with the other methods it decreases faster due to the distance thresholds, ranging from with global for to of intersection for . IP2IP shows steady values compared to other methods (, i.e., a decrease) which leads to a better performance in TPR, as shown in Fig. 2(d), for (up to ). Furthermore, intersection yields the best performance in (), with . Fig. 2(f) shows the best F1 measure () is reached with , due to a peak both in PPV and TPR. IP2IP performs slightly worse () than local () while poor F1 values for global, with , () are due to its bad PPV () – see Fig. 2(e).
k-NN. Recall that indicates the number of nearest neighbors that each entity considers as its most similar ones. Thus, organizations can end up in more than one neighborhood. Since the algorithm builds a neighborhood for each organization, not all clusters have the same strength, so we only consider strong clusters in terms of their members similarity and as done with k-means, after tuning the parameters, we set a distance threshold as the 40th percentile to leave possible outliers out of the clusters. From Fig. 3(a), we observe that IP2IP+intersection yields the second best performance in (, with ), while global peaks at . In terms of , IP2IP doubles it (for ), while intersection achieves the lowest value with (again, for ). As with previous clustering algorithms, we notice that intersection yields the best decrease in FN, i.e., with . Intersection also achieves the highest TPR (up to ) with larger cluster sizes (i.e., for , while its combination with the IP2IP reduces it () – see Fig. 3(d). Fig. 3(e) shows that intersection has the best PPV ( for ), similar to local (), while IP2IP performs worse () due to higher (almost doubling the FP for ). Finally, from Fig. 3(f), note that intersection yields the highest F1 ( for ).
Summary of results. We summarize the best results for each clustering algorithm, in terms of best F1, recall, precision, and TP improvement in Tables 3–5. We note that intersection is the sharing mechanism that maximizes all metrics, except for TP improvement, which is instead maximized with IP2IP+intersection. Both k-means and k-NN peak at in F1 including, respectively, and collaborators over all time windows. Agglomerative clustering involves all contributors and achieves . k-NN with yields the best results for TPR (), while both k-NN with and k-means with achieve in PPV. In terms of , k-means reaches a maximum of with and clusters of size on average, selecting collaborators overall. Slightly lower improvements are achieved with other clustering algorithms, but with more collaborators benefiting from sharing, as well as fewer FP.
Data sharing always helps organizations forecast attacks, compared to performing predictions locally. Predicting based on all data from collaborators yields the highest improvement in TP, especially for bigger clusters, but with a dramatic increase in FP. When organizations share correlated attacks (IP2IP), we observe a steady TP improvement, while sharing common attacks (intersection) outperforms the former when bigger clusters are formed. Moreover, intersection introduces a lower FP increase, ultimately leading to better precision and F1 measures. IP2IP+intersection always outperforms the two separate methods in terms of TP improvement, thus, it is the recommended strategy if one only wants to maximize the number of predicted attacks.
|Setting||Max TPR [Sharing Intersection]|
Impact of cluster size. With agglomerative clustering, each organization is assigned to exactly one cluster and thus participates in/benefits from collaboration. We observe higher TPR for bigger clusters and, generally, a stable improvement in TP is achieved on average. Similar results are obtained with k-means when all organizations are assigned to clusters. However, when we set a distance threshold, creating more consistent clusters, we observe fluctuations in TPR: as clusters get smaller much faster (in relation to value), IP2IP starts outperforming intersection. This indicates that correlated attacks can improve knowledge of organizations and enhance their local predictions, especially in smaller clusters. With k-NN, a different behavior is observed: for smaller clusters, IP2IP achieves higher TPR (up to for ) but, as clusters get bigger, intersection yields the best results (up to for ). Overall, collaborating in big clusters leads to high but at the same time it introduces significant .
Increase/Improvement in TP/FP/FN. We observe that, for all clustering algorithms, maximizing always leads to higher , from of k-NN up to of Agglomerative. The settings that maximize the F1 measure, TPR, and PPV, (when sharing intersection) also minimize , e.g. agglomerative with achieves . In general, we observe that (privacy-friendly) collaboration does yield a remarkable increase in TP but also in FP, which results in a limited improvement in F1 score compared to predicting using local logs only.
Overall, our measurements allow us to quantify how different collaboration strategies affect prediction in terms of increasing true positives, false positives, and false negatives, and in general precision, recall, and F1. Ultimately, the main goal is to find settings that improve TP while keeping the increase in FP as low as possible. In this context, the best approach is sharing common and correlated attacks (IP2IP+intersection) with k-NN (see Table 5).
Hybrid approach vs state of the art [9, 14]. When comparing the hybrid approach to Soldo et al., we observe that their system achieves a higher maximum TP improvement. However, our privacy-preserving techniques outperform it in terms of recall (TPR) (with k-NN, up to 18% increase) as well as precision (with k-means, up to 15% increase) and F1 measure (with k-NN). Finally, compared to the peer-to-peer approach of Freudiger et al., the hybrid approach yields better results in terms of TP improvement (0.61 for k-means, vs 0.13 for top 3% of global pairs) and TPR (0.77 for k-NN, vs 0.66 for top 1% of global pairs), but a similar F1 score (0.30 vs 0.28), due to the latter’s smaller increase in FP.
Overall, we conclude that a controlled data sharing approach, compared to a centralized one, helps organizations find a better trade-off between prediction improvement and increase in false positives, while minimizing exposure of possibly confidential data.
|Setting||Max TP Improvement [Sharing IP2IP+intersection ]|
As discussed above, our hybrid system involves four steps: (1) secure computation of pairwise similarity, (2) clustering, (3) secure data sharing within the clusters, and (4) time-series prediction. To assess its scalability, we need to evaluate computation and communication complexities incurred by each step. Naturally, steps (1) and (3) dominate complexities as they require running a number of cryptographic operations (involving public-key crypto) that depends on the number of organizations involved. In fact, clustering incurs a negligible overhead: on commodity hardware, to perform clustering with 1,000 organizations, it takes for k-means, for agglomerative and for k-NN ().Also, time-series EWMA prediction requires per IP, so it takes for 1,000 IPs. As we compute pairwise similarity based on the amount of common attacks between two organizations, and support its secure computation via PSI-CA , step (1) requires a number of protocol runs quadratic in the number of organizations. In our experiments, it takes and MB bandwidth for one protocol execution, using 2048-bit moduli, with sets of size 4,000 (the average number of attacks observed by each organization). As for (3), i.e., secure within-cluster sharing of events related to common attacks (intersection), we rely on PSI-DT , and it takes and MB for a single execution with the same settings. Therefore, complexities may become prohibitive when more organizations are involved or more alerts are used.
Aiming to improve scalability, we also implement a variant supporting secure computation of pairwise similarity as well as secure log sharing without a quadratic number of public-key operations/quadratic communication overhead. Recall that we rely on a semi-trusted authority, STA, for clustering and coordination, which is assumed to follow protocol specifications and not to collude with other organizations, thus, we can actually use it to also help with secure computations. Inspired by Kamara et al.’s server-aided PSI , we extend our framework by replacing public-key cryptography operations with pseudo-random permutations (PRP), which we instantiate using AES. Specifically, we minimize interactions among pairs of organizations so that the complexity incurred by each of them is constant, while only imposing a minimal, linear communication overhead on STA.
Our extension involves four phases: (i) setup where, as in Kamara et al.’s scheme, one organization generates a random key and sends it to the other organizations, (ii) encryption (see Algorithm 1), where each organization evaluates the PRP on each entry in their sets and encrypts the associated timestamp, (iii) O2O computation (see Algorithm 2), where the STA computes the magnitude of common attacks between each pair of organizations in order to perform clustering, and (iv) log sharing (see Algorithm 3), where organizations in the same cluster receive information about common attacks (i.e., the associated timestamps). Note that building the O2O matrix is actually optimized using hash tables (i.e., dense_hash_set and dense_hash_map from Sparsehash). Also, since sets in our system are multi-sets, we concatenate counters to the IP address, so that the STA cannot tell which and how many IPs appear more than once.
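The server-aided idea can be sketched in a few lines. This is a simplified illustration, not the paper's Algorithms 1 to 3: it substitutes HMAC-SHA256 from the Python standard library for the AES-based PRP (a stand-in chosen only because the standard library has no AES), and it omits timestamp encryption, the multi-set counters, and the log-sharing phase. The function names `blind_set` and `o2o_matrix` are hypothetical.

```python
import hashlib
import hmac
import secrets
from collections import defaultdict


def blind_set(key, ips):
    """Organization side: replace each entry with a keyed pseudorandom tag.

    Stand-in assumption: HMAC-SHA256 is used here instead of an AES-based PRP.
    """
    return {hmac.new(key, ip.encode(), hashlib.sha256).hexdigest() for ip in ips}


def o2o_matrix(blinded):
    """STA side: count common tags per pair of organizations.

    An inverted index (tag -> holders) avoids comparing every pair of sets
    element by element; the STA only ever sees tags, never raw addresses.
    """
    holders = defaultdict(list)
    for org, tags in blinded.items():
        for tag in tags:
            holders[tag].append(org)
    counts = defaultdict(int)
    for orgs in holders.values():
        orgs.sort()
        for i, a in enumerate(orgs):
            for b in orgs[i + 1:]:
                counts[(a, b)] += 1
    return dict(counts)


if __name__ == "__main__":
    shared_key = secrets.token_bytes(32)  # generated by one organization in setup
    raw = {
        "A": {"1.1.1.0/24", "2.2.2.0/24", "3.3.3.0/24"},
        "B": {"2.2.2.0/24", "3.3.3.0/24", "4.4.4.0/24"},
        "C": {"9.9.9.0/24"},
    }
    blinded = {org: blind_set(shared_key, ips) for org, ips in raw.items()}
    print(o2o_matrix(blinded))
```

Because every organization uses the same key, equal addresses map to equal tags, so the STA can count overlaps and cluster without learning the underlying addresses, provided it does not collude with any key holder.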
Experimental Evaluation. We benchmark the performance of PSI-CA  and PSI-DT  using 2048-bit moduli, modifying the OpenSSL/GMP-based C implementation in , as well as the PRP-based scheme presented above and inspired by Kamara et al.’s work . Experiments are run using two 2.3GHz Intel Core i5 CPUs with 8GB of RAM connected via a Mbps Ethernet link. Figures 4(a) and 4(b) plot computation and communication complexities incurred by an individual organization vis-à-vis the total number of organizations involved in the system, while Fig. 4(c) reports the communication overhead introduced on the STA-side for the PRP scheme. As expected, complexities for PSI-CA/PSI-DT protocols on each organization grow linearly in the number of organizations (hence, quadratic overall). For instance, if 1,000 organizations are involved, it would take about 16 minutes per organization, each transmitting 1GB. Whereas, the PRP-based scheme incurs constant complexities on each organization ( and KB) and a low communication overhead on the STA (about 100MB) for 1,000 organizations (Fig. 4(c)).
We also evaluate the IP2IP method whereby organizations interact with STA in order to discover cluster-wide correlated attacks. Assuming clusters of organizations and an IP2IP matrix of (recall we consider the whole /24 IP space), we measure a running time per organization with KB of bandwidth as well as a overhead on the STA with MB bandwidth. Using the private Count-Min sketch based implementation by Melis et al. , we can compress to a logarithmic factor with a small, bounded loss, and the private aggregation is done over 10,336 elements. Even if clusters are bigger than 100, as detailed in , one can still perform private aggregation on multiple subgroups (e.g., of size 100) without endangering organizations’ privacy.
Security. The protocols do not leak any information about the logs of each organization to the STA, with or without the server-aided variant. Clustering is performed over similarity measures computed obliviously to the STA, as is within-cluster data sharing. Privacy-preserving computation relies on existing secure protocols such as PSI-CA/PSI-DT by De Cristofaro et al. [7, 6], server-aided PSI by Kamara et al. , as well as private recommendation via succinct sketches by Melis et al. . Therefore, we do not provide any additional proofs in the paper, as the security of our techniques follows directly from that of these protocols.
This paper presented the results of a measurement study of collaborative predictive blacklisting (CPB). We evaluated a number of metrics on a real-world dataset obtained from DShield.org, aiming to shed light on the effects of collaboration when considering two state-of-the-art approaches: one, non privacy-preserving, relying on a trusted central party , and another, peer-to-peer, using privacy-preserving data sharing . We also introduced a third, hybrid approach that aims to combine the best of the two worlds.
Naturally, having access to more logs does not necessarily result in better predictions. In fact, our experiments showed that the techniques proposed by Soldo et al.  achieve impressive hit counts (almost doubling the number of correct predictions compared to local predictions) but suffer from poor precision due to high FP. On the other hand, the privacy-friendly decentralized system proposed by Freudiger et al.  achieves better F1 scores overall, although with a decreased improvement in TP. Finally, our analysis shows that our hybrid approach outperforms both approaches, balancing out true and false positives, while maintaining privacy protection.
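The trade-off between hit counts and precision can be illustrated with a small worked example. It uses the standard definitions of precision, recall, and F1, which may differ in detail from the metrics defined earlier in the paper; the function name and all sets of sources below are made up and are not results from our experiments.

```python
def blacklist_metrics(predicted, attackers):
    """Compare a predicted blacklist with the sources that actually attacked."""
    tp = len(predicted & attackers)   # blacklisted and attacked
    fp = len(predicted - attackers)   # blacklisted but did not attack
    fn = len(attackers - predicted)   # attacked but not blacklisted
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Hypothetical ground truth for one test day.
attackers = {"a", "b", "c", "d"}

# A local blacklist: few hits, but also few wrong entries.
local = blacklist_metrics({"a", "b", "x"}, attackers)

# A collaborative blacklist: twice the hits, but many more wrong entries.
collaborative = blacklist_metrics(
    {"a", "b", "c", "d", "p", "q", "r", "s", "t", "u", "v", "w"}, attackers
)

print(local)
print(collaborative)
```

In this toy setting the collaborative list doubles the number of true positives, yet its F1 score is lower than that of the local list because the additional false positives reduce precision.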
As part of future work, we plan to conduct a longitudinal measurement to fully grasp the effectiveness of privacy-enhanced CPB in the wild, apply our methods to other datasets, and experiment with more advanced machine learning techniques to improve overall performance.
We provide detailed information for researchers wishing to reproduce the experimental results presented in this paper. Please install Python 2.7 as well as the following Python packages: numpy 1.14.0, scipy 1.0.0, scikit-learn 0.19.1, pandas 0.22.0 and matplotlib 2.0.0. All of the above packages can be installed via “pip”.
Code. Source code is available at the following git repository:
Dataset. To obtain the DShield dataset that was used in our experiments, use the following download link and extract its contents (i.e., the .pkl files) in the ‘data’ folder of the cloned repository.
Soldo et al. To replicate the experiments for Soldo et al.’s implicit recommendation system, we also need the MATLAB implementation of Chakrabarti et al.  for the Cross Associations (CA) co-clustering algorithm. To this end, one should install Octave 4.0.0 as well as the Python package oct2py 3.5.0 (which can also be installed via “pip”). First, to compile the Cross Associations algorithm, follow these steps:
Then, to link our Python implementation with the Octave workspace of CA, set the path accordingly in the file ‘collsec/soldo/CA_python/ca_utils.py’ (line 6). Finally, to run the experiments of Section 3.1:
Note. To configure the parameter k of the k-NN algorithm included in the ensemble method of , modify the file ‘collsec/soldo/top_neighbors.py’ (line 4). Moreover, if experiments are executed for various values of k, modify the file ‘collsec/soldo/soldo.py’ (line 41) to prevent the CA algorithm from running again.
Note. To configure the length of the training and testing windows for the system of Freudiger et al. , modify the file ‘collsec/utils/dimva_util.py’.
Hybrid Approach. To launch the experiments for our proposed hybrid scheme (see Section 4) please execute the following steps:
Note. By default, our implementation uses a 5-day training window and a 1-day testing window, as done in previous work [9, 14]. To change this setting, adjust the parameters indicated in the files ‘collsec/utils/util.py’ and ‘collsec/utils/time_series.py’.
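For readers unfamiliar with this setup, a 5-day training window followed by a 1-day testing window can be sketched as follows. This is only an illustration and not the code in the repository: the function name, the day indexing, and the choice to slide the window forward by the length of the testing window are all assumptions.

```python
def sliding_windows(num_days, train_len=5, test_len=1):
    """Return (train_days, test_days) pairs of day indices.

    Assumption: the window advances by test_len days at each step, so
    every day after the first train_len days is tested exactly once.
    """
    windows = []
    start = 0
    while start + train_len + test_len <= num_days:
        train = list(range(start, start + train_len))
        test = list(range(start + train_len, start + train_len + test_len))
        windows.append((train, test))
        start += test_len
    return windows


# Eight days of logs, indexed 0 to 7, with the default 5/1 split.
windows = sliding_windows(8)
for train, test in windows:
    print("train on days", train, "test on days", test)
```

With eight days of data this yields three rounds: training on days 0 to 4 and testing on day 5, then training on days 1 to 5 and testing on day 6, and finally training on days 2 to 6 and testing on day 7.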
Results. The results of all the above scripts are stored in the folder titled ‘collsec/results’. To visualize the results and obtain the figures presented in the paper, type the following commands: