1 Introduction
Securing modern networks is a nontrivial challenge that requires high throughput (to evaluate the behavior of the network in real time) and expert knowledge (to decide whether suspicious or malicious activity is taking place on the network). Network intrusion detection, in particular, is concerned with the early detection of attempts to break into and/or compromise a network. Packet collection techniques allow the continuous monitoring of networks and the collection of large amounts of traffic data.
Modern network intrusion detection data sets quickly exceed the processing capacity of human reviewers, and thus automatic processing techniques for filtering and analyzing the data become necessary. A simple solution consists in eliciting knowledge from experts and encoding it into logical rules. Such rule-matching or signature-matching algorithms can process the data quickly, but their effectiveness is limited by the kind and the range of knowledge provided by the experts; in particular, these algorithms do not generalize and are unable to deal with novel data that do not perfectly match the encoded rules (Gardiner and Nagaraja, 2016). Machine learning has been put forward as a potential solution to this shortcoming. By learning directly from the data, and not from the particular knowledge provided by experts, machine learning algorithms aim at learning patterns that may hold not only for historical collected data, but also for future unforeseen data.
Many current machine learning techniques rely on large amounts of data to learn useful patterns. The success of deep learning, in particular, has been explained by, among other factors, the availability of big data sets (LeCun et al., 2015). However, large data sets also have significant drawbacks. Large collections of data are problematic to archive and to store on drives; they are challenging to manipulate and load, requiring either a high amount of memory or frequent swapping operations; they are often redundant, to the point that this redundancy contributes little or nothing to the learning process.
Beyond deep learning, other machine learning algorithms may be severely and negatively affected by large redundant data sets. This is the case, for instance, of Bayesian machine learning. Bayesian machine learning provides a rigorous framework for performing inference over data. It allows the estimation of complete probability distributions, a precise evaluation of uncertainty, the possibility of neatly integrating prior expert knowledge in the learned model, and the ability to update the learned model when provided with new data. However, all these possibilities come at a high computational cost, as standard Bayesian learning relying on Markov chain Monte Carlo (MCMC) algorithms does not scale well with respect to the size of the data.
In the presence of large redundant data sets, a possible solution to make inference via MCMC feasible consists in reducing the number of samples used to learn. Simple solutions include statistical techniques like random sampling or unsupervised learning algorithms such as clustering via k-means (Bishop, 2006). A more exact approach is based on the idea of creating coresets: instead of learning on the whole redundant data set, it may be possible to define a (weighted) subset of samples which is probabilistically guaranteed to return a result close to the one that would be obtained by processing the whole data. In Bayesian machine learning, Campbell and Broderick (2017, 2018) recently proposed an efficient and promising algorithm to learn Bayesian coresets in Hilbert space (BCH) that computes a weighted subset of samples by smartly exploiting the structure of that space.
Like many other applications, network intrusion detection could take advantage of Bayesian data analysis. A careful estimation of uncertainty when evaluating the possibility of a threat on the network is critical in order to take decisions. The possibility of integrating expert knowledge and of updating models is also an important feature in the complex and constantly changing environment of networks. Unfortunately, though, the amount of data collected by capturing packets on a network quickly exceeds what is feasible for Bayesian machine learning via MCMC. By filtering redundant data and reducing the number of samples via BCH, network traffic data sets could be reduced in size (thus offering a concrete benefit for storage and management) and processed using Bayesian techniques (thus producing more complete and versatile results).
In this paper we conduct a preliminary study of the possibility of applying BCH to network intrusion detection data and performing Bayesian machine learning via MCMC. We consider two main problems. (i) How effective is the use of BCH to learn models of network intrusion? We address this question considering (subsets of) realistic network traffic data and evaluating the effectiveness of BCH on a simple supervised learning problem. While an evaluation of BCH on computer security data (phishing data sets) is already provided in Campbell and Broderick (2017, 2018) in terms of metrics of posterior quality, we offer here an analysis in terms of accuracy, which is more relevant to the field of cybersecurity. In particular, we consider our results in light of the trade-off between time-space savings and accuracy and with respect to the sensitivity of BCH to its hyperparameters. (ii) How effective is the use of BCH to reduce the amount of data in a streaming environment? We answer this question by considering the same realistic network data, but setting up a more challenging scenario in which the data samples are received sequentially. In this context, we analyze, once again in terms of accuracy and time-space savings, the advantages that BCH may bring when processing the data in real time, upon arrival. Our results confirm that the trade-off between accuracy and time-space savings when using BCH is mainly regulated by one of the free hyperparameters of BCH. Moreover, we show that the algorithm could be successfully used in a streaming environment, where it succeeds in substantially reducing the computational time over several iterations and in ensuring good performance by aggregating coresets over the same iterations.
As a side contribution, while tackling these questions, we also offer a port of the BCH algorithms (https://github.com/trevorcampbell/bayesiancoresets; Campbell and Broderick, 2017) to the probabilistic programming library Edward (Tran et al., 2016). Specifically, we adapt the original code for coreset computation to work with Edward models, thus exploiting the probabilistic programming features of Edward (http://edwardlib.org/) and the automatic differentiation feature of TensorFlow (https://www.tensorflow.org/). Code for this implementation is available online (https://github.com/FMZennaro/BayesianCoresetsEdward).
The rest of the paper is organized as follows. Section 2 briefly describes Bayesian machine learning and BCH. Section 3 explains the problem of network intrusion detection. Section 4 introduces our experimental setup. Section 5 tackles our first research question by analyzing the use of BCH on network traffic data. Section 6 deals with our second research question by evaluating the use of BCH in a streaming environment. Finally, Section 7 summarizes our results and presents some of the several avenues available for further development of this work.
2 Background
In this section we first introduce our general notation for the learning problem. We review the Bayesian approach to learning and its limitations. We then explain how Bayesian coresets deal with the problem of scalability. Finally, we review alternative approaches to work around the computational challenges of Bayesian learning.
2.1 Notation
In the following, we will deal with standard supervised learning problems. We consider a data matrix $X$ of dimension $N \times M$, containing $N$ samples described by $M$ features; a sample $x_i$ is a vector of dimension $M$. We also assume we are given a label vector $y$ of dimension $N$, such that for each sample $x_i$ we have a label $y_i$. Our aim is to learn a model mapping samples to labels: $f_\theta : x_i \mapsto y_i$, where $\theta$ is a set of parameters defining the mapping function $f_\theta$.
The standard approach of machine learning is to convert this learning problem into an optimization problem as a function of the parameters $\theta$. The optimal solution is found by computing the point estimate $\hat{\theta}$ of the parameters. For each input sample $x_i$ we can then compute the output as $\hat{y}_i = f_{\hat{\theta}}(x_i)$. The result (which can be interpreted probabilistically if calibrated (Shalizi, 2013)) is the output of the single model in which we invested all our trust.
2.2 Bayesian machine learning
In Bayesian machine learning we tackle the problem of supervised learning with the aim of computing a full distributional estimate of the parameters $\theta$, instead of a point estimate. In this way, for each input sample $x_i$ we can compute a distribution over the possible outputs $y_i$. This result represents the probability distribution of the output, computed considering all possible values of the parameters scaled by the trust assigned to them.
More formally, in Bayesian machine learning we estimate the posterior distribution $p(\theta \mid X, y)$ of the parameters given the data using Bayes' formula:
$$p(\theta \mid X, y) = \frac{p(y \mid X, \theta)\, p(\theta)}{p(y \mid X)},$$
where $p(\theta)$ is the prior probability distribution over the parameters, $p(y \mid X, \theta)$ is the likelihood function of the data with respect to the parameters, and $p(y \mid X) = \int p(y \mid X, \theta)\, p(\theta)\, d\theta$ is the evidence.
Computing the posterior distribution is a challenging task that requires the evaluation of the product of the likelihood function $p(y \mid X, \theta)$ and the prior distribution $p(\theta)$, and the evaluation of the evidence integral. Markov chain Monte Carlo (MCMC) algorithms are a practical solution to this problem, based on the idea of sampling from the posterior distribution (Givens and Hoeting, 2012). The main drawback of this approach is its computational scalability, as the complexity of sampling a posterior point grows linearly with the size of the data (Campbell and Broderick, 2017).
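To make the scalability issue concrete, the following minimal sketch runs a random-walk Metropolis sampler on a one-parameter logistic model and counts how many per-sample likelihood terms are evaluated. Random-walk Metropolis is used here only because it is short; it is a simpler MCMC scheme than the Hamiltonian Monte Carlo used later in this paper. The model, the standard normal prior, the step size, and the names `log_posterior` and `metropolis` are all assumptions made for illustration.

```python
import math
import random


def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def log_posterior(theta, xs, ys, counter):
    """Unnormalized log posterior: log prior plus one likelihood term per sample."""
    log_prior = -0.5 * theta * theta  # standard normal prior (assumed)
    log_lik = 0.0
    for x, y in zip(xs, ys):  # labels y are in {-1, +1}
        log_lik += log_sigmoid(y * theta * x)
        counter[0] += 1  # count per-sample likelihood evaluations
    return log_prior + log_lik


def metropolis(xs, ys, n_steps, step_size=0.5, seed=0):
    """Random-walk Metropolis; returns the chain and the number of likelihood terms evaluated."""
    rng = random.Random(seed)
    counter = [0]
    theta = 0.0
    current = log_posterior(theta, xs, ys, counter)
    chain = []
    for _ in range(n_steps):
        proposal = theta + rng.gauss(0.0, step_size)
        candidate = log_posterior(proposal, xs, ys, counter)
        # accept with probability min(1, exp(candidate - current));
        # 1 - random() lies in (0, 1], so the logarithm is always defined
        if math.log(1.0 - rng.random()) < candidate - current:
            theta, current = proposal, candidate
        chain.append(theta)
    return chain, counter[0]
```

Every proposal requires a full pass over the data, so in this sketch the number of likelihood terms evaluated is (n_steps + 1) times the number of samples: halving the data set halves the cost of every posterior draw.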
2.3 Bayesian coresets
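Evaluating the posterior distribution via MCMC sampling requires the computation of the likelihood $p(y \mid X, \theta)$. Under the assumption of independent and identically distributed data, the likelihood for the whole data set may be factorized into the product of the likelihoods of the individual data points $(x_i, y_i)$:
$$p(y \mid X, \theta) = \prod_{i=1}^{N} p(y_i \mid x_i, \theta),$$
or, equivalently, the log-likelihood may be written as the sum of the individual log-likelihoods:
$$\mathcal{L}(\theta) = \log p(y \mid X, \theta) = \sum_{i=1}^{N} \log p(y_i \mid x_i, \theta).$$
Bayesian coresets compute a small weighted subset $\tilde{X}$ of the original data $X$ such that the log-likelihood computed on $\tilde{X}$ approximates the log-likelihood computed on $X$:
$$\mathcal{L}(\theta) \approx \tilde{\mathcal{L}}(\theta) = \sum_{j=1}^{\tilde{N}} w_j \log p(\tilde{y}_j \mid \tilde{x}_j, \theta),$$
where $(\tilde{x}_j, \tilde{y}_j)$ are the $\tilde{N}$ samples belonging to the coreset and $w_j$ are the associated weights. The degree of approximation may be evaluated in terms of the distance between the original log-likelihood and the coreset log-likelihood: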
(1) 
Estimating this distance is challenging, and an approximation is offered by the algorithm of Huggins et al. (2016).
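As a rough illustration of what a weighted subset does to the log-likelihood, the sketch below compares the full-data log-likelihood with a weighted sum over a subset. The weights come from uniform random subsampling (each kept sample receives weight N/m), which is a naive baseline and not the BCH construction; the one-parameter logistic model and all function names are assumptions.

```python
import math
import random


def sample_log_lik(theta, x, y):
    """Log-likelihood of one sample under a 1-D logistic model, y in {-1, +1}."""
    z = y * theta * x
    # stable evaluation of log(sigmoid(z))
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))


def full_log_lik(theta, xs, ys):
    """Sum of the per-sample log-likelihoods over the whole data set."""
    return sum(sample_log_lik(theta, x, y) for x, y in zip(xs, ys))


def weighted_log_lik(theta, xs, ys, weights):
    """Weighted sum over samples; samples with zero weight are skipped entirely."""
    return sum(w * sample_log_lik(theta, x, y)
               for x, y, w in zip(xs, ys, weights) if w > 0)


def uniform_subsample_weights(n, m, seed=0):
    """Naive baseline (not BCH): keep m random samples, each with weight n / m."""
    rng = random.Random(seed)
    weights = [0.0] * n
    for i in rng.sample(range(n), m):
        weights[i] = n / m
    return weights


if __name__ == "__main__":
    rng = random.Random(1)
    xs = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    ys = [1 if rng.random() < 0.5 else -1 for _ in xs]
    w = uniform_subsample_weights(len(xs), 100)
    for theta in (-1.0, 0.0, 1.0):
        print(theta, full_log_lik(theta, xs, ys), weighted_log_lik(theta, xs, ys, w))
```

A coreset algorithm replaces `uniform_subsample_weights` with a procedure that chooses both the samples and the weights so that the two quantities stay close for all relevant parameter values, rather than only in expectation.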
Bayesian coresets in Hilbert spaces.
A refinement of this solution has been proposed by Campbell and Broderick (2017), with the suggestion of embedding log-likelihoods in a Hilbert function space. This reformulation has several advantages. First, by taking the objects of this space to be functions of the form $g : \Theta \to \mathbb{R}$, the per-sample log-likelihoods $\mathcal{L}_i(\theta) = \log p(y_i \mid x_i, \theta)$, weighted or unweighted, become vectors of this space; consequently, the total log-likelihood over the whole data set or the coreset can be expressed in terms of a vector sum. Second, by taking as a norm of the space a bounded sup norm $\lVert \cdot \rVert$, we can restate the constrained problem in Equation 1 as a sparse quadratic minimization problem:
$$\min_{w} \; \lVert \tilde{\mathcal{L}} - \mathcal{L} \rVert^{2}, \qquad \mathcal{L} = \sum_{i=1}^{N} \mathcal{L}_i, \quad \tilde{\mathcal{L}} = \sum_{i=1}^{N} w_i \mathcal{L}_i,$$
under the constraints:
$$w_i \geq 0 \;\; \forall i, \qquad \sum_{i=1}^{N} \mathbb{1}[w_i > 0] \leq K,$$
where $\mathbb{1}[\cdot]$ is the indicator function that returns $1$ if its argument holds or $0$ otherwise, and $K$ is the maximum number of coreset samples that we allow selecting. This formalization turns the problem of constructing a coreset into an optimization problem aimed at finding the minimal set of samples that approximates the log-likelihood on the data set. Finally, the structure of the Hilbert space allows us to exploit the directionality of the space in order to account for residual errors between $\mathcal{L}$ and $\tilde{\mathcal{L}}$ and to better select samples that would improve the approximation. Algorithms for coreset construction that exploit the properties of the Hilbert space include the coreset construction based on the Frank-Wolfe algorithm (Campbell and Broderick, 2017) and the GIGA algorithm (Campbell and Broderick, 2018).
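The objective above can be evaluated numerically once each per-sample log-likelihood function is represented by a finite vector. The sketch below does this by evaluating each log-likelihood at a handful of random parameter draws, in the spirit of the random projection mentioned for the coreset algorithms; it only evaluates the squared-norm objective and the sparsity count for a given weight vector and does not perform the optimization itself. Drawing parameters from a standard normal, the one-parameter logistic model, and the function names are assumptions.

```python
import math
import random


def sample_log_lik(theta, x, y):
    """Log-likelihood of one sample under a 1-D logistic model, y in {-1, +1}."""
    z = y * theta * x
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))


def project(xs, ys, n_dims, seed=0):
    """Represent each sample's log-likelihood function by its values at n_dims random parameter draws."""
    rng = random.Random(seed)
    thetas = [rng.gauss(0.0, 1.0) for _ in range(n_dims)]  # assumed distribution over parameters
    scale = 1.0 / math.sqrt(n_dims)
    return [[scale * sample_log_lik(t, x, y) for t in thetas] for x, y in zip(xs, ys)]


def coreset_objective(vectors, weights):
    """Squared Euclidean norm of sum_i (w_i - 1) * v_i: the projected coreset error."""
    n_dims = len(vectors[0])
    residual = [0.0] * n_dims
    for v, w in zip(vectors, weights):
        for j in range(n_dims):
            residual[j] += (w - 1.0) * v[j]
    return sum(r * r for r in residual)


def coreset_size(weights):
    """Number of samples with strictly positive weight (the sparsity constraint)."""
    return sum(1 for w in weights if w > 0)
```

With all weights equal to one the objective is exactly zero but nothing is saved; a coreset algorithm searches for a weight vector with few positive entries that keeps `coreset_objective` small.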
Model dependency of Bayesian coresets.
As underlined by Coleman et al. (2018), a coreset computed by a BCH algorithm is tightly connected to a specific family of models. Such a coreset does not constitute a generic weighted non-redundant distillation of the original data set; it is a subset of the original data optimized with respect to a specific family of models in order to produce a posterior distribution as close as possible to the one that we would learn from the original data set. In sum, a BCH coreset is actually a tuple made up of a family of models and a weighted set of samples.
2.4 Alternative approaches to BCH for Bayesian learning
BCH is just one of the possible approaches to make Bayesian machine learning feasible on large data sets. Other approaches which do not involve reducing the number of samples include variational Bayes, parallel MCMC and approximate MCMC. Variational Bayes algorithms forgo the idea of using MCMC algorithms to perform inference, and rely instead on variational approximations of the posterior (Bishop, 2006). The variational approach allows learning in the presence of large data sets, but the method does not provide guarantees on the degree of approximation of the uncertainty of the posterior (Giordano et al., 2015). Parallel MCMC algorithms rely on parallelization: large data sets are divided among multiple clusters; each cluster runs Bayesian inference via MCMC locally and produces a posterior distribution; finally, all the posteriors are aggregated by finding a unique posterior in the metric space of the posterior distributions (Neiswanger et al., 2013; Srivastava et al., 2015). The parallel approach makes it possible to deal with large data sets, but it still requires a high computational budget and does not address the problem of storing redundant data. Finally, approximate MCMC aims at speeding up existing algorithms by replacing costly transitions in the Markov chain process with approximations (Johndrow et al., 2015). Again, this approach is effective when we have to process large data sets, but it requires analyzing the execution of the MCMC algorithm, and, once again, it does not consider the problem of storing redundant data.
3 Network Security
Network security is one of the main challenges in the management of online systems. Network administrators try to monitor and prevent malicious activity through the deployment of intrusion detection systems, the collection of network traffic, and the analysis of these data (Northcutt and Novak, 2002). Processing these data in a timely manner in order to detect suspicious activity as early as possible is a crucial problem. The ease with which large amounts of data can be collected on a network poses severe scalability problems, both in terms of storage and in terms of processing (Gardiner and Nagaraja, 2016). Given this constraint, computationally cheap algorithms, such as signature matching, random forests and support vector machines, have been favored; for a review of pattern recognition and classical machine learning algorithms applied to network security, see, for instance, Gardiner and Nagaraja (2016) and Garcia-Teodoro et al. (2009).
4 Experimental Setup
In this section we provide a formal description of our study by defining the exact learning problem we considered, by presenting the data sets and the transformations we applied to them, and, finally, by discussing the models we implemented.
Problem definition
Given a large data set for network intrusion detection, we express our learning problem as a supervised learning problem in which we try to discover a function that maps network flows to an output defining whether a flow is malicious or not. More precisely, we try to infer an optimal set of parameters defining the mapping function. In a first static scenario, we process our data with and without BCH, running the simulations multiple times and observing the contribution of the BCH algorithm. In a second scenario, we simulate the progressive collection of large chunks of data. All the data samples are taken to be independent and identically distributed. In this case, we observe what the contribution of BCH would be if we were to filter out data as soon as they are collected.
Network data set.
To run our experiments we use the network traffic data collected in the CICIDS2017 data set (Sharafaldin et al., 2018). Processing real-world network data presents challenges from a privacy perspective; for this reason, the CICIDS2017 data set was collected by running a simulated network designed to behave in a realistic fashion. Five days of simulated traffic were collected; during each day, different types of attacks and malicious behaviors were enacted. Network packets are gathered and aggregated in network flows. In total, the data set contains more than 2.5 million samples. Each sample is defined by a dimensional vector reporting features such as packet flags and packet lengths. Finally, a binary label has been assigned to each network flow, denoting whether a flow is legitimate or not. We restrict our attention to the second day, Tuesday, which is made up of 445708 data samples, of which 13835 constitute instances of brute force attacks. (We did not consider the data from Monday because no malicious activity takes place on that day, so it would provide us only with positive instances.)
Network data preprocessing.
The CICIDS2017 data set contains a very limited number of samples (201) with missing values for the day of Tuesday. Given their limited number, we simply assume that they are missing completely at random (Barber, 2012) and we just drop them. Also, before each experiment we always standardize the data to zero mean and unit variance by feature. Standardization parameters are always computed exclusively on the training data being processed.
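The preprocessing described above can be sketched as follows. The point of the example is that the per-feature mean and standard deviation are estimated on the training rows only and then reused unchanged on the test rows. The function names, the list-of-lists data layout, and the fallback to a unit scale for constant features are assumptions.

```python
import math


def drop_missing(rows):
    """Remove rows that contain a missing value (None or NaN)."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))
    return [r for r in rows if not any(is_missing(v) for v in r)]


def fit_standardizer(train_rows):
    """Per-feature mean and standard deviation, computed on the training rows only."""
    n = len(train_rows)
    n_features = len(train_rows[0])
    means = [sum(r[j] for r in train_rows) / n for j in range(n_features)]
    stds = []
    for j in range(n_features):
        var = sum((r[j] - means[j]) ** 2 for r in train_rows) / n
        stds.append(math.sqrt(var) or 1.0)  # constant feature: fall back to unit scale
    return means, stds


def standardize(rows, means, stds):
    """Apply previously fitted parameters to any set of rows (training or test)."""
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]
```

Fitting on the training rows and applying the same parameters to the test rows avoids leaking test-set statistics into the model.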
Data set subsampling.
For the purpose of this preliminary study, we evaluate our algorithms only on limited subsets of the large CICIDS2017 data set. Studying smaller data sets offers some advantages: (i) it allows us to compare Bayesian learning on coresets with the "ground truth" of learning on the whole data set, which would not be feasible if we were to consider the entire CICIDS2017 data set; (ii) manipulating the data set allows us to simulate scenarios in which we receive streaming data. While we plan to extend our evaluation to the full data set in order to assess the true potential of BCH for network intrusion detection, we were still able to get useful insights by studying its application on more modest-sized subsets of the original data set. Thus, from the large original pool of data, we programmatically create subsets by random sampling. In order to preserve the imbalance between positive and negative instances in the training data, we always select ten times as many positive samples as negative samples. The prototypical training data set consists of 800 positive instances and 80 negative instances. For a test data set, instead, we select an equal number of positive and negative instances, thus simplifying the interpretation of the results. Our prototypical test data set consists of 200 positive and 200 negative samples.
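A minimal sketch of the subset construction described above, assuming the pool has already been split into positive and negative samples. The helper name `make_subset`, the integer label encoding, and the use of integer identifiers as stand-ins for network flows are assumptions; the sketch draws the training and test subsets independently and does not enforce that they are disjoint.

```python
import random


def make_subset(positives, negatives, n_pos, n_neg, rng):
    """Draw n_pos positive and n_neg negative samples without replacement, then shuffle."""
    subset = [(x, 1) for x in rng.sample(positives, n_pos)]
    subset += [(x, 0) for x in rng.sample(negatives, n_neg)]
    rng.shuffle(subset)
    return subset


if __name__ == "__main__":
    rng = random.Random(0)
    positives = list(range(10000))         # stand-ins for positive flows
    negatives = list(range(10000, 10500))  # stand-ins for negative flows
    train = make_subset(positives, negatives, 800, 80, rng)   # 10:1 imbalance
    test = make_subset(positives, negatives, 200, 200, rng)   # balanced
    print(len(train), len(test))
```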
Sample reduction techniques.
Given a training data set, in order to reduce the number of data samples, we apply BCH using the GIGA algorithm (Campbell and Broderick, 2018). This algorithm has two free hyperparameters: (i) the number of random dimensions onto which the samples are projected; and (ii) the number of computational iterations, which implicitly limits the maximum number of coreset samples that can be selected.
Network models.
For our simulations, we consider two models:
Bayesian logistic regression (BLR): a discriminative generalized-linear Bayesian machine learning algorithm (Bishop, 2006). We define our weighted BLR model as:
$$p(\theta \mid X, y) \propto p(\theta) \prod_{i=1}^{N} p(y_i \mid x_i, \theta)^{w_i},$$
where $p(\theta)$ is a Gaussian prior, and the likelihood $p(y_i \mid x_i, \theta)^{w_i}$ is the likelihood under a Bernoulli distribution scaled by the weight $w_i$ associated to the sample $x_i$. If the data samples are not weighted, then $w_i = 1$ for all $i$, and the model reduces to a standard BLR. Notice that, in general, if all the samples are scaled by a constant value the inference process will not change (we take advantage of this constant scaling to prevent overflow errors in the simulations).
Given a new sample $x^{*}$, its probability of belonging to a class $y^{*}$ can be obtained by integrating over all the models under the posterior:
$$p(y^{*} \mid x^{*}, X, y) = \int p(y^{*} \mid x^{*}, \theta)\, p(\theta \mid X, y)\, d\theta.$$
Support vector machine (SVM): as a baseline and comparison, we consider the support vector machine, a linear maximum-margin discriminator (Cortes and Vapnik, 1995). We train an SVM model to find the slope $\beta$ of a discriminating hyperplane between samples. (For a fair comparison with the BLR model, we compute only the slope of the discriminating hyperplane and not its intercept.) Given a new sample $x^{*}$, its class is computed as $y^{*} = \operatorname{sgn}(\beta^{\top} x^{*})$, where $\operatorname{sgn}$ is the sign function, returning $+1$ for a positive argument and $-1$ otherwise.
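The two prediction rules described above can be sketched in a few lines. The example treats the per-sample weight as a multiplier of the Bernoulli log-likelihood, approximates the predictive integral by averaging over posterior parameter draws, and applies a linear decision rule without intercept for the SVM. The function names, the unit prior scale, the clipping constant, and the handling of a zero inner product are assumptions; the log joint is only defined up to an additive constant.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def weighted_log_joint(theta, xs, ys, weights, prior_std=1.0):
    """Gaussian log prior plus weighted Bernoulli log-likelihood (labels in {0, 1}, no intercept)."""
    log_prior = -0.5 * sum(t * t for t in theta) / prior_std ** 2
    log_lik = 0.0
    for x, y, w in zip(xs, ys, weights):
        p = sigmoid(dot(theta, x))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard the logarithms
        log_lik += w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return log_prior + log_lik


def predictive_probability(x_new, posterior_samples):
    """Monte Carlo estimate of p(y = 1 | x_new): average over posterior parameter draws."""
    total = sum(sigmoid(dot(theta, x_new)) for theta in posterior_samples)
    return total / len(posterior_samples)


def svm_predict(x_new, slope):
    """Linear decision rule without intercept: +1 for a positive inner product, -1 otherwise."""
    return 1 if dot(slope, x_new) > 0 else -1
```

In this formulation a sample with weight 2 contributes exactly as much as the same sample appearing twice with weight 1, which is what allows a small weighted coreset to stand in for a larger data set.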
5 Simulation 1: BCH Applied to Network Intrusion Detection Data
In this simulation we analyze the use of BCH applied to network intrusion detection data. We evaluate the contribution that coresets provide both in terms of the performance they achieve and in terms of the time and space they save. We compare these results to the baseline offered by SVM and by a BLR computed over the whole data set.
Protocol.
We generate five pairs of training and test data sets using the methodology presented in Section 4.
We apply BCH to each training data set . We set the free hyperparameters of BCH as follows: (i) we fix the number of random dimension to , following the experimental evaluation in Campbell and Broderick (2017); (ii) we consider three values for the number of computational iteration , that is , following again the evaluation in Campbell and Broderick (2017), , and an aggressively lower value of , which is expected to guarantee a higher saving in terms of space and time.
For each subset, we train and test an SVM model, a BLR model trained on the whole training data set, and a BLR model trained on the coreset computed from the training data. (We do not train the SVM model on the coresets because, as discussed in Section 2.3, coresets are not generic non-redundant sub-data sets, but sub-selections optimized for a specific statistical model.) The SVM model is trained with default parameters from the scikit-learn library (https://scikit-learn.org/stable/). The BLR models are trained using the Hamiltonian Monte Carlo algorithm offered in the Edward library with the following settings: sampling 10000 points, using a burn-in period of half of the samples, thinning every second sample, and adjusting the step size manually to guarantee an acceptance rate around 0.8. When doing prediction, we use 1000 posterior samples. We repeat each training and testing 10 times and we average the results.
We evaluate the results in terms of classification accuracy and wall-clock time required for the training of the model (all the models are run on a non-dedicated mid-range laptop machine with no GPU support).
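The chain post-processing mentioned in the protocol (burn-in of half of the samples, thinning every second sample) amounts to simple slicing. The sketch below operates on a plain Python list and is independent of how Edward handles these steps internally; the function names are assumptions.

```python
def postprocess_chain(chain, burn_in_fraction=0.5, thin=2):
    """Discard the burn-in prefix, then keep every `thin`-th draw of the remainder."""
    start = int(len(chain) * burn_in_fraction)
    return chain[start::thin]


def acceptance_rate(accepted_flags):
    """Fraction of proposals accepted; the protocol tunes the step size so this is near 0.8."""
    return sum(accepted_flags) / len(accepted_flags)
```

With 10000 sampled points, discarding the first 5000 and keeping every second draw leaves 2500 retained posterior draws.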
Results.
First of all, processing the data via BCH with different hyperparameters produced different coresets. Table 1 reports the number of data points selected, specifying the number of samples in the minority class that have been preserved, and the wall-clock computation time as a function of the number of iterations. Notice that the number of iterations does not correspond to the number of coreset samples selected; in all the cases the algorithm selects a number of samples well below this threshold. With respect to the original number of samples, the number of data points selected by BCH ranges from around one tenth, when using a low number of iterations, to around one third, when using a high number of iterations.
Hyperparameter  Time(s)  

84/1  82/3  82/1  88/1  87/3  
184/4  191/10  186/7  189/9  200/10  
259/5  270/15  239/4  252/8  252/6 
Table 1. The table reports the number of data points selected to the left of the slash (/), and the number of these points belonging to the minority class to the right of the slash (/). The last column reports the average and standard deviation of the wall-clock time to compute the coresets.
Figure 1 shows the accuracy of our models on the different data sets we considered. Consistent with our expectations, the two linear models, SVM and BLR on the whole data set, perform similarly; BLR models trained on coresets show, in general, decreasing performance as we decrease the number of iterations of the BCH algorithm.
Figure 2 compares the wallclock time of each algorithm. The highly optimized SVM algorithm shows some variability, but in general terminates in tenths of seconds. On the other hand, BLR takes up to two orders of magnitude longer. BLR on coresets is faster, even if on the same time scale; surprisingly using coresets with took the shortest time, which may be due to the particularly good subselection of points, or, more likely, to other contingent processes running on the same machine.
Discussion.
These basic experiments highlight that the time-space savings offered by BCH inevitably come at the cost of the accuracy of the final model. The number of iterations provides a key hyperparameter to manage such a trade-off, as it exchanges the size of the data set for the accuracy of the model.
Notice that from these experiments, the time saving offered by coresets does not appear particularly remarkable. It is worth underlining, though, that such an improvement is relevant given the small data sets we are processing. The time savings when using larger data sets are discussed in detail in Campbell and Broderick (2017).
Interestingly, the computation of coresets has a subsampling effect with respect to the minority class, as shown in Table 1: while in the original data set the ratio between the two classes was set to 1:10, this ratio has markedly decreased. This may seem undesirable if we were expecting BCH to produce a more balanced data set; in reality, though, the algorithm selects only samples useful for a proper reconstruction of the likelihood function, and the result seems to suggest that the instances of malicious behaviors may actually be quite redundant, probably due to the fact that we are considering only one specific form of attack (brute force).
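The subsampling effect discussed above can be checked directly from the counts in Table 1. The short script below pools the five data sets in each table row and computes the fraction of the original training data that was retained and the share of minority-class points among the retained samples. The counts are copied from Table 1 and the training-set sizes from Section 4; the function and variable names are assumptions.

```python
# (selected points, minority-class points) per data set, one list per row of Table 1
TABLE_1 = [
    [(84, 1), (82, 3), (82, 1), (88, 1), (87, 3)],
    [(184, 4), (191, 10), (186, 7), (189, 9), (200, 10)],
    [(259, 5), (270, 15), (239, 4), (252, 8), (252, 6)],
]
TRAIN_SIZE = 880      # 800 positive + 80 negative instances per training set
TRAIN_MINORITY = 80   # minority-class instances per training set


def summarize(row):
    """Return (fraction of training data retained, minority share among retained points)."""
    selected = sum(s for s, _ in row)
    minority = sum(m for _, m in row)
    return selected / (len(row) * TRAIN_SIZE), minority / selected


if __name__ == "__main__":
    print("original minority share: %.3f" % (TRAIN_MINORITY / TRAIN_SIZE))
    for row in TABLE_1:
        retained, minority_share = summarize(row)
        print("retained %.3f, minority share %.3f" % (retained, minority_share))
```

Pooled over the five data sets, the three rows retain about 9.6%, 21.6% and 28.9% of the training data, with minority shares of about 2.1%, 4.2% and 3.0%, against 9.1% in the original training sets.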
6 Simulation 2: BCH in a Streaming Environment
In this simulation we try to set up a more interesting and realistic scenario. We simulate the collection of batches of data in real time and we learn from the cumulative set of collected samples. The aim is to evaluate how the learning process would be affected if the sets of collected data were to be downsized using BCH before being processed and stored. Such a scenario seems particularly interesting because BCH would immediately discard redundant data, thus solving at once the problem of making Bayesian inference feasible and reducing the required amount of memory and storage space.
Protocol.
As before, we generate five pairs of training and test data sets. Now, however, instead of processing each data set independently, we simulate the arrival of one training data set at each of five successive time steps. At each time step, we want to learn from all the data sets collected so far, and so we pool together all the data received up to and including that time step. Notice that all the samples are independent and identically distributed.
When using coresets, we apply BCH to each training data set as soon as it is collected. At each time step, instead of aggregating together all the previously collected data, we just aggregate the coresets. This operation is theoretically justified by the possibility of aggregating coresets computed in parallel (Campbell and Broderick, 2017). We use the same hyperparameters for BCH used in the previous simulation.
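The streaming protocol above can be sketched as a loop that reduces each incoming batch to a weighted coreset and pools only the coresets. The coreset builder is passed in as a function; `first_k_coreset` is a trivial placeholder used to keep the example self-contained and is not BCH. All names and the triple-based data layout are assumptions.

```python
def aggregate_coresets(coresets):
    """Concatenate weighted coresets; each is a list of (sample, label, weight) triples."""
    pooled = []
    for coreset in coresets:
        pooled.extend(coreset)
    return pooled


def stream(batches, build_coreset):
    """At each time step, reduce the new batch to a coreset and pool it with the previous ones."""
    kept = []
    for t, batch in enumerate(batches, start=1):
        kept.append(build_coreset(batch))  # the raw batch can be discarded after this point
        yield t, aggregate_coresets(kept)


def first_k_coreset(batch, k=2):
    """Placeholder reducer (not BCH): keep the first k samples, each with weight len(batch) / k."""
    weight = len(batch) / k
    return [(x, y, weight) for x, y in batch[:k]]
```

The model trained at each time step sees only the pooled coresets, so both the stored data and the training cost grow with the coreset sizes rather than with the raw batch sizes.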
We also run the same models as before, and we repeat each simulation 10 times.
Results.
We work with the same coresets computed in the previous simulation and we refer the reader back to Table 1 for their details.
Figure 3 shows the accuracy of the models on the different data sets we generated. The starting values of accuracy computed on a single data set () are consistent with the values computed in the previous experiments and shown in Figure 1. When we start aggregating more data sets, we notice that the performance of SVM and HMC on the whole data set is only slightly improved; on the other hand, the performance of BLR on coresets shows a consistent improvement. Even aggregating only two coresets () the performance gap between BLR on coresets and SVM or BLR on the whole data set is significantly reduced. This improvement is expected when aggregating two coresets computed with the hyperparameter ; in this case, the final amount of selected data points would be close to the amount obtained computing a single coreset with the hyperparameter ; and we know from the previous simulations that the performance of BLR trained on a coreset computed with the hyperparameter is very close to the performance of BLR on the whole data set. More surprising is the improvement registered by aggregating only two coresets computed with hyperparameter .
Figure 2 compares the wall-clock time of each algorithm. Again, the time scales of the two families of algorithms, SVM and BLR, are very different. However, notice that while the computational time for SVM and BLR on the whole data sets tends to grow in a linear fashion, the growth in the required computational time when running BLR on coresets is almost flat.
Discussion.
This simulation showed the potential advantages that could be obtained by deploying BCH in a streaming scenario. In such a setting, the aggregation of two or more coresets can provide a performance very close to that of SVM or BLR trained on the whole data. One of the most significant advantages, though, is that BCH reduces the amount of data in real time before learning, thus limiting the amount of memory necessary for processing; guaranteeing a sublinear growth in the time required for learning as more data are gathered may prove especially advantageous when data are collected in real-world environments in which batches of data are generated continuously over multiple time steps.
7 Conclusion and Future Work
This preliminary study showed the feasibility of applying BCH to network data. Network intrusion detection could benefit greatly from employing fully probabilistic descriptions of network traffic, and BCH may prove to be an enabler for such an approach. Moreover, we showed that the same algorithm is also effective in reducing the number of samples to be stored; this issue is particularly relevant when collecting network packets, as the amount of data may quickly grow and cause severe challenges in their management.
Our experiments first confirmed the concrete trade-off between model accuracy and time-space savings when adopting BCH. More interestingly, we also investigated how BCH may be deployed in a dynamic streaming environment in which data samples would be filtered in real time before processing. This last scenario returned particularly good and interesting results, showing that BCH can be effectively used to sub-select relevant data samples at different time steps and aggregate together only the coresets. We demonstrated that in a streaming scenario the use of BCH may guarantee better scalability by ensuring that the computational time for learning grows in a strongly sublinear fashion.
Of course this study is just a preliminary evaluation of the potential of BCH applied to the challenging problem of processing large data sets for network security. Further investigation is clearly necessary to assess more precisely the role that BCH may serve. Immediate directions of further study that we consider are the following: applying our protocol to bigger and more realistic data sets; including other types of attacks; comparing BCH to other data reduction techniques, such as random sampling or k-means. More interesting questions also concern the recursive application of BCH in a streaming scenario and its effectiveness when used to process streaming data that no longer conform to the assumption of independent and identically distributed data.
References
 Barber (2012) David Barber. Bayesian reasoning and machine learning. Cambridge University Press, 2012.
 Bishop (2006) Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006.
 Campbell and Broderick (2017) Trevor Campbell and Tamara Broderick. Automated scalable bayesian inference via hilbert coresets. arXiv preprint arXiv:1710.05053, 2017.
 Campbell and Broderick (2018) Trevor Campbell and Tamara Broderick. Bayesian coreset construction via greedy iterative geodesic ascent. arXiv preprint arXiv:1802.01737, 2018.
 Coleman et al. (2018) Cody Coleman, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, and Matei Zaharia. Select via proxy: Efficient data selection for training deep networks. 2018.
 Cortes and Vapnik (1995) Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.
 Garcia-Teodoro et al. (2009) Pedro Garcia-Teodoro, Jesus Diaz-Verdejo, Gabriel Maciá-Fernández, and Enrique Vázquez. Anomaly-based network intrusion detection: Techniques, systems and challenges. Computers & Security, 28(1-2):18–28, 2009.
 Gardiner and Nagaraja (2016) Joseph Gardiner and Shishir Nagaraja. On the security of machine learning in malware c&c detection: A survey. ACM Computing Surveys (CSUR), 49(3):59, 2016.
 Giordano et al. (2015) Ryan J Giordano, Tamara Broderick, and Michael I Jordan. Linear response methods for accurate covariance estimates from mean field variational bayes. In Advances in Neural Information Processing Systems, pages 1441–1449, 2015.
 Givens and Hoeting (2012) Geof H Givens and Jennifer A Hoeting. Computational statistics, volume 710. John Wiley & Sons, 2012.
 Huggins et al. (2016) Jonathan Huggins, Trevor Campbell, and Tamara Broderick. Coresets for scalable bayesian logistic regression. In Advances in Neural Information Processing Systems, pages 4080–4088, 2016.
 Johndrow et al. (2015) James E Johndrow, Jonathan C Mattingly, Sayan Mukherjee, and David Dunson. Approximations of markov chains and high-dimensional bayesian inference. arXiv preprint arXiv:1508.03387, 2015.
 LeCun et al. (2015) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
 Neiswanger et al. (2013) Willie Neiswanger, Chong Wang, and Eric Xing. Asymptotically exact, embarrassingly parallel mcmc. arXiv preprint arXiv:1311.4780, 2013.
 Northcutt and Novak (2002) Stephen Northcutt and Judy Novak. Network intrusion detection. Sams Publishing, 2002.
 Shalizi (2013) Cosma Shalizi. Advanced data analysis from an elementary point of view, 2013.
 Sharafaldin et al. (2018) Iman Sharafaldin, Arash Habibi Lashkari, and Ali A Ghorbani. Toward generating a new intrusion detection dataset and intrusion traffic characterization. In ICISSP, pages 108–116, 2018.
 Srivastava et al. (2015) Sanvesh Srivastava, Volkan Cevher, Quoc Dinh, and David Dunson. Wasp: Scalable bayes via barycenters of subset posteriors. In Artificial Intelligence and Statistics, pages 912–920, 2015.
 Tran et al. (2016) Dustin Tran, Alp Kucukelbir, Adji B. Dieng, Maja Rudolph, Dawen Liang, and David M. Blei. Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787, 2016.