1 Introduction
Artificial intelligence and machine learning have achieved considerable successes in recent years, and an ever-growing number of disciplines rely on them. Data is now ubiquitous, and there is great value in understanding the data, e.g., by building probabilistic graphical models to elucidate the relationships between variables. In the big data era, however, scalability has become crucial for any useful machine learning approach. In this paper, we consider the problem of training graphical models, in particular Dependency Networks Heckerman et al. (2000), on massive data sets. They are cyclic directed graphical models, where the parents of each variable are its Markov blanket, and have proven successful in various tasks, such as collaborative filtering Heckerman et al. (2000), phylogenetic analysis Carlson et al. (2008), genetic analysis Dobra (2009); Phatak et al. (2010), network inference from sequencing data Allen and Liu (2013), and traffic as well as topic modeling Hadiji et al. (2015).
Specifically, we show that Dependency Networks over Gaussians, arguably one of the most prominent types of distributions in statistical machine learning, admit coresets of size independent of the size of the data set. Coresets are weighted subsets of the data, which guarantee that models fitting them will also provide a good fit for the original data set, and have been studied before for clustering Badoiu et al. (2002); Feldman et al. (2011, 2013); Lucic et al. (2016), classification Har-Peled et al. (2007); Har-Peled (2015); Reddi et al. (2015), regression Drineas et al. (2006, 2008); Dasgupta et al. (2009); Geppert et al. (2017), and the smallest enclosing ball problem Badoiu and Clarkson (2003, 2008); Feldman et al. (2014); Agarwal and Sharathkumar (2015); we refer to Phillips (2017) for a recent extensive literature overview. Our contribution continues this line of research and generalizes the use of coresets to probabilistic graphical modeling.
Unfortunately, this coreset result does not extend to Dependency Networks over members of the exponential family in general. We prove that Dependency Networks over Poisson random variables Allen and Liu (2013); Hadiji et al. (2015) do not admit (sublinear size) coresets: every single input point is important for the model and needs to appear in the coreset. This is an important negative result, since count data, the primary target of Poisson distributions, is at the center of many scientific endeavors, from citation counts to web page hit counts, from counts of procedures in medicine to the count of births and deaths in a census, from counts of words in a document to the count of gamma rays in physics. Here, modeling one event such as the number of times a certain lab test yields a particular result can provide an idea of the number of potentially invasive procedures that need to be performed on a patient. Thus, elucidating the relationships between variables can yield great insights into massive count data. Therefore, despite our worst-case result, we will provide an argument why our coreset construction for Dependency Networks can still work well in practice on count data. To corroborate our theoretical results, we empirically evaluated the resulting Core Dependency Networks (CDNs) on several real data sets. The results demonstrate significant gains over no or naive subsampling, even for count data.
We proceed as follows. We review Dependency Networks (DNs), prove that Gaussian DNs admit sublinear size coresets, and discuss the possibility of generalizing this result to count data. Before concluding, we illustrate our theoretical results empirically.
2 Dependency Networks
Most of the existing AI and machine learning literature on graphical models is dedicated to binary, multinomial, or certain classes of continuous (e.g. Gaussian) random variables. Undirected models, aka Markov Random Fields (MRFs), such as Ising (binary random variables) and Potts (multinomial random variables) models have found many applications in various fields such as robotics, computer vision and statistical physics, among others. Whereas MRFs allow for cycles in the structures, directed models, aka Bayesian Networks (BNs), require acyclic directed relationships among the random variables.
Dependency Networks (DNs), the focus of the present paper, combine concepts from directed and undirected worlds and are due to Heckerman et al. (2000). Specifically, like BNs, DNs have directed arcs but they allow for networks with cycles and bidirectional arcs, akin to MRFs. This makes DNs quite appealing for many applications because we can build multivariate models from univariate distributions Allen and Liu (2013); Yang et al. (2015); Hadiji et al. (2015), while still permitting efficient structure learning using local estimators or gradient tree boosting. Generally, if the data are fully observed, learning is done locally on the level of the conditional probability distributions for each variable, mixing directed and undirected as needed. Based on these local distributions, samples from the joint distribution are obtained via Gibbs sampling. Indeed, Gibbs sampling neglects the question of a consistent joint probability distribution and instead makes use only of the local distributions. The generated samples, however, are often sufficient to answer many probability queries.
Formally, let $X = (X^{(1)}, \ldots, X^{(d)})$ denote a random vector and $x$ its instantiation. A Dependency Network (DN) on $X$ is a pair $(G, \Psi)$ where $G = (V, E)$ is a directed, possibly cyclic, graph where each node in $V = [d] = \{1, \ldots, d\}$ corresponds to the random variable $X^{(i)}$. In the set of directed edges $E \subseteq V \times V \setminus \{(i,i) \mid i \in [d]\}$, each edge models a dependency between variables, i.e., if there is no edge between $i$ and $j$ then the variables $X^{(i)}$ and $X^{(j)}$ are conditionally independent given the other variables $X^{\setminus i,j}$ indexed by $[d] \setminus \{i,j\}$ in the network. We refer to the nodes that have an edge pointing to $X^{(i)}$ as its parents, denoted by $\mathrm{pa}_i = \{X^{(j)} \mid (j,i) \in E\}$. $\Psi = \{p_i \mid i \in [d]\}$ is a set of conditional probability distributions associated with each variable $X^{(i)} \in X$, where
$$p_i = p(X^{(i)} \mid \mathrm{pa}_i) = p(X^{(i)} \mid X^{\setminus i}).$$
As an example of such a local model, consider Poisson conditional probability distributions as illustrated in Fig. 1 (left):
$$p(X^{(i)} = x^{(i)} \mid \mathrm{pa}_i) = \frac{\lambda_i(\mathrm{pa}_i)^{x^{(i)}}}{x^{(i)}!} \, e^{-\lambda_i(\mathrm{pa}_i)}.$$
Here, $\lambda_i(\mathrm{pa}_i)$ highlights the fact that the mean can have a functional form that is dependent on $X^{(i)}$'s parents. Often, we will refer to it simply as $\lambda_i$. The construction of the local conditional probability distribution is similar to the (multinomial) Bayesian network case. However, in the case of DNs, the graph is not necessarily acyclic and $p(X^{(i)} \mid \mathrm{pa}_i)$ typically has an infinite range, and hence cannot be represented using a finite table of probability values. Finally, the full joint distribution is simply defined as the product of local distributions:
$$p(x) = \prod_{i \in [d]} p(x^{(i)} \mid \mathrm{pa}_i),$$
also called pseudo likelihood. For the Poisson case, this reads
$$p(x) = \prod_{i \in [d]} \frac{\lambda_i^{x^{(i)}}}{x^{(i)}!} \, e^{-\lambda_i}.$$
Note, however, that doing so does not guarantee the existence of a consistent joint distribution, i.e., a joint distribution of which they are the conditionals. Bengio et al. (2014), however, have recently proven the existence of a consistent distribution per given evidence, which does not have to be known in closed form, as long as an unordered Gibbs sampler converges.
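As a minimal sketch of the Poisson pseudo likelihood discussed above, the following Python snippet evaluates the log pseudo likelihood of a single instantiation. The log-linear form of the local means, the weight matrix `W`, the intercepts `b`, and the function name are assumptions made for illustration only; the text merely requires that each mean is some function of the parents.

```python
import math
import numpy as np

def poisson_dn_log_pseudo_likelihood(x, W, b):
    """Log pseudo likelihood of one instantiation x under a Poisson DN.

    Assumption (illustrative): log-linear local means
    lambda_i = exp(b[i] + sum_{j != i} W[i, j] * x[j]).
    A zero entry W[i, j] means that j is not a parent of i.
    """
    x = np.asarray(x, dtype=float)
    b = np.asarray(b, dtype=float)
    W = np.array(W, dtype=float)   # copy, so the caller's matrix is untouched
    np.fill_diagonal(W, 0.0)       # a variable is never its own parent
    log_lam = b + W @ x            # logarithm of each local mean
    lam = np.exp(log_lam)
    # sum_i [ x_i * log(lambda_i) - lambda_i - log(x_i!) ]
    log_fact = np.array([math.lgamma(v + 1.0) for v in x])
    return float(np.sum(x * log_lam - lam - log_fact))

# Three count variables with the cyclic structure 0 -> 1 -> 2 -> 0.
W = np.array([[0.0, 0.0, 0.1],
              [0.2, 0.0, 0.0],
              [0.0, -0.1, 0.0]])
b = np.zeros(3)
print(poisson_dn_log_pseudo_likelihood([1, 2, 0], W, b))
```

The cyclic structure in the example is deliberate: nothing in the product of local conditionals requires the graph to be acyclic, which is exactly what distinguishes DNs from Bayesian networks.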
3 Core Dependency Networks
As argued, learning Dependency Networks (DNs) amounts to determining the conditional probability distributions from a given set of $n$ training instances $x_i \in \mathbb{R}^d$ representing the rows of the data matrix $X \in \mathbb{R}^{n \times d}$ over $d$ variables. Assuming that $p(X^{(i)} \mid \mathrm{pa}_i)$ is parametrized as a generalized linear model (GLM) McCullagh and Nelder (1989), this amounts to estimating the parameters $\gamma^{(i)}$ of the GLM associated with each variable $X^{(i)}$, since this completely determines the local distributions, but $p(X^{(i)} \mid \mathrm{pa}_i)$ will possibly depend on all other variables in the network, and these dependencies define the structure of the network. This view of training DNs as fitting GLMs to the data allows us to develop Core Dependency Networks (CDNs): Sample a coreset and train a DN over certain members of the GLM family on the sampled coreset.
A coreset is a (possibly) weighted and usually considerably smaller subset of the input data that approximates a given objective function for all candidate solutions:
Definition ($\varepsilon$-coreset). Let $X$ be a set of points from a universe $\mathcal{U}$ and let $\Gamma$ be a set of candidate solutions. Let $f : \mathcal{U} \times \Gamma \to \mathbb{R}_{\geq 0}$ be a non-negative measurable function. Then a set $C \subset X$ is an $\varepsilon$-coreset of $X$ for $f$, if
$$\forall \gamma \in \Gamma : \; |f(X, \gamma) - f(C, \gamma)| \leq \varepsilon \cdot f(X, \gamma).$$
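To make the coreset definition concrete, the following sketch checks the defining inequality empirically for a weighted subset and a sum-of-squares loss. It can only test finitely many candidate solutions, whereas the definition quantifies over all of them. The function names, the uniform sampling scheme, and the choice of loss are illustrative assumptions, not the construction developed in this paper.

```python
import numpy as np

def weighted_sq_loss(X, y, gamma, w=None):
    """Weighted sum of squared residuals sum_i w_i * (x_i gamma - y_i)^2."""
    r = X @ gamma - y
    w = np.ones(len(y)) if w is None else w
    return float(np.sum(w * r ** 2))

def max_relative_error(X, y, idx, w, candidates):
    """Largest |f(X, g) - f(C, g)| / f(X, g) over the given candidates g."""
    worst = 0.0
    for g in candidates:
        full = weighted_sq_loss(X, y, g)
        core = weighted_sq_loss(X[idx], y[idx], g, w)
        worst = max(worst, abs(full - core) / full)
    return worst

rng = np.random.default_rng(0)
n, d, m = 2000, 3, 400
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

# Candidate coreset: a uniform sample of m points, each reweighted by n / m.
idx = rng.choice(n, size=m, replace=False)
w = np.full(m, n / m)
candidates = [rng.normal(size=d) for _ in range(50)]
print(max_relative_error(X, y, idx, w, candidates))
```

If the printed value is at most $\varepsilon$, the subset behaves like an $\varepsilon$-coreset on the tested candidates; a proof is still needed to cover all of $\Gamma$.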
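We now introduce the formal framework that we need towards the design of coresets for learning dependency networks. A very useful structural property for $\ell_2$-based objective (or loss) functions is the concept of an $\varepsilon$-subspace embedding.
Definition ($\varepsilon$-subspace embedding). An $\varepsilon$-subspace embedding for the columnspace of $X$ is a matrix $S$ such that
$$\forall \gamma \in \mathbb{R}^d : \; (1 - \varepsilon)\|X\gamma\|^2 \leq \|SX\gamma\|^2 \leq (1 + \varepsilon)\|X\gamma\|^2.$$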
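We can construct a sampling matrix $S$ which forms an $\varepsilon$-subspace embedding with constant probability in the following way: Let $U$ be any orthonormal basis for the columnspace of $X$. This basis can be obtained from the singular value decomposition (SVD) $X = U\Sigma V^T$ of the data matrix. Now let $\rho = \mathrm{rank}(U) = \mathrm{rank}(X)$ and define the leverage scores $l_i = \|U_{i*}\|^2 / \|U\|_F^2 = \|U_{i*}\|^2 / \rho$ for $i \in [n]$. Now we fix a sampling size parameter $k = O(\rho \log(\rho/\varepsilon)/\varepsilon^2)$, sample the input points one-by-one with probability $q_i = \min\{1, k \cdot l_i\}$ and reweight their contribution to the loss function by $w_i = 1/q_i$. Note that, for the sum of squares loss, this corresponds to defining a diagonal (sampling) matrix $S$ by $S_{ii} = 1/\sqrt{q_i}$ with probability $q_i$ and $S_{ii} = 0$ otherwise. Also note that the expected number of samples is $k = O(\rho \log(\rho/\varepsilon)/\varepsilon^2)$, which also holds with constant probability by Markov's inequality. Moreover, to give an intuition why this works, note that for any fixed $\gamma \in \mathbb{R}^d$, we have
$$E\left[\|SX\gamma\|^2\right] = \sum_{i=1}^n \left(\frac{x_i\gamma}{\sqrt{q_i}}\right)^2 q_i = \sum_{i=1}^n (x_i\gamma)^2 = \|X\gamma\|^2.$$
The significantly stronger property of forming an $\varepsilon$-subspace embedding, according to Definition 3, follows from a matrix approximation bound given in Rudelson and Vershynin (2007); Drineas et al. (2008).
Let $X$ be an input matrix with $\mathrm{rank}(X) = \rho$. Let $S$ be a sampling matrix constructed as stated above with sampling size parameter $k = O(\rho \log(\rho/\varepsilon)/\varepsilon^2)$. Then $S$ forms an $\varepsilon$-subspace embedding for the columnspace of $X$ with constant probability.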
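Proof. Let $X = U\Sigma V^T$ be the SVD of $X$. By Theorem 7 in Drineas et al. (2008) there exists an absolute constant $C > 1$ such that
$$E\left[\|U^T S^T S U - U^T U\|\right] \leq C \sqrt{\frac{\log k}{k}} \, \|U\|_F \, \|U\| = C \sqrt{\frac{\rho \log k}{k}} \leq \varepsilon,$$
where we used the fact that $\|U\|_F = \sqrt{\rho}$ and $\|U\| = 1$ by orthonormality of $U$. The last inequality holds by choice of $k = D\rho\log(\rho/\varepsilon)/\varepsilon^2$ for a large enough absolute constant $D > 1$.
By an application of Markov's inequality and rescaling $\varepsilon$, we can assume with constant probability
$$\|U^T S^T S U - U^T U\| \leq \varepsilon. \qquad (1)$$
We show that this implies the $\varepsilon$-subspace embedding property. To this end, fix $\gamma \in \mathbb{R}^d$. Then
$$\left| \|SX\gamma\|^2 - \|X\gamma\|^2 \right| = \left| \gamma^T X^T S^T S X \gamma - \gamma^T X^T X \gamma \right| = \left| \gamma^T V \Sigma \left(U^T S^T S U - U^T U\right) \Sigma V^T \gamma \right| \leq \|U^T S^T S U - U^T U\| \cdot \|\Sigma V^T \gamma\|^2 = \|U^T S^T S U - U^T U\| \cdot \|U \Sigma V^T \gamma\|^2 \leq \varepsilon \|X\gamma\|^2.$$
The first inequality follows by submultiplicativity, and the subsequent equality from rotational invariance of the spectral norm. Finally we conclude the proof by Inequality (1).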
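The sampling construction analyzed above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function name, the numerical rank tolerance, and the concrete value of `k` are assumptions, and in practice `k` would be chosen according to the bound on the sampling size parameter.

```python
import numpy as np

def leverage_score_sample(X, k, rng):
    """Sample rows of X by leverage scores; return (indices, weights).

    Row i is kept independently with probability q_i = min(1, k * l_i),
    where l_i = ||U_i||^2 / rank(X), and receives the weight 1 / q_i.
    """
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    # Numerical rank; the tolerance is an illustrative choice.
    rank = int(np.sum(s > s[0] * max(X.shape) * np.finfo(float).eps))
    U = U[:, :rank]                        # orthonormal basis of the columnspace
    lev = np.sum(U ** 2, axis=1) / rank    # normalized scores, summing to 1
    q = np.minimum(1.0, k * lev)
    idx = np.flatnonzero(rng.random(X.shape[0]) < q)
    return idx, 1.0 / q[idx]

rng = np.random.default_rng(1)
n, d, k = 5000, 5, 500
X = rng.normal(size=(n, d))
idx, w = leverage_score_sample(X, k, rng)
SX = np.sqrt(w)[:, None] * X[idx]          # rows of S X with zero rows dropped

# Compare ||S X g||^2 with ||X g||^2 for a few random directions g.
for _ in range(3):
    g = rng.normal(size=d)
    print(np.sum((SX @ g) ** 2) / np.sum((X @ g) ** 2))
```

Ratios close to 1 for every direction correspond to the subspace embedding property; the loop only probes a handful of directions and therefore illustrates, but does not verify, the guarantee.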
The question arises whether we can do better than $O(\rho \log(\rho/\varepsilon)/\varepsilon^2)$. One can show by reduction from the coupon collector's theorem that there is a lower bound of $\Omega(\rho \log \rho)$ matching the upper bound up to its dependency on $\varepsilon$. The hard instance is a $d^m \times d$, $m \in \mathbb{N}$, orthonormal matrix in which the scaled canonical basis $I_d / \sqrt{d^{m-1}}$ is stacked $d^{m-1}$ times. The leverage scores are all equal to $1/d^m$, implying a uniform sampling distribution with probability $1/d$ for each basis vector. Any rank $\rho = d$ preserving sample must comprise at least one of them. This is exactly the coupon collector's theorem with $d$ coupons, which has a lower bound of $\Omega(d \log d)$ Motwani and Raghavan (1995). The fact that the sampling is without replacement does not change this, since the reduction holds for arbitrarily large $m$, creating sufficiently many copies of each element to simulate the sampling with replacement Tropp (2011).
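The coupon collector argument can be illustrated with a small simulation: drawing uniformly with replacement from a fixed number of distinct basis vectors, how many draws are needed until each one has appeared at least once? The number of coupons, the number of trials, and the function name below are arbitrary choices for illustration.

```python
import numpy as np

def draws_until_all_seen(d, rng):
    """Number of uniform draws (with replacement) until all d coupons appear."""
    seen = np.zeros(d, dtype=bool)
    draws = 0
    while not seen.all():
        seen[rng.integers(d)] = True
        draws += 1
    return draws

rng = np.random.default_rng(2)
d = 50
trials = [draws_until_all_seen(d, rng) for _ in range(200)]
harmonic = sum(1.0 / j for j in range(1, d + 1))
print("empirical mean:", np.mean(trials))
print("d * H_d       :", d * harmonic)   # exact expectation, roughly d * ln(d)
```

The empirical mean tracks $d \cdot H_d$, where $H_d$ is the $d$-th harmonic number, which grows like $d \ln d$ and mirrors the logarithmic factor in the lower bound.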
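Now we know that with constant probability over the randomness of the construction algorithm, $S$ satisfies the $\varepsilon$-subspace embedding property for a given input matrix $X$. This is the structural key property to show that actually $SX$ is a coreset for Gaussian linear regression models and dependency networks. Consider $(G, \Psi)$, a Gaussian dependency network (GDN), i.e., a collection of Gaussian linear regression models
$$\Psi = \left\{ p_i(X^{(i)} \mid X^{\setminus i}, \gamma^{(i)}) = \mathcal{N}(X^{\setminus i}\gamma^{(i)}, \sigma^2) \;\middle|\; i \in [d] \right\}$$
on an arbitrary digraph structure $G$ Heckerman et al. (2000). The logarithm of the (pseudo-)likelihood Besag (1975) of the above model is given by
$$\ln \mathcal{L}(\Psi) = \ln \prod_{i=1}^d p_i(X^{(i)} \mid X^{\setminus i}, \gamma^{(i)}) = \sum_{i=1}^d \ln p_i(X^{(i)} \mid X^{\setminus i}, \gamma^{(i)}).$$
A maximum likelihood estimate can be obtained by maximizing this function with respect to $\gamma = (\gamma^{(1)}, \ldots, \gamma^{(d)})$, which is equivalent to minimizing the GDN loss function
$$f_G(X, \gamma) = \sum_{i=1}^d \|X^{\setminus i}\gamma^{(i)} - X^{(i)}\|^2.$$
Theorem. Given $S$, an $\varepsilon$-subspace embedding for the columnspace of $X$ as constructed above, $SX$ is an $\varepsilon$-coreset of $X$ for the GDN loss function.
Proof. Fix an arbitrary $\gamma = (\gamma^{(1)}, \ldots, \gamma^{(d)}) \in \mathbb{R}^{d(d-1)}$. Consider the affine map $\Phi : \mathbb{R}^{d-1} \times [d] \to \mathbb{R}^d$, defined by $\Phi(\gamma^{(i)}) = I_d^{\setminus i}\gamma^{(i)} - e_i$, where $I_d^{\setminus i}$ denotes the identity matrix with its $i$-th column removed and $e_i$ the $i$-th canonical basis vector. Clearly $\Phi$ extends its argument from $d-1$ to $d$ dimensions by inserting a $-1$ entry at position $i$ and leaving the other entries in their original order. Let $\beta^{(i)} = \Phi(\gamma^{(i)}) \in \mathbb{R}^d$. Note that for each $i \in [d]$ we have
$$X\beta^{(i)} = X\Phi(\gamma^{(i)}) = X^{\setminus i}\gamma^{(i)} - X^{(i)}, \qquad (2)$$
and each $\beta^{(i)}$ is a vector in $\mathbb{R}^d$. Thus, the triangle inequality and the universal quantifier in Definition 3 (the $\varepsilon$-subspace embedding property) guarantee that
$$\left| \sum_{i=1}^d \|SX\beta^{(i)}\|^2 - \sum_{i=1}^d \|X\beta^{(i)}\|^2 \right| \leq \sum_{i=1}^d \left| \|SX\beta^{(i)}\|^2 - \|X\beta^{(i)}\|^2 \right| \leq \sum_{i=1}^d \varepsilon \|X\beta^{(i)}\|^2 = \varepsilon \sum_{i=1}^d \|X\beta^{(i)}\|^2.$$
The claim follows by substituting Identity (2). It is noteworthy that computing one single coreset for the columnspace of $X$ is sufficient, rather than computing $d$ coresets for the $d$ different subspaces spanned by $X^{\setminus i}$.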
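The argument above rests on two observations: regressing one column on all others is the same as multiplying the full matrix with a coefficient vector that has a $-1$ at the position of the dependent variable, and a single row sample therefore serves all regressions at once. The following sketch checks both numerically. The helper names, the data, and the value of `k` are illustrative assumptions, and the leverage scores are normalized under the assumption that the data matrix has full column rank.

```python
import numpy as np

def gdn_loss(X, gammas, w=None):
    """Sum over variables i of the weighted squared residuals obtained by
    regressing column i on all other columns with coefficients gammas[i]."""
    n, d = X.shape
    w = np.ones(n) if w is None else w
    total = 0.0
    for i in range(d):
        others = np.delete(X, i, axis=1)      # all columns except i
        r = others @ gammas[i] - X[:, i]
        total += float(np.sum(w * r ** 2))
    return total

def extend(gamma_i, i):
    """Insert -1 at position i, so that X @ extend(gamma_i, i) is the residual."""
    return np.insert(gamma_i, i, -1.0)

rng = np.random.default_rng(3)
n, d, k = 4000, 4, 600
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))   # correlated columns

# One leverage score sample of the rows of X, shared by all d regressions.
U = np.linalg.svd(X, full_matrices=False)[0]
q = np.minimum(1.0, k * np.sum(U ** 2, axis=1) / d)     # assumes rank(X) = d
idx = np.flatnonzero(rng.random(n) < q)
w = 1.0 / q[idx]

gammas = [rng.normal(size=d - 1) for _ in range(d)]
full = gdn_loss(X, gammas)
core = gdn_loss(X[idx], gammas, w)
print(abs(full - core) / full)    # relative error of the coreset loss
```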
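From Theorem 3 it is straightforward to show that the minimizer found for the coreset is a good approximation of the minimizer for the original data.
Given $C$, an $\varepsilon$-coreset of $X$ for the GDN loss function, let $\tilde{\gamma} \in \mathrm{argmin}_{\gamma} f_G(C, \gamma)$. Then it holds that
$$f_G(X, \tilde{\gamma}) \leq (1 + 4\varepsilon) \min_{\gamma} f_G(X, \gamma).$$
Proof. Let $\gamma^* \in \mathrm{argmin}_{\gamma} f_G(X, \gamma)$. Then
$$f_G(X, \tilde{\gamma}) \leq \frac{1}{1-\varepsilon} f_G(C, \tilde{\gamma}) \leq \frac{1}{1-\varepsilon} f_G(C, \gamma^*) \leq \frac{1+\varepsilon}{1-\varepsilon} f_G(X, \gamma^*) \leq (1 + 4\varepsilon) f_G(X, \gamma^*).$$
The first and third inequalities are direct applications of the coreset property, the second holds by optimality of $\tilde{\gamma}$ for the coreset, and the last follows from $\varepsilon < 1/2$.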
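The statement above can be probed numerically by fitting every local regression once on the full data and once on a weighted sample, and then evaluating both solutions on the full data. This is a sketch under assumptions: the function names, the synthetic data, the value of `k`, and the full column rank of the data matrix are all illustrative choices.

```python
import numpy as np

def fit_gdn(X, w=None):
    """Weighted least squares fit of every column on all other columns."""
    n, d = X.shape
    sw = np.ones(n) if w is None else np.sqrt(w)
    gammas = []
    for i in range(d):
        A = np.delete(X, i, axis=1) * sw[:, None]
        b = X[:, i] * sw
        gammas.append(np.linalg.lstsq(A, b, rcond=None)[0])
    return gammas

def gdn_loss(X, gammas):
    """Unweighted GDN loss of the coefficient vectors on the data X."""
    return sum(float(np.sum((np.delete(X, i, axis=1) @ g - X[:, i]) ** 2))
               for i, g in enumerate(gammas))

rng = np.random.default_rng(4)
n, d, k = 4000, 4, 600
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d)) + rng.normal(size=(n, d))

# Leverage score sample of the rows (assumes rank(X) = d).
U = np.linalg.svd(X, full_matrices=False)[0]
q = np.minimum(1.0, k * np.sum(U ** 2, axis=1) / d)
idx = np.flatnonzero(rng.random(n) < q)

opt = gdn_loss(X, fit_gdn(X))                         # optimum on the full data
approx = gdn_loss(X, fit_gdn(X[idx], 1.0 / q[idx]))   # coreset minimizer, full loss
print(approx / opt)                                   # never below 1 by optimality
```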
Moreover, the coreset does not affect inference within GDNs. Recently, it was shown for (Bayesian) Gaussian linear regression models that the entire multivariate normal distribution over the parameter space is approximately preserved by $\varepsilon$-subspace embeddings Geppert et al. (2017), which generalizes the above. This implies that the coreset yields a useful pointwise approximation in Markov Chain Monte Carlo inference via random walks like the pseudo-Gibbs sampler in Heckerman et al. (2000).
4 Negative Result on Coresets for Poisson DNs
Naturally, the following question arises: Do (sublinear size) coresets exist for dependency networks over the exponential family in general? Unfortunately, the answer is no! Indeed, there is no (sublinear size) coreset for the simpler problem of Poisson regression, which implies the result for Poisson DNs. We show this formally by reduction from the communication complexity problem known as indexing.
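To this end, recall that the negative log-likelihood for Poisson regression is McCullagh and Nelder (1989); Winkelmann (2008)
$$\ell(\gamma) := \ell(\gamma \mid X, Y) = \sum_{i=1}^n \exp(x_i\gamma) - y_i \cdot x_i\gamma + \ln(y_i!).$$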
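The Poisson regression loss can be written down directly. The sketch below uses the standard negative log-likelihood of a log-linear Poisson model; the optional weights `w` are an addition for evaluating weighted subsamples, and the data are synthetic. The last lines illustrate, informally, why subsampling is delicate here: a single point with a large linear predictor contributes exponentially to the loss.

```python
import math
import numpy as np

def poisson_nll(gamma, X, y, w=None):
    """Negative log-likelihood sum_i w_i * (exp(x_i g) - y_i * x_i g + ln(y_i!))."""
    eta = X @ gamma
    log_fact = np.array([math.lgamma(v + 1.0) for v in y])
    w = np.ones(len(y)) if w is None else w
    return float(np.sum(w * (np.exp(eta) - y * eta + log_fact)))

rng = np.random.default_rng(5)
X = np.column_stack([np.ones(1000), rng.normal(size=1000)])
gamma_true = np.array([0.5, 0.3])
y = rng.poisson(np.exp(X @ gamma_true))
print(poisson_nll(gamma_true, X, y), poisson_nll(np.zeros(2), X, y))

# One additional point far out in covariate space dominates the whole loss,
# so a subsample that misses it cannot approximate the loss well.
X_out = np.vstack([X, [1.0, 40.0]])
y_out = np.append(y, 1)
ratio = poisson_nll(gamma_true, X_out, y_out) / poisson_nll(gamma_true, X, y)
print(ratio)
```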
Let be a data structure for that approximates likelihood queries for Poisson regression, such that
If then requires bits of storage. We reduce from the indexing problem which is known to have $\Omega(n)$ one-way randomized communication complexity Jayram et al. (2008). Alice is given a vector . She produces for every with the points , where denote the unit roots in the plane, i.e., the vertices of a regular polygon of radius in canonical order. The corresponding counts are set to . She builds and sends of size to Bob, whose task is to guess the bit . He chooses to query
. Note that this affine hyperplane separates
from the other scaled unit roots since it passes exactly through and . Also, all points are within distance from each other by construction and consequently from the hyperplane. Thus, for all . If , then does not exist and the cost is at most
If then is in the expensive halfspace and at distance exactly
So the cost is bounded below by .
Given , Bob can distinguish these two cases based on the data structure only, by deciding whether is strictly smaller or larger than . Consequently , since this solves the indexing problem.
Note that the bound is given in bit complexity, but restricting the data structure to a sampling based coreset and assuming every data point can be expressed in bits, this means we still have a lower bound of samples. Every sampling based coreset for Poisson regression with approximation factor as in Theorem 4 requires at least samples. At this point it seems very likely that a similar argument can be used to rule out any space constant approximation algorithm. This remains an open problem for now.
5 Why Core DNs for Count Data can still work
So far, we have a quite pessimistic view on extending CDNs beyond Gaussians. In the Gaussian setting, where the loss is measured in squared Euclidean distance, the number of important points, i.e., those having significantly large leverage scores, is bounded essentially by $O(d)$. This is implicit in the original early works Drineas et al. (2008) and has been explicitly formalized later Langberg and Schulman (2010); Clarkson and Woodruff (2013). It is crucial to understand that this is an inherent property of the norm function, and thus holds for arbitrary data. For the Poisson GLM, in contrast, we have shown that its loss function does not come with such properties from scratch. We constructed a worst case scenario, where basically every single input point is important for the model and needs to appear in the coreset. Usually, this is not the case with statistical models, where the data is assumed to be generated i.i.d. from some generating distribution that fits the model assumptions. Consider for instance a data reduction for Gaussian linear regression via leverage score sampling vs. uniform sampling. It was shown that, given the data follows the model assumptions of a Gaussian distribution, the two approaches behave very similarly. Or, to put it another way, the leverage scores are quite uniform. In the presence of more and more outliers generated by the heavier tails of $t$ distributions, leverage score sampling increasingly outperforms uniform sampling Ma et al. (2015).
The Poisson model
$$Y \sim \mathrm{Poi}(\lambda), \quad \lambda = \exp(X\gamma), \qquad (3)$$
though being the standard model for count data, suffers from its inherent limitation to equidispersed data since $E[Y \mid X] = V[Y \mid X] = \exp(X\gamma)$. Count data, however, is often overdispersed, especially for large counts. This is due to unobserved variables or problem-specific heterogeneity and contagion effects. The lognormal Poisson model is known to be inferior for data which specifically follows the Poisson model, but turns out to be more powerful in modeling the effects that cannot be captured by the simple Poisson model. It has wide applications, for instance in econometric elasticity problems. We review the lognormal Poisson model for count data Winkelmann (2008)
$$Y \sim \mathrm{Poi}(\lambda), \quad \lambda = \exp(X\gamma)\exp(Z) = \exp(X\gamma + Z), \quad Z \sim \mathcal{N}(\mu, \sigma^2).$$
A natural choice for the parameters of the lognormal distribution is $\mu = -\sigma^2/2$, in which case we have
$$E[Y \mid X] = \exp(X\gamma + \mu + \sigma^2/2) = \exp(X\gamma), \quad V[Y \mid X] = E[Y \mid X] + (\exp(\sigma^2) - 1) \, E[Y \mid X]^2.$$
It follows that $V[Y \mid X] = \exp(X\gamma) + \Omega(\exp(X\gamma)^2) > \exp(X\gamma)$, where a constant $\sigma^2$ that is independent of $X$ controls the amount of overdispersion. Taking the limit for $\sigma \to 0$ we arrive at the simple model (3), since the distribution of $\exp(Z)$ tends to $\delta_1$, the deterministic Dirac delta distribution which puts all mass on $1$. The inference might aim for the lognormal Poisson model directly as in Zhou et al. (2012)
, or it can be performed by (pseudo-)maximum likelihood estimation of the simple Poisson model. The latter provides a consistent estimator as long as the log-linear mean function is correctly specified, even if higher moments do not possess the limitations inherent in the simple Poisson model Winkelmann (2008).
Summing up our review on the count modeling perspective, we learn that preserving the log-linear mean function in a Poisson model is crucial towards consistency of the estimator. Moreover, modeling counts in a lognormal model gives us intuition why leverage score sampling can capture the underlying linear model accurately: In the lognormal Poisson model, $\lambda$ follows a lognormal distribution. It thus holds for $\ln\lambda = X\gamma + Z$ that
$$\ln\lambda \sim \mathcal{N}(X\gamma + \mu \cdot \mathbf{1}, \, \sigma^2 I_n)$$
by independence of the observations, which implies
$$\ln\lambda = X\gamma + \mu \cdot \mathbf{1} + \eta, \quad \eta \sim \mathcal{N}(0, \sigma^2 I_n).$$
Omitting the bias $\mu$ in each intercept term (which can be cast into $X$), we notice that this yields again an ordinary least squares problem
$$\min_{\gamma} \|X\gamma - \ln\lambda\|^2$$
defined in the columnspace of $[X, \ln\lambda]$.
[Figure: panels titled Negative Log Pseudo Likelihood, RMSE, and Training Time.]
There is still a missing piece in our argumentation. In the previous section we have used that the coreset construction is an $\varepsilon$-subspace embedding for the columnspace of the whole data set including the dependent variable, i.e., for $[X, \ln\lambda]$. We face two problems. First, $\ln\lambda$ is only implicitly given in the data, but is not explicitly available. Second, $\ln\lambda$ is a vector derived from $X^{(i)}$ in our setting and might be different for any of the $d$ instances. Fortunately, it was shown via more complicated arguments Drineas et al. (2008) that it is sufficient for a good approximation if the sampling is done obliviously to the dependent variable. The intuition comes from the fact that the loss of any point in the subspace can be expressed via the projection of $\ln\lambda$ onto the subspace spanned by $X$, and the residual of its projection. A good approximation of the subspace implicitly approximates the projection of any fixed vector, which is then applied to the residual vector of the orthogonal projection. This solves the first problem, since it is only necessary to have a subspace embedding for $X$. The second issue can be addressed by increasing the sample size by a factor of $O(\log d)$ for boosting the error probability to $O(1/d)$ and taking a union bound.
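The overdispersion of the lognormal Poisson model reviewed in this section can be checked with a short simulation. The sketch fixes a single covariate vector, so that the log-linear mean is one number, and uses the standard moment formulas of a Poisson distribution mixed with a mean-one lognormal factor. The numeric values of the mean and of the lognormal variance are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n_draws = 200_000
lam = np.exp(1.0)            # exp(x gamma) for one fixed covariate vector
sigma2 = 0.5                 # variance of Z; controls the overdispersion
mu = -sigma2 / 2.0           # makes E[exp(Z)] = 1, so the mean stays lam

z = rng.normal(mu, np.sqrt(sigma2), size=n_draws)
y = rng.poisson(lam * np.exp(z))

mean_theory = lam
var_theory = lam + (np.exp(sigma2) - 1.0) * lam ** 2
print("mean:", y.mean(), "theory:", mean_theory)
print("var :", y.var(), "theory:", var_theory)
```

The simulated mean matches the log-linear mean, while the variance clearly exceeds it, which is the overdispersion that the simple Poisson model cannot represent.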
6 Empirical Illustration
Our intention here is to corroborate our theoretical results by investigating empirically the following questions: (Q1) How does the performance of CDNs compare to DNs with access to the full training data set and to a uniform sample from the training data set, and how does the empirical error behave as a function of the sample size? (Q2) Do coresets affect the structure recovered by the DN? To this aim, we implemented (C)DNs in Python calling R. All experiments ran on a Linux machine (56 cores, 4 GPUs, and 512GB RAM).
Benchmarks on MNIST and Traffic Data (Q1): We considered two datasets. In a first experiment, we used the MNIST (http://yann.lecun.com/exdb/mnist/) data set of handwritten labeled digits. We employed the training set consisting of 55000 images, each with 784 pixels, for a total of 43,120,000 measurements, and trained Gaussian DNs on it. The second data set we considered contains traffic count measurements on selected roads around the city of Cologne in Germany Ide et al. (2015). It consists of 7994 time-stamped measurements taken by 184 sensors for a total of 1,470,896 measurements. On this dataset we trained Poisson DNs. For each dataset, we performed 10-fold cross-validation for training a full DN (Full) using all the data, leverage score sampling coresets (CDNs), and uniform samples (Uniform), for different sample sizes. We then compared the predictions made by all the DNs and the time taken to train them. For the predictions on the MNIST dataset, we clipped the predictions to the range [0,1] for all the DNs. For the Traffic dataset, we rounded the prediction for every measurement down, i.e., to the largest integer less than or equal to the predicted value.
Sample portion  MNIST GCDN  MNIST GUDN  Traffic PCDN  Traffic PUDN  
10%  18.03%  11162.01%  6.81%  9.6% 
20%  0.57%  13.86%  2.9%  3.17% 
30%  0.01%  13.33%  2.04%  1.68% 
40%  0.01%  2.3%  1.59%  0.99% 
Fig. 2 summarizes the results. As one can see, CDNs outperform DNs trained on full data and are orders of magnitude faster. Compared to uniform sampling, coresets are competitive. Actually, as seen on the traffic dataset, CDNs can have more predictive power than the “optimal” model using the full data. This is in line with Mahoney (2011), who observed that coresets implicitly introduce regularization and lead to more robust output. Table 1 summarizes the empirical relative errors between (C/U)DNs and DNs trained on all the data. CDNs clearly recover the original model, at a fraction of training data. Overall, this answers (Q1) affirmatively.
[Figure 3: Gaussian CDNs (top row) and Poisson CDNs (bottom row).]
Relationship Elucidation (Q2): We investigated the performance of CDNs when recovering the graph structure of word interactions from a text corpus. For this purpose, we used the NIPS bag-of-words dataset (https://archive.ics.uci.edu/ml/datasets/bag+of+words). It contains 1,500 documents with a vocabulary above 12k words. We considered the 100 most frequent words.
Fig. 3 illustrates the results qualitatively. It shows three CDNs of sampling sizes 40%, 70% and 100% for Gaussians (top) after a transformation and for Poissons (bottom): CDNs capture well the gist of the NIPS corpus. Table 2 confirms this quantitatively. It shows the Frobenius norms between the DNs: CDNs capture the gist better than naive, i.e., uniform sampling. This answers (Q2) affirmatively.
To summarize our empirical results, the answers to questions (Q1) and (Q2) show the benefits of CDNs.
7 Conclusions
Inspired by the question of how we can train graphical models on a massive dataset, we have studied coresets for estimating Dependency Networks (DNs). We established the first rigorous guarantees for obtaining compressed approximations of Gaussian DNs for large data sets. We proved worst-case impossibility results on coresets for Poisson DNs. A review of lognormal Poisson modeling of counts provided insights into why our coreset construction still performs well for count data in practice.
Our experimental results demonstrate that the resulting Core Dependency Networks (CDNs) can achieve significant gains over no or naive subsampling, even in the case of count data, making it possible to learn models on much larger datasets using the same hardware.
Sample portion  UDN Gaussian  UDN Poisson  CDN Gaussian  CDN Poisson  
40%  9.0676  6.4042  3.9135  0.6497 
70%  4.8487  1.6262  2.6327  0.3821 
CDNs provide several interesting avenues for future work. The conditional independence assumption opens the door to explore hybrid multivariate models, where each variable can potentially come from a different GLM family or link function, on massive data sets. This can further be used to hint at independencies among variables in the multivariate setting, making them useful in many other large data applications. Generally, our results may pave the way to establish coresets for deep models using the close connection between dependency networks and deep generative stochastic networks Bengio et al. (2014), sum-product networks Poon and Domingos (2011); Molina et al. (2017), as well as other statistical models that build multivariate distributions from univariate ones Yang et al. (2015).
Acknowledgements: This work has been supported by Deutsche Forschungsgemeinschaft (DFG) within the Collaborative Research Center SFB 876 "Providing Information by Resource-Constrained Analysis", projects B4 and C4.
References
 Agarwal and Sharathkumar (2015) Pankaj K. Agarwal and R. Sharathkumar. Streaming algorithms for extent problems in high dimensions. Algorithmica, 72(1):83–98, 2015. doi: 10.1007/s00453-013-9846-4. URL https://doi.org/10.1007/s00453-013-9846-4.
 Allen and Liu (2013) Genevera I. Allen and Zhandong Liu. A local Poisson graphical model for inferring networks from sequencing data. IEEE Transactions on Nanobioscience, 12(3):189–198, 2013. ISSN 1536-1241.
 Badoiu and Clarkson (2003) Mihai Badoiu and Kenneth L. Clarkson. Smaller coresets for balls. In Proc. of SODA, pages 801–802, 2003.
 Badoiu and Clarkson (2008) Mihai Badoiu and Kenneth L. Clarkson. Optimal coresets for balls. Computational Geometry, 40(1):14–22, 2008. doi: 10.1016/j.comgeo.2007.04.002. URL https://doi.org/10.1016/j.comgeo.2007.04.002.
 Badoiu et al. (2002) Mihai Badoiu, Sariel Har-Peled, and Piotr Indyk. Approximate clustering via coresets. In Proceedings of STOC, pages 250–257, 2002.
 Bengio et al. (2014) Y. Bengio, E. Laufer, G. Alain, and J. Yosinski. Deep generative stochastic networks trainable by backprop. In Proc. of ICML, pages 226–234, 2014.
 Besag (1975) Julian Besag. Statistical analysis of non-lattice data. Journal of the Royal Statistical Society, Series D, 24(3):179–195, 1975.
 Carlson et al. (2008) Jonathan M. Carlson, Zabrina L. Brumme, Christine M. Rousseau, Chanson J. Brumme, Philippa Matthews, Carl Myers Kadie, James I. Mullins, Bruce D. Walker, P. Richard Harrigan, Philip J. R. Goulder, and David Heckerman. Phylogenetic dependency networks: Inferring patterns of CTL escape and codon covariation in HIV-1 gag. PLoS Computational Biology, 4(11), 2008.
 Clarkson and Woodruff (2013) Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Proc. of STOC, pages 81–90, 2013.
 Dasgupta et al. (2009) Anirban Dasgupta, Petros Drineas, Boulos Harb, Ravi Kumar, and Michael W. Mahoney. Sampling algorithms and coresets for regression. SIAM Journal on Computing, 38(5):2060–2078, 2009. doi: 10.1137/070696507. URL https://doi.org/10.1137/070696507.
 Dobra (2009) Adrian Dobra. Variable selection and dependency networks for genome-wide data. Biostatistics, 10(4):621–639, 2009.
 Drineas et al. (2006) Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Sampling algorithms for regression and applications. In Proc. of SODA, pages 1127–1136, 2006. URL http://dl.acm.org/citation.cfm?id=1109557.1109682.
 Drineas et al. (2008) Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30(2):844–881, 2008. doi: 10.1137/07070471X. URL https://doi.org/10.1137/07070471X.
 Feldman et al. (2011) Dan Feldman, Matthew Faulkner, and Andreas Krause. Scalable training of mixture models via coresets. In Proc. of NIPS, 2011.
 Feldman et al. (2013) Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In Proc. of SODA, pages 1434–1453, 2013.
 Feldman et al. (2014) Dan Feldman, Alexander Munteanu, and Christian Sohler. Smallest enclosing ball for probabilistic data. In Proc. of SOCG, pages 214–223, 2014. doi: 10.1145/2582112.2582114. URL http://doi.acm.org/10.1145/2582112.2582114.
 Geppert et al. (2017) Leo N Geppert, Katja Ickstadt, Alexander Munteanu, Jens Quedenfeld, and Christian Sohler. Random projections for Bayesian regression. Statistics and Computing, 27(1):79–101, 2017.
 Hadiji et al. (2015) Fabian Hadiji, Alejandro Molina, Sriraam Natarajan, and Kristian Kersting. Poisson dependency networks: Gradient boosted models for multivariate count data. MLJ, 100(2-3):477–507, 2015.
 Har-Peled (2015) Sariel Har-Peled. A simple algorithm for maximum margin classification, revisited. arXiv, 1507.01563, 2015. URL http://arxiv.org/abs/1507.01563.
 Har-Peled et al. (2007) Sariel Har-Peled, Dan Roth, and Dav Zimak. Maximum margin coresets for active and noise tolerant learning. In Proc. of IJCAI, pages 836–841, 2007.
 Heckerman et al. (2000) D. Heckerman, D. Chickering, C. Meek, R. Rounthwaite, and C. Kadie. Dependency networks for density estimation, collaborative filtering, and data visualization. Journal of Machine Learning Research, 1:49–76, 2000.
 Ide et al. (2015) Christoph Ide, Fabian Hadiji, Lars Habel, Alejandro Molina, Thomas Zaksek, Michael Schreckenberg, Kristian Kersting, and Christian Wietfeld. LTE connectivity and vehicular traffic prediction based on machine learning approaches. In Proc. of IEEE VTC Fall, 2015.
 Jayram et al. (2008) T. S. Jayram, Ravi Kumar, and D. Sivakumar. The one-way communication complexity of Hamming distance. Theory of Computing, 4(1):129–135, 2008. doi: 10.4086/toc.2008.v004a006. URL https://doi.org/10.4086/toc.2008.v004a006.
 Langberg and Schulman (2010) Michael Langberg and Leonard J. Schulman. Universal epsilon-approximators for integrals. In Proc. of SODA, 2010.
 Lucic et al. (2016) Mario Lucic, Olivier Bachem, and Andreas Krause. Strong coresets for hard and soft Bregman clustering with applications to exponential family mixtures. In Proc. of AISTATS, pages 1–9, 2016.
 Ma et al. (2015) Ping Ma, Michael W. Mahoney, and Bin Yu. A statistical perspective on algorithmic leveraging. JMLR, 16:861–911, 2015. URL http://dl.acm.org/citation.cfm?id=2831141.
 Mahoney (2011) Michael W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123–224, 2011. doi: 10.1561/2200000035. URL https://doi.org/10.1561/2200000035.
 McCullagh and Nelder (1989) Peter McCullagh and John Nelder. Generalized Linear Models. Chapman and Hall, 1989.
 Molina et al. (2017) Alejandro Molina, Sriraam Natarajan, and Kristian Kersting. Poisson sum-product networks: A deep architecture for tractable multivariate Poisson distributions. In Proc. of AAAI, 2017.
 Motwani and Raghavan (1995) Rajeev Motwani and Prabhakar Raghavan. Randomized Algorithms. Cambridge Univ. Press, 1995. ISBN 0521474655.
 Phatak et al. (2010) Aloke Phatak, Harri T. Kiiveri, Line Harder Clemmensen, and William J. Wilson. NetRaVE: constructing dependency networks using sparse linear regression. Bioinformatics, 26(12):1576–1577, 2010.
 Phillips (2017) Jeff M Phillips. Coresets and sketches. In Handbook of Discrete and Computational Geometry. 2017.
 Poon and Domingos (2011) Hoifung Poon and Pedro Domingos. Sum-Product Networks: A New Deep Architecture. Proc. of UAI, 2011.
 Reddi et al. (2015) Sashank J. Reddi, Barnabás Póczos, and Alexander J. Smola. Communication efficient coresets for empirical loss minimization. In Proc. of UAI, pages 752–761, 2015.
 Rudelson and Vershynin (2007) Mark Rudelson and Roman Vershynin. Sampling from large matrices: An approach through geometric functional analysis. Journal of the ACM, 54(4):21, 2007. doi: 10.1145/1255443.1255449. URL http://doi.acm.org/10.1145/1255443.1255449.
 Tropp (2011) Joel A. Tropp. Improved analysis of the subsampled randomized Hadamard transform. Advances in Adaptive Data Analysis, 3(1-2):115–126, 2011. doi: 10.1142/S1793536911000787. URL https://doi.org/10.1142/S1793536911000787.
 Winkelmann (2008) Rainer Winkelmann. Econometric Analysis of Count Data. Springer, 5th edition, 2008. ISBN 3540776486, 9783540776482.
 Yang et al. (2015) Eunho Yang, Pradeep Ravikumar, Genevera I. Allen, and Zhandong Liu. On graphical models via univariate exponential family distributions. JMLR, 16:3813–3847, 2015.
 Zhou et al. (2012) Mingyuan Zhou, Lingbo Li, David B. Dunson, and Lawrence Carin. Lognormal and gamma mixed negative binomial regression. In Proceedings of ICML, 2012. URL http://icml.cc/2012/papers/665.pdf.