1 Introduction
Statistical inference is an important technique to express hypothesis and reason about data in data analytical tasks. Today, many big data applications are based on statistical inference. Examples include topic modeling [5, 21]
[22, 13, 16], spam filtering [19], to name a few.One of most critical steps of statistical inference is to construct a statistical model to formally represent the underlying statistical inference task [8]
. The development of a statistical model is never trivial because a domain user may have to devise and implement many different models before finding a promising one for a specific task. Currently, most scalable machine learning libraries (e.g. MLlib
[4]) only contain standard models like support vector machine, linear regression, latent Dirichlet allocation (LDA)
[5], etc. To carry out statistical inference on customized models with big data, the user has to implement her own models and inference codes on a distributed framework like Apache Spark [26] and Hadoop [1].Developing inference code requires extensive knowledge in both statistical inference and programming techniques in distributed frameworks. Moreover, model definitions, inference algorithms, and data processing tasks are all mixed up in the resulting code, making it hard to debug and reason about. For even a slight alteration to the model in quest of the most promising one, the model designer will have to rederive the formulas and reimplement the inference codes, which is tedious and errorprone.
In this paper, we present InferSpark, a probabilistic programming framework on top of Spark. Probabilistic programming is an emerging paradigm that allows statistician and domain users to succinctly express a model definition within a host programming language and transfers the burden of implementing the inference algorithm from the user to the compilers and runtime systems [3]. For example, Infer.NET [17]
is a probabilistic programming framework that extends C#. The user can express, say, a Bayesian network in C# and the compiler will generate code to perform inference on it. Such code could be as efficient as the implementation of the same inference algorithm carefully optimized by an experienced programmer.
So far, the emphasis of probabilistic programming has been put on the expressiveness of the languages and the development of efficient inference algorithms (e.g., variational message passing [24], Gibbs sampling [6], MetropolisHastings sampling [7]) to handle a wider range of statistical models. The issue of scaling out these frameworks, however, has not been addressed. For example, Infer.NET only works on a single machine. When we tried to use Infer.NET to train an LDA model of 96 topics and 9040word vocabulary on only 3% of Wikipedia articles, the actual memory requirement has already exceeded 512GB, the maximum memory of most commodity servers today. The goal of InferSpark is thus to bring probabilistic programming to Spark, a predominant distributed data analytic platform, for carrying out statistical inference at scale. The InferSpark project consists of two parts:

Extending Scala to support probabilistic programming
Spark is implemented in Scala due to its functional nature. The fact that both preprocessing and postprocessing can be included in one Scala program substantially eases the development process. In InferSpark, we extend Scala with probabilistic programming constructs to leverage its functional features. Carrying out statistical inference with InferSpark is simple and intuitive, and implicitly enjoys the distributed computing capability brought by Spark. As an example, the LDA statistical model was implemented using 503 lines of Scala code in MLlib (excluding comments, javadocs, blank lines, and utilities of MLlib). With InferSpark, we could implement that using only 7 lines of Scala code (see Figure 1).

Building an InferSpark compiler and a runtime system
InferSpark compiles InferSpark models into Scala classes and objects that implement the inference algorithms with a set of API. The user can call the API from their Scala programs to specify the input (observed) data and query about the model (e.g. compute the expectation of some random variables or retrieve the parameters of the posterior distributions).
Currently, InferSpark supports Bayesian network models. Bayesian network is a major branch of probabilistic graphical model and it has already covered models like naive Bayes, LDA, TSM
[16], etc. The goal of this paper is to describe the workflow, architecture, and Bayesian network implementation of InferSpark. We will opensource InferSpark and support other models (e.g., Markov networks) afterwards.To the best of our knowledge, InferSpark is the first endeavor to bring probabilistic programming into the (big) data engineering domain. Efforts like MLI [20] and SystemML [10] all aim at easing the difficulty of developing distributed machine learning techniques
(e.g., stochastic gradient descent (SGD)). InferSpark aims at easing the complexity of developing
custom statistical models, with statistician, data scientists, and machine learning researchers as the target users. This paper presents the following technical contributions of InferSpark so far.
We present the extension of Scala’s syntax that can express various sophisticated Bayesian network models with ease.

We present the details of compiling and executing an InferSpark program on Spark. That includes the mechanism of automatic generating efficient inference codes that include checkpointing (to avoid long lineage), proper timing of caching and anticaching (to improve efficiency under memory constraint), and partitioning (to avoid unnecessary replication and shuffling).

We present an empirical study that shows InferSpark can enable statistical inference on both customized and standard models at scale.
The remainder of this paper is organized as follows: Section 2 presents the essential background for this paper. Section 3 then gives an overview of InferSpark. Section 4 gives the implementation details of InferSpark. Section 5 presents an evaluation study of the current version of InferSpark. Section 6 discusses related work and Section 7 contains our concluding remarks.
2 Background
This section presents some preliminary knowledge about statistical inference and, in particular, Bayesian inference using variational message passing, a popular variational inference algorithm, as well as its implementation concerns on Apache Spark/GraphX stack.
2.1 Statistical Inference
Statistical inference is a common machine learning task of obtaining the properties of the underlying distribution of data. For example, one can infer from many coin tosses the probability of the coin turning up head by counting how many tosses out of the all tosses are head. There are two different approaches to model the number of heads: the frequentist approach and the Bayesian approach.
Let be the total number of tosses and be the number of heads. In frequentist approach, the probability of coin turning up head is viewed as an unknown fixed parameter so the best guess would be the number of heads in the results over the total number of tosses .
In Bayesian approach, the probability of head is viewed as a hidden random variable drawn from a prior distribution, e.g.,
, the uniform distribution over [0, 1]. According to the Bayes Theorem, the posterior distribution of the probability of coin turning up head can be calculated as follows:
(1) 
The frequentist approach needs smoothing and regularization techniques to generalize on unseen data while the Bayesian approach does not because the latter can capture the uncertainty by modeling the parameters as random variables.
2.2 Probabilistic Graphical Model
Probabilistic graphical model [14] (PGM) is a graphical representation of the conditional dependencies in statistical inference. Two types of PGM are widely used: Bayesian networks and Markov networks. Markov networks are undirected graphs while Bayesian networks are directed acyclic graphs. Each type of PGM can represent certain independence constraints that the other cannot represent. InferSpark currently supports Bayesian networks and regards Markov networks as the next step.
In a Bayesian network, the vertices are random variables and the edges represent the conditional dependencies between the random variables. The joint probability of a Bayesian network can be factorized into conditional probabilities of each vertex conditioned on their parents . Figure 2 is the Bayesian network of the coin flip model. Here, the factors in the joint probability are and . The plate surrounding represents repetition of the random variables. The subscript is the number of repetitions. The outcome of coin tosses is repeated times and each depends on the probability . The Bayesian network of the coin flip model encodes the joint probability .
Bayesian networks are generative models, which describes the process of generating random data from hidden random variables. The typical inference task on generative model is to calculate the posterior distribution of the hidden variables given the observed data. In the coin flip model, the observed data are the outcomes of coin tosses and the hidden random variable is the probability of head. The inference task is to calculate the posterior in Equation 1.
2.3 Bayesian Inference Algorithms
Inference of the coin flip model is simple because the posterior (Equation 1) has a tractable analytical solution. However, most realworld models are more complex than that and their posteriors do not have a familiar form. Moreover, even calculating the probability mass function or probability density function at one point is hard because of the difficulty of calculating the probability of the observed data in the denominator of the posterior. The probability of the observed data is also called evidence. It is the summation or integration over the space of hidden variables and is hard to calculate because of exponential growth of the number of terms.
Consider a twocoin model in Figure 3, where we first decide which coin to toss, with probability to choose the first coin and probability to choose the second coin (). We then toss the chosen coin, which has probability to turn up head. This process is repeated times. The twocoin model is a mixture model, which represents the mixture of multiple subpopulations. Each such subpopulation, in this case and , have their own distributions, while the observation can only be obtained on the overall population, that is the number of heads after tosses. The twocoin model has no tractable analytical solution. Assuming Beta priors for , and , the posterior distribution is:
where the joint distribution
is:The integral in the denominator of the posterior is intractable because it has
terms and takes exponential time to compute. Since solving for the exact posterior is intractable, approximate inference algorithms are used instead. Although approximate inference is also NPhard, it performs well in practical applications. Approximate inference techniques include Markov Chain Monte Carlo (MCMC) method, variational inference and so on. MCMC algorithms are inherently nondeterministic, and single random number generator is required to ensure randomness. In a distributed setting, sharing a single random number generator across the nodes in a cluster is a serious performance bottleneck. Having different generators on different nodes would risk the correctness of the MCMC algorithms. On the other hand, variational inference methods such as Variational Message Passing (VMP)
[24] is a deterministic graphbased message passing algorithm, which can be easily adapted to a distributed graph computation model such as GraphX [25]. InferSpark currently supports VMP. Support of other techniques (e.g., MCMC) is included in InferSpark’s opensource agenda.To infer the posterior of the twocoin model using VMP, the original Bayesian network has to be expanded by adding some more hidden random variables. Figure 4 shows Bayesian network with hidden random variables added, where is the index (1 or 2) of the coin chosen for the toss.
The VMP algorithm approximates the posterior distribution with a fully factorized distribution . The algorithm iteratively passes messages along the edges and updates the parameters of each vertex to minimize the KL divergence between and the posterior. Because the true posterior is unknown, VMP algorithm maximizes the evidence lower bound (ELBO), which is equivalent to minimizing the KL divergence [24]. However, ELBO involves only the expectation of the log likelihoods of the approximated distribution and is thus straightforward to compute.
Figure 5 shows the message passing graph of VMP for the twocoin model with tosses. There are four types of vertices in the twocoin model’s message passing graph: , , , and , with each type corresponding to a variable in the Bayesian network. Vertices in the Bayesian network are expanded in the message passing graph. For example, the repetition of vertex is 2 in the model. So, we have and in the massage passing graph. Edges in the twocoin model’s message passing graph are bidirectional and different messages are sent in different directions. The three edges , , in the Bayesian network thus create six types of edges in the message passing graph: , , , , , and .
Each variable (vertex) is associated with the parameters of its approximate distribution. Initially the parameters can be arbitrarily initialized. For edges whose direction is the same as in the Bayesian network, the message content only depends on the parameters of the sender. For example, the message from to is a vector of expectations of logarithms of and , denoted as in Figure 5. For edges whose direction is opposite of those in the Bayesian network, in addition to the parameters of the sender, the message content may also depend on other parents of the sender in the Bayesian network. For example, the message from to is , which depends both on the observed outcome and the expectations of and .
Based on the message passing graph, VMP selects one vertex in each iteration and pulls messages from ’s neighbor(s). If the message source is the child of in the Bayesian network, VMP also pulls message from ’s other parents. For example, assuming VMP selects in an iteration, it will pull messages from and . Since is the child of in the Bayesian network (Figure 4), and depends on , VMP will first pull a message from to , then pull a message from to . This process however is would not be propagated and is restricted to only ’s direct parents. On receiving all the requested messages, the selected vertex updates its parameters by aggregating the messages.
Implementing the VMP inference code for a statistical model (Bayesian network) requires (i) deriving all the messages mathematically (e.g., deriving ) and (ii) coding the message passing mechanism specifically for . The program tends to contain a lot of boilerplate code because there are many types of messages and vertices. For the twocoin model, would be coded as a vector of integers whereas would be coded as a vector of probabilities (float), etc. Therefore, even a slight alteration to the model, say, from twocoin to twocoinandonedice, all the messages have to be rederived and the message passing code has to be reimplemented, which is tedious to hand code and hard to maintain.
2.4 Inference on Apache Spark
When a domain user has crafted a new model and intends to program the corresponding VMP inference code on Apache Spark, one natural choice is to do that through GraphX, the distributed graph processing framework on top of Spark. Nevertheless, the user still has to go through a number of programming and system concerns, which we believe, should better be handled by a framework like InferSpark instead.
First, the Pregel programming abstraction of GraphX restricts that only updated vertices in the last iteration can send message in the current iteration. So for VMP, when and (Figure 5) are selected to be updated in the last iteration (multiple ’s can be updated in the same iteration when parallelizing VMP), and cannot be updated in the current iteration unfortunately because they require messages from and , which were not selected and updated in the last iteration. Working around this through the use of primitive aggregateMessages and outerJoinVertices API would not make life easier. Specifically, the user would have to handle some low level details such as determining which intermediate RDDs to insert to or evict from the cache.
Second, the user has to determine the best timing to do checkpointing so as to avoid performance degradation brought by the long lineage created by many iterations.
Last but not the least, the user may have to customize a partition strategy for each model being evaluated. GraphX builtin partitioning strategies are general and thus do not work well with message passing graphs, which usually possess (i) complete bipartite components between the posteriors and the observed variables (e.g., , and in Figure 5), and (ii) large repetition of edge pairs induced from the plate (e.g., pairs of in Figure 5). GraphX adopts a vertexcut approach for graph partitioning and a vertex would be replicated to multiple partitions if it lies on the cut. So, imagine if the partition cuts on ’s in Figure 5, that would incur large replication overhead as well as shuffling overhead. Consequently, that really requires the domain users to have excellent knowledge on GraphX in order to carry out efficient inference on Spark.
3 InferSpark Overview
The overall architecture of InferSpark is shown in Figure 6. An InferSpark program is a mix of Bayesian network model definition and normal user code. The Bayesian network construction module separates the model part out, and transforms it into a Bayesian network template. This template is then instantiated with parameters and meta data from the input data at runtime by the code generation module, which produces the VMP inference code and message passing graph. These are then executed on the GraphX distributed engine to produce the final posterior distribution. Next, we describe the three key modules in more details with the example of the twocoin model (Figure 4).
3.1 Running Example
Figure 7 shows the definition of the twocoin model in InferSpark. The definition starts with “@Model” annotation. The rest is similar to a class definition in scala. The model parameters (“alpha” and “beta”) are constants to the model. In the model body, only a sequence of value definitions are allowed, each defining a random variable instead of a normal deterministic variable. The use of “val” instead of “var” in the syntax implies the conditional dependencies between random variables are fixed once defined. For example, line 2 defines the random variable having a symmetric Beta prior .
InferSpark model uses “Range” class in Scala to represent plates. Line 3 defines a plate of size 2 with the probabilities of seeing head in the two coins. The “?” is a special type of “Range” representing a plate of unknown size at the time of model definition. In this case, the exact size of the plate will be provided or inferred from observed variables at run time. When a random variable is defined by mapping from a plate of other random variables, the new random variable is in the same plate as the others. For example, line 5 defines the outcomes as the mapping from to Categorical mixtures, therefore will be in the same plate as . Since the size of the plate surrounding and is unknown, we need to specify the size at run time. We can either explicitly set the length of the “?” or let InferSpark set that based on the number of observed outcomes (line 11).
At the first glance, “?” seems redundant since it can be replaced by a model parameter denoting the size of the plate. However, “?” becomes more useful when there are nested plates. In the twocoin model, suppose after we choose one coin, we toss it multiple times. Figure 8 shows this scenario. Then the outcomes are in two nested plates where the inner plate is repeated times, and each instance may have a different size . Using the “?” syntax for the inner plate, we simply change line 5 to
Ψval x = z.map(z => ?.map(_ => Categorical(phi(z))))
3.2 Bayesian Network Construction
An input InferSpark program is first parsed and separated into two parts: the model definition (“@Model class TwoCoins” in Figure 7) and the ordinary scala program (“object Main” in Figure 7). The model definition is analyzed and transformed into valid scala classes that define a Bayesian network constructed from the model definition (e.g., Figure 9) and the inference/query API. Note the Bayesian network constructed at this stage is only a template (different than Figure 4) because some of the information is not available until run time (e.g., the outcomes , the number of coin flippings and the model parameters and ).
3.3 Metadata Collection
Metadata such as the observed values and the plate sizes missing from the Bayesian networks are collected at runtime. In the twocoin model, an instance of the model is created via the constructor invocation (e.g. “val m = new TwoCoin(1.0, 1.0)” on line 10 of Figure 7). The constructor call provides the missing constants in the prior distributions of and . For each random variable defined in the model definition, there is an interface field with the same name in the constructed object. Observed values are provided to InferSpark by calling the “observe” (line 11 of Figure 7) API on the field. There, the user provides an RDD of observed outcomes “xdata” to InferSpark by calling “m.x.observe(xdata)”. The observe API also triggers the calculation of unknown plate sizes. In this case, the size of plate surrounding and is automatically calculated by counting the number of elements in the RDD.
3.4 Code Generation
When the user calls “infer” API (line 12 of Figure 7) on the model instance, InferSpark checks whether all the missing metadata are collected. If so, it proceeds to annotate the Bayesian network with messages used in VMP, resulting in Figure 10. The expressions that calculate the messages (e.g., ) depend on not only the structure of the Bayesian network and whether the vertices are observed or not, but also practical consideration of efficiency and constraints on GraphX.
To convert the Bayesian network to a message passing graph on GraphX, InferSpark needs to construct a VertexRDD and an EdgeRDD. This step generates the MPG construction code specific to the data. Figure 11 shows the MPG construction code generated for the twocoin model. The vertices are constructed by the union of three RDD’s, one of which from the data and the others from parallelized collections (lines 8 and line 9 in Figure 11). The edges are built from the data only. A partition strategy specific to the data is also generated in this step.
In addition to generating code to build the large message passing graph, the codegen module also generates code for VMP iterative inference. InferSpark, which distributes the computation, needs to create a schedule of parallel updates that is equivalent to the original VMP algorithm, which only updates one vertex in each iteration. Different instances of the same random variables can be updated at the same time. An example update schedule for the twocoins model is ( and ) . VMP inference code that enforces the update schedule is then generated.
3.5 Getting the Results
The inference results can be queried through the “getResult” API on fields in the model instance that retrieves a VertexRDD of approximate marginal posterior distribution of the corresponding random variable. For example, in Line 13 of Figure 7, “m.phi.getResult()” returns a VertexRDD of two Dirichlet distributions. The user can also call “lowerBound” on the model instance to get the evidence lower bound (ELBO) of the result, which is higher when the KL divergence between the approximate posterior distribution and the true posterior is smaller.
The user can also provide a callback function that will be called after initialization and each iteration. In the function, the user can write progress reporting code based on the inference result so far. For example, this function may return false whenever the ELBO improvement is smaller than a threshold (see Figure 12) indicating the result is good enough and the inference should be terminated.
4 Implementation
The main jobs of InferSpark are Bayesian network construction and code generation (Figure 6). Bayesian network construction first extracts Bayesian network template from the model definition and transforms it into a Scala class with inference and query APIs at compile time. Then, code generation takes those as inputs and generates a Spark program that can generate the messaging passing graph with VMP on top. Afterwards, the generated program would be executed on Spark.
We use the code generation approach because it enables a more flexible API than a library. For a library, there are fixed number of APIs for user to provide data, while InferSpark can dynamically generate custommade APIs according to the structure of the Bayesian network. Another reason for using code generation is that compiled programs are always more efficient than interpreted programs.
4.1 Bayesian Network Construction
In this offline compilation stage, the model definition is first transformed into a Bayesian network. We use the macro annotation, a compiletime meta programming facility of Scala. It is currently supported via the macroparadise plugin. After the parser phase, the class annotated with “@Model” annotation is passed from the compiler to its transform method. InferSpark treats the class passed to it as model definition and transforms it into a Bayesian network.
ModelDef  ::=  ‘@Model’ ‘class’ id 
‘(’ ClassParamsOpt ‘)’ ‘{’ Stmts ‘}’  
ClassParamsOpt  ::=  ‘’ /* Empty */ 
  ClassParams  
ClassParams  ::=  ClassParam [‘,’ ClassParams] 
ClassParam  ::=  id ‘:’ Type 
Type  ::=  ‘Long’  ‘Double’ 
Stmts  ::=  Stmt [[semi] Stmts] 
Stmt  ::=  ‘val’ id = Expr 
Expr  ::=  ‘{’ [Stmts [semi]] Expr ‘}’ 
  DExpr  
  RVExpr  
  PlateExpr  
  Expr ‘.’ ‘map’ ‘(’ id => Expr ‘)’  
DExpr  ::=  Literal 
  id  
  DExpr (‘+’  ‘’  ‘*’  ‘/’) DExpr  
  (‘+’  ‘’) DExpr  
RVExpr  ::=  ‘Dirichlet’ ‘(’ DExpr ‘,’ DExpr ‘)’ 
  ‘Beta’ ‘(’ DExpr ‘)’  
  ‘Categorical’ ‘(’ Expr ‘)’  
  RVExpr RVArgList  
  id  
RVArgList  ::=  ‘(’ RVExpr ‘)’ [ RVArgList ] 
PlateExpr  ::=  DExpr ‘until’ DExpr 
  DExpr ‘to’ DExpr  
  ‘?’  
  id 
Figure 13 shows the syntax of InferSpark model definition. The expressions in a model definition is divided into 3 categories: deterministic expressions (DExpr), random variable expressions (RVExpr) and plate expressions. The deterministic expressions include literals, class parameters and their arithemetic operations. The random variable expressions define random variables or plates of random variables. The plate expressions define plate of known size or unknown size. The random variables defined by an expression can be binded to an identifier by the value definition. It is also possible for a random variable to be binded to multiple or no identifiers. To uniquely represent the random variables, we assign internal names to them instead of using the identifiers.
Internally, InferSpark represents a Bayesian network in a tree form, where the leaf nodes are random variables and the nonleaf nodes are plates. The edges in the tree represent the nesting relation between plates or between a plate and random variables. The conditional dependencies in the Bayesian network are stored in each node. The root of the tree is a predefined plate TOPLEVEL with size 1. Figure 14 is the internal representation of the twocoin model in Figure 9, where , , , , correspond to , , , , respectively. Plate 1 and Plate 2 correponds to the plates defined on lines 3–5 in Figure 7.
If a plate is nested within another plate, the inner plate is repeated multiple times, in which case, the size attribute of the plate node will be computed by summing the size of each repeated inner plate. We call the size attribute in the tree flattened size of a plate. For example, in Figure 8, the flattened size of the innermost plate around is .
InferSpark recursively descends on the abstract syntax tree (AST) of the model definition to construct the Bayesian network. In the model definition, InferSpark follows the normal lexical scoping rules. InferSpark evaluates the expressions to one of the following three results

a node in the tree

a pair where is a random variable node and plate is a plate node among its ancestors, which represents all the random variables in the plate

a determinstic expression that will be evaluated at run time
At this point, apart from constructing the Bayesian network representation, InferSpark also generates the code for metadata collection, a module used in stage 2. For each random variable name bindings, a singleton interface object is also created in the resulting class. The interface object provides “observe” and “getResult” API for later use.
4.2 Code Generation
Code generation happens at run time. It is divided into 4 steps: metadata collection, message annotation, MPG construction code generation and inference execution code generation.
Metadata collection aims to collect the values of the model parameters, check whether random variables are observed or not, the flattened sizes of the plates. These metadata can help to assign VertexID to the vertices on the message passing graph. After the flattened sizes of plates are calculated, we can assign VertexIDs to the vertices that will be constructed in the message passing graph. Each random variable will be instantiated into a number of vertices on the MPG where the number equals to the flattened size of its innermost plate. The vertices of the same random variable are assigned consecutive IDs. For example, may be assigned ID from to . The intervals of IDs of random variables in the same plate are also consecutive. A possible ID assignment to is to . Using this ID assignment, we can easily i) determine which random variable the vertex is from only by determining which interval the ID lies in; ii) find the ID of the corresponding random variable in the same plate by substracting or adding multiples of the flattened plate size (e.g. if ’ ID is then ’s ID is ).
Message annotation aims to annotate the Bayesian Network Template from the previous stage (Section 4.1) with messages to be used in VMP algorithm. The annotated messages are stored in the form of AST and will be incorporated into the the generated code, output of this stage. The rules of the messages to annotate are predefined according to the derivation of the VMP algorithm. After the messages are generated, we generate for each type of random variable a class with the routines for calculating the messages and updating the vertex.
The generated code for constructing the message passing graph requires building a VertexRDD and an EdgeEDD. The VertexRDD is an RDD of VertexID and vertex attribute pairs. Vertices of different random variables are from different RDDs (e.g., v1, v2, and v3 in Figure 11) and have different initialization methods. For unobserved random variables, the source can be any RDD that has the same number of elements as the vertices instantiated from the random variable. For observed random variables, the source must be the data provided by the user. If the observed random variable is in an unnested plate, the vertex id can be calculated by first combining the indices to the data RDD then adding an offset.
One optimization of constructing the EdgeRDD is to reverse the edges. If the code generation process generates an EdgeRDD in straightforward manner, the aggregateMessages function has to scan all the edges to find edges whose destinations are of type because GraphX indexes the source but not the destination. Therefore, when constructing the EdgeRDD, we generate code that reverses the edge so as to enjoy the indexing feature of GraphX. When constructing the graphs, we also take into account the graph partitioning scheme because that has a strong influence on the performance. We discuss this issue in the next section.
The final part is to generate the inference execution code that implements the iterative update of the VMP algorithm. We aim to generate code that updates each vertex in the message passing graph at least once in each iteration. As it is safe to update vertices that do not have mutual dependencies, i.e., those who do not send messages to one another, we divide each iteration into substeps. Each substep updates a portion of the message passing graph that does not have mutual dependencies.
A substep in each iteration consists of two GraphX operations: aggregateMessages and outerJoinVertices. Suppose g is the message passing graph, the code of a substep is:
The RDD msg does not need to be cached because it is only used once. But the code generated has to cache the graph g because the graph is used twice in both aggregateMessages and outerJoinVertices. However, only caching it is not enough, the code generation has to include a line like 4 above to activate the caching process. Once g is cached, code generation evicts the previous (obsolete) graph prevg from the cache. To avoid the long lineage caused by iteratively updating message passing graph, which will overflow the heap space of the drive, the code generation process also adds a line of code to checkpoint the graph to HDFS every iterations.
4.3 Execution
The generated code at run time are sent to the Scala compiler. The resulting byte code are added to the classpath of both the driver and the workers. Then InferSpark initiates the inference iterations via reflection invocation.
4.4 Discussion on Partitioning Strategies
GraphX adopts a vertexcut partitioning approach. The vertices are replicated in edge partitions instead of edges being replicated in vertex partitions. The four builtin partition strategies in GraphX are: EdgePartition1D (1D), EdgePartition2D (2D), RandomVertexCut (RVC), and CanonicalRandomVertexCut (CRVC). In the following, we first show that these general partitioning strategies perform badly for the VMP algorithm on MPG. Then, we introduce our own partitioning strategy.
Figure 15 shows a more typical message passing graph of a mixture model instead of the toy twocoin model that we have used so far. is the number of and , is the number of , is the number of . Typically, is very large because that is the data size (e.g., number of words in LDA), is a small constant (e.g., number of topics in LDA), and could be a constant or as large as (e.g., number of documents in LDA).
EdgePartition1D essentially is a random partitioning strategy, except that it colocates all the edges with the same source vertex. Suppose all the edges from are assigned to partition . Since there’s an edge from to each one of the vertices , partition will have the replications of all . In the best case, edges from different are assigned to different partitions. Then the largest edge partition still have at least vertices. When is very large, the largest edge partition is also very large, which will easily cause the size of an edge partition to exceed the RDD block size limit. However, the best case turns out to be the worst case when it comes to the number of vertex replications because it actually replicates the size data times, which is extremely space inefficient. The overreplication also incurs large amount of shuffling when we perform outer joins because each updated vertex has to be shipped to every edge partition, prolonging the running time.
We give a more formal analysis of the number of vertices in the largest edge partition and the expected number of replications of under EdgePartition1D. As discussed above, there’s at least one edge partition that has replications of all the ’s. Observe that the graph has an upper bound of vertices, so the number of vertices in the largest edge partition is . Let be the number of replications of , then the expected number of replication of is
EdgePartition2D evenly divides the adjacency matrix of the graph into partitions. The vertices are uniformly distributed along the edges of the adjacency matrix by hashing into buckets. The upper bound of the number of replications of a vertex is because all the edges that point to it are distributed to at most partitions in shown as Figure 16. Meanwhile, there are edges pointing to , so the number of replications of cannot exceed as well. Therefore, the upper bound of replications of is actually . On the other hand, suppose each of the is hashed to different bucket and ’s are evenly distributed into the buckets, then the number of largest partition is at least , which is still huge when the average number of words per partition is fixed. Following is the formal analysis of the EdgePartition2D.
Let be an arbitrary partition in the dark column on Figure 16. Let be the indicator variable for the event that is replicated in Then the expectation of is
The number of vertices in the largest partition is at least the expectation of the number of vertices in a partition, which is also at least the expectation of the number of in it:
The expected number of replications of is
RandomVertexCut (RVC) uniformly assigns each edge to one of the partitions. The expected number of replications of tends to be when is a constant and tends to be when is proportional to the number of partitions. The number of vertices in the largest partition is also excessively large. It is when K is a constant and when is proportional to the number of partitions. CanonicalRandomVertexCut assigns two edges between the same pair of vertices with opposite directions to the same partition. For VMP, it is the same as RandomVertexCut since only the destination end of an edge is replicated. For example, if has incoming edges, then the probability that will be replicated in a particular partition is independent from whether edges in opposite direction are in the same partition or randomly distributed. Therefore CRVC will have the same result as RVC. Table 1 and Table 2 summarize the comparison of different partition strategies.
InferSpark’s partitioning strategy is actually tailormade for VMP’s message passing graph. The intuition is that the MPG has a special structure. For example, in Figure 15, we see that the MPG essentially has “independent” trees rooted at , where the leaf nodes are ’s and they form a complete bipartite graph with all ’s. In this case, one good partitioning strategy is to form partitions, with each tree going to one partition and the ’s getting replicated times. We can see that such a partition strategy incurs no replication on , , and , and incurs replications on .
Generally, our partitioning works as follows: Given an edge, we first determine which two random variables (e.g. and ) are connected by the edge. It is quite straightforward because we assign ID to the set of vertices of the same random variable to a consecutive interval. We only need to look up which interval it is in and what the interval corresponds to. Then we compare the total number of vertices correponding to the two random variables and choose the larger one. Let the Vertex ID range of the larger one to be to . We divide the range from to into subranges. The first subrange is to ; the second is to and so on. If the vertex ID of the edge’s chosen vertex falls into the subrange, the edge is assigned to partition .
In the mixture case, at least one end of every edge is or . Since the number of ’s and ’s are the same, each set of edges that link to the or with the same are colocated. This guarantees that and only appears in one partition. All the ’s are replicated in each of the partitions as before. The only problem is that many with small could be replicated to the same location. In the worst case, the number of in one single partition is exactly . However, it is not an issue in that case because the number of vertices in the largest partition is still a constant . It is also independent from whether or .
Partition Strategy  

1D  
2D  
RVC  
CRVC  
InferSpark  1 
Partition Strategy  

1D  
2D  
RVC  
CRVC  
InferSpark  1 
5 Evaluation
In this section, we present performance evaluation of InferSpark, based on constructing and carrying out statistic inference on three models: Latent Dirichlet Allocation (LDA), SentenceLDA (SLDA) [13], and Dirichlet Compound Multinomial LDA (DCMLDA) [9]. LDA is a standard model in topic modeling, which takes in a collection of documents and infers the topics of the documents. SentenceLDA (SLDA) is a model for finding aspects in online reviews, which takes in online reviews, and infers the aspects. Dirichlet Compound Multinomial LDA (DCMLDA) is another topic model that accounts for burstiness in documents. All models can be implemented in InferSpark using less than 9 lines of code (see Figure 1 and Appendix A). For comparison, we include MLlib in our study whenever applicable. MLlib includes LDA as standard models. However, MLlib does not include SLDA and DCMLDA. There are other probabilistic programming frameworks apart from Infer.NET (see Section 6). All of them are unable to scaleout onto multiple machines yet. Infer.NET so far is the most predominant one with the best performance, so we also include it in our study whenever applicable.
All the experiments are done on nodes running Linux with 2.6GHz quadcore, 32GB memory, and 700GB hard disk. Spark 1.4.1 with scala 2.11.6 is installed on all nodes. The default cluster size for InferSpark and MLlib is 24 data nodes and 1 master node. Infer.NET can only use one such node. The data for running LDA, SLDA, and DCMLDA are listed in Table 3. The wikipedia dataset is the wikidump. Amazon is a dataset of Amazon reviews used in [13]. We run 50 iterations and do checkpointing every 10 iterations for each model on each dataset.




5.1 Overall Performance
Figure 17 shows the time of running LDA, SLDA, and DCMLDA on InferSpark, Infer.NET, and MLlib. Infer.NET cannot finish the inference tasks on all three models within a week. MLlib supports only LDA, and is more efficient than InferSpark in that case. However, we remark that MLlib uses the EM algorithm which only calculates Maximum A Posterior instead of the full posterior and is specific to LDA. In contrast, InferSpark aims to provide a handy programming platform for statistician and domain users to build and test various customized models based on big data. It would not be possible to be done by any current probabilistic frameworks nor with Spark/GraphX directly unless huge programming effort is devoted. MLlib versus InferSpark is similar to C++ programs versus DBMS: highly optimized C++ programs are more efficient, but DBMS achieves good performance with lower development time. From now on, we focus on evaluating the performance of InferSpark.
Table 4 shows the time breakdown of InferSpark. The inference process executed by GraphX, as expected, dominates the running time. The MPG construction step executed by Spark, can finish within two minutes. The Bayesian network construction and code generation can be done in seconds.
Model  B.N. Construction  Code Generation  Execution  Total  

MPG Construction  Inference  
LDA 541644 words  21.911  1.34%  11.15  0.68%  38.147  2.33%  1566.692  95.65%  1637.9 
LDA 1324816 words  21.911  0.70%  12.25  0.39%  79.4  2.55%  3002.1  96.36%  3115.661 
SLDA 349569 words  21.867  1.76%  11.05  0.89%  26.33  2.12%  1182.2  95.23%  1241.447 
SLDA 607430 words  21.867  0.96%  11.69  0.52%  41.152  1.81%  2193.391  96.71%  2268.1 
DCMLDA 1324816 words  22.658  0.65%  10.52  0.30%  20.923  0.60%  3448.699  98.46%  3502.8 
DCMLDA 2596155 words  22.658  0.28%  11.55  0.14%  39.549  0.48%  8153.969  99.10%  8227.726 
5.2 ScalingUp
Figure 18 shows the total running time of LDA, SLDA, and DCMLDA on InferSpark by scaling the data size (in words). InferSpark scales well with the data size. DCMLDA exhibits even superlinear scaleup. This is because as the data size goes up, the probability of selecting larger documents goes up. Consequently, the growth in the total number of random variables is less than proportional, which gives rise to the superlinearity.
5.3 ScalingOut
Figure 19 shows the total running time of LDA on InferSpark in different cluster sizes. For each model, we use fixed size of dataset. DCMLDA and LDA both use the 2% Wikipedia dataset. SLDA uses the 50% amazon dataset. We observe that InferSpark can achieve linear scaleout.
5.4 Partitioning Strategy
Figure 20 shows the running time of LDA(0.2% Wikipedia dataset, 96 topics) on InferSpark using our partitioning strategy and GraphX partitioning strategies: EdgePartition2D (2D) RandomVertexCut (RVC), CanonicalRandomVertexCut (CRVC), and EdgePartition1D (1D). We observe that the running time is propotional to the size of EdgeRDD. Our partition strategy yields the best performance for running VMP on the message passing graphs. Our analysis shows that RVC and CRVC should have the same results. The slight difference in the figure is caused by the randomness of different hash functions.
6 Related Work
To the best of our knowledge, InferSpark is the only framework that can efficiently carry out statistical inference through probabilistic programming on a distributed inmemory computing platform. MLlib, Mahout [2], and MADLib [12] are machine learning libraries on top of distributed computing platforms and relational engines. All of them provide many standard machine learning models such as LDA and SVM. However, when a domain user, say, a machine learning researcher, is devising and testing her customized models with her big data, those libraries cannot help. MLBase [15] is related project that shares a different vision with us. MLBase is a suite of machine learning algorithms and provides a declarative language for users to specify machine learning tasks. Internally, it borrows the concept of query optimizer in traditional databases and has an optimizer that selects the best set of machine learning algorithms (e.g., SVM, Adaboost) for a specific task. InferSpark, on the other hand, goes for a programming language approach, which extends Scala with the emerging probabilistic programming constructs, and carries out statistical inference at scale. MLI [20] is an API on top of MLBase (and Spark) to ease the development of various distributed machine learning algorithms (e.g., SGD). In the same vein as MLI, SystemML [10] provides Rlike language to ease the development of various distributed machine learning algorithms as well. In [23] the authors present techniques to optimize inference algorithms in a probabilistic DBMS.
There are a number of probabilistic programming frameworks other than Infer.NET [17]. For example, Church [11] is a probabilistic programming language based on the functional programming language Scheme. Church programs are interpreted rather than compiled. Random draws from a basic distribution and queries about the execution trace are two additional type of expressions. A Church expression defines a generative model. Queries of a Church expression can be conditioned on any valid church expressions. Nested queries and recursive functions are also supported by Church. Church supports stochasticmemoizer which can be used to express nonparametric models. Despite the expressive power of Church, it cannot scale for large dataset and models. Figaro is a probabilistic programming language implemented as a library in Scala [18]. It is similar to Infer .NET in the way of defining models and performing inferences but put more emphasis to objectorientation. Models are defined by composing instances of Model classes defined in the Figaro library. Infer.NET is a probabilistic programming framework in C# for Bayesian Inference. A rich set of variables are available for model definition. Models are converted to a factor graph on which efficient builtin inference algorithms can be applied. Infer.NET is the best optimized probabilistic programming frameworks so far. Unfortunately, all existing probabilistic programming frameworks including Infer.NET cannot scale out on to a distributed platform.
7 Conclusion
This paper presents InferSpark, a probabilistic programming framework on Spark. Probabilistic programming is an emerging paradigm that allows statistician and domain users to succinctly express a statistical model within a host programming language and transfers the burden of implementing the inference algorithm from the user to the compilers and runtime systems. InferSpark, to our best knowledge, is the first probabilistic programming framework that builts on top of a distributed computing platform. Our empirical evaluation shows that InferSpark can successfully express some known Bayesian models in a very succinct manner and can carry out distributed inference at scale. InferSpark will opensource. The plan is to invite the community to extend InferSpark to support other types of statistical models (e.g., Markov networks) and to support more kinds of inference techniques (e.g., MCMC).
8 Future Directions
Our prototype InferSpark system only implements the variational message passing inference algorithm for certain exponentialconjugate family Bayesian networks (namely mixtures of Categorical distributions with Dirichlet priors). In our future work, we plan to include support for other common types of Bayesian networks (e.g. those with continuous random variables or arbitrary priors). The VMP algorithm may be no longer applicable to these Bayesian networks because they may have nonconjugate priors or distributions out of exponential family. In order to handle wider classes of graphical models, we also plan to incorporate other inference algorithms (e.g. Belief propagation, Gibbs Sampling) into our system, which could be quite challenging because we have to 1) deal with arbitrary models 2) adapt the algorithm to distributed computing framework.
Another interesting future direction is to allow implementation of customized inference algorithms as plugins to the InferSpark compiler. To make the development of customized inference algorithms in InferSpark easier than directly writing them in a distributed computing framework, we plan to 1) revise the semantics of the Inferspark model definition language and expose a clean Bayesian network representation 2) provide a set of frameworkindependent operators for implementing the inference algorithms 3) investigate how to optimize the operators on Spark.
References
 [1] Apache Hadoop. http://hadoop.apache.org/.
 [2] Apache Mahout. http://mahout.apache.org.
 [3] Probabilistic programming. http://probabilisticprogramming.org/wiki/Home.
 [4] Spark MLLib. http://spark.apache.org/mllib/.
 [5] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. JMLR, 3:993–1022, 2003.

[6]
W. Bolstad.
Introduction to Bayesian Statistics
. Wiley, 2004.  [7] S. Chib and E. Greenberg. Understanding the metropolishastings algorithm. The American Statistician, 49(4):327–335, Nov. 1995.
 [8] D. R. Cox. Principles of statistical inference. Cambridge University Press, Cambridge, New York, 2006.
 [9] G. Doyle and C. Elkan. Accounting for burstiness in topic models. Icml, (Dcm):281–288, 2009.
 [10] A. Ghoting, R. Krishnamurthy, E. Pednault, B. Reinwald, V. Sindhwani, S. Tatikonda, Y. Tian, and S. Vaithyanathan. Systemml: Declarative machine learning on mapreduce. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering, ICDE ’11, pages 231–242, 2011.
 [11] N. D. Goodman, V. K. Mansinghka, D. M. Roy, K. Bonawitz, and J. B. Tenenbaum. Church: a language for generative models. In UAI, 2008.
 [12] J. M. Hellerstein, C. Ré, F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton, X. Feng, K. Li, and A. Kumar. The madlib analytics library: Or mad skills, the sql. Proc. VLDB Endow., 5(12):1700–1711, Aug. 2012.
 [13] Y. Jo and A. Oh. Aspect and Sentiment Unification Model for Online Review Analysis. In Proceedings of the fourth ACM international conference on Web search and data mining, 2011.
 [14] D. Koller and N. Friedman. Probabilistic graphical models: principles and techniques. 2009.
 [15] T. Kraska, A. Talwalkar, J. C. Duchi, R. Griffith, M. J. Franklin, and M. I. Jordan. Mlbase: A distributed machinelearning system. In CIDR, 2013.
 [16] Q. Mei, X. Ling, M. Wondra, H. Su, and C. Zhai. Topic sentiment mixture: modeling facets and opinions in weblogs. Proceedings of the 16th international conference on World Wide Web  WWW ’07, page 171, 2007.
 [17] T. Minka, J. Winn, J. Guiver, S. Webster, Y. Zaykov, B. Yangel, A. Spengler, and J. Bronskill. Infer.NET 2.6, 2014. Microsoft Research Cambridge. http://research.microsoft.com/infernet.
 [18] A. Pfeffer. Figaro: An objectoriented probabilistic programming language. Technical report, Charles River Analytics, 2009.
 [19] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz. A bayesian approach to filtering junk Email. In Learning for Text Categorization: Papers from the 1998 Workshop, Madison, Wisconsin, 1998. AAAI Technical Report WS9805.
 [20] E. R. Sparks, A. Talwalkar, V. Smith, J. Kottalam, X. Pan, J. E. Gonzalez, M. J. Franklin, M. I. Jordan, and T. Kraska. Mli: An api for distributed machine learning. In ICDM, pages 1187–1192, 2013.
 [21] I. Titov and R. McDonald. Modeling Online Reviews with Multigrain Topic Models. 2008.
 [22] I. Titov, I. Titov, R. McDonald, and R. McDonald. A joint model of text and aspect ratings for sentiment summarization. Acl08, 51(June):61801, 2008.
 [23] D. Z. Wang, M. J. Franklin, M. Garofalakis, J. M. Hellerstein, and M. L. Wick. Hybrid indatabase inference for declarative information extraction. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD ’11, pages 517–528, New York, NY, USA, 2011. ACM.
 [24] J. M. Winn and C. M. Bishop. Variational message passing. In JMLR, pages 661–694, 2005.
 [25] R. S. Xin, J. E. Gonzalez, M. J. Franklin, and I. Stoica. Graphx: A resilient distributed graph system on spark. In GRADES ’13, pages 2:1–2:6.
 [26] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. In Proceedings of the 2Nd USENIX Conference on Hot Topics in Cloud Computing, HotCloud’10, pages 10–10, Berkeley, CA, USA, 2010. USENIX Association.
Comments
There are no comments yet.