Particle-Gibbs Sampling For Bayesian Feature Allocation Models

01/25/2020 · Alexandre Bouchard-Côté et al. · The University of British Columbia

Bayesian feature allocation models are a popular tool for modelling data with a combinatorial latent structure. Exact inference in these models is generally intractable and so practitioners typically apply Markov Chain Monte Carlo (MCMC) methods for posterior inference. The most widely used MCMC strategies rely on an element wise Gibbs update of the feature allocation matrix. These element wise updates can be inefficient as features are typically strongly correlated. To overcome this problem we have developed a Gibbs sampler that can update an entire row of the feature allocation matrix in a single move. However, this sampler is impractical for models with a large number of features as the computational complexity scales exponentially in the number of features. We develop a Particle Gibbs sampler that targets the same distribution as the row wise Gibbs updates, but has computational complexity that only grows linearly in the number of features. We compare the performance of our proposed methods to the standard Gibbs sampler using synthetic data from a range of feature allocation models. Our results suggest that row wise updates using the PG methodology can significantly improve the performance of samplers for feature allocation models.


1 Introduction

Bayesian feature allocation models posit that observed data is generated by a collection of latent features, with the aim of obtaining an interpretable and sparse representation of the data. A concrete way to represent a feature allocation is using a binary matrix, where the rows of this matrix represent data points or observations and the columns represent features. Common prior distributions for these binary matrices include the finite dimensional Beta-Bernoulli (FBB) model and the non-parametric Indian Buffet Process (IBP) (Griffiths and Ghahramani, 2011). Exact inference for models which use these prior distributions is generally intractable, so practitioners often appeal to Markov chain Monte Carlo (MCMC) approaches. A straightforward Gibbs sampler can be derived for such models which proceeds by updating a single entry of the binary matrix conditioned on the values of the remaining entries. While relatively simple to implement, this sampler can be extremely slow to mix due to the correlation among the feature allocation variables. In this work we show that it is possible to derive a simple Gibbs sampler which updates the entire feature usage vector of a data point (row of the binary matrix) jointly. When the number of features (columns of the matrix), K, is small this sampler is practical and can significantly improve the mixing of the MCMC chain. However, this sampler is computationally expensive, requiring 2^K evaluations of the likelihood. We show that it is possible to sample efficiently from the distribution targeted by this row Gibbs update using the Particle Gibbs (PG) methodology (Andrieu et al., 2010). Our PG sampling approach has computational complexity that scales linearly with the number of features.

In the sequel we first review Bayesian feature allocation models and the relevant prior distributions on binary matrices. Next we describe the new row wise Gibbs update and explain how to use the PG methodology to efficiently sample from the target distribution. We then compare these new samplers to existing approaches on a range of synthetic datasets using several previously published models. Finally, we conclude with a discussion and some thoughts on future directions.

1.1 Related work

The widely used Gibbs sampler which updates the feature allocation of each data point sequentially was first introduced by Ghahramani and Griffiths (2006). Later, Meeds et al. (2007) described the use of Metropolis-Hastings (MH) moves to update multiple components of the feature allocation vector for a data point. They observed that larger moves in the space of feature allocations improved mixing, though this was never formally benchmarked. While the MH move partially addresses the issue of highly correlated features, it becomes impractical as the number of features grows, as large moves proposed at random will increasingly be rejected. An alternative approach to speeding up sampling for feature allocation models was proposed by Doshi-Velez and Ghahramani (2009). The main idea of that work was to partially marginalize elements of the model to improve mixing. This is not a general strategy, however, as it requires conjugacy. Wood and Griffiths (2007) proposed the use of particle filters to fit matrix factorization models using IBP priors. In contrast to our approach, they used a single pass particle filter sampling the entire feature allocation matrix. They showed that this approach could significantly outperform single entry Gibbs sampling. However, the single pass particle filter approach does not scale well to models with large numbers of data points or features due to the degeneracy of standard particle filter methods. In contrast, PG algorithms are not subject to this degeneracy. Broderick et al. (2013) pointed out that the predictive distribution of feature allocation models can be written as a product of Bernoulli distributions. However, they did not appear to pursue the obvious row wise Gibbs sampler that this implies.

Fox et al. (2014) proposed the use of split-merge moves to improve the mixing of features. While the sequential nature of these proposals bears some similarity to our method, they differ in that this previous work updates the columns of the feature allocation matrix as opposed to the rows. As a result, these moves need to be interleaved with element wise Gibbs updates to obtain adequate mixing. The methods we propose in this work can be used in place of the element wise Gibbs updates alongside the moves proposed by Fox et al. (2014) to further improve performance.

2 Methods

Here we review the basic background on feature allocation priors and introduce the standard Gibbs updating procedure. We then explain how to implement an exact Gibbs sampler for updating an entire row of the feature allocation matrix. Next we show how to construct a PG sampler to target the conditional distribution that the row wise Gibbs update samples from. We then discuss two strategies for improving the efficiency of the basic PG algorithm.

2.1 Notation

We use bold letters for (random) vectors, capital letters for matrices and normal fonts for (random) scalars and sets. For quantities such as an individual observation x_n, or a parameter θ_k, which can be either scalars or vectors without affecting our methodology, we consider them as scalars without loss of generality. Given a vector x = (x_1, …, x_K) and indices 1 ≤ a ≤ b ≤ K, we use x_{a:b} to denote the sub-vector (x_a, …, x_b). For a permutation, σ, we let x_σ denote the vector obtained by permuting the entries of x by σ, that is (x_σ)_i = x_{σ(i)}. For a permutation σ we define the inverse permutation σ^{-1} to be the permutation such that σ^{-1}(σ(i)) = i for all i. To simplify notation, we do not distinguish random variables from their realization. We define discrete probability distributions by their probability mass functions, and continuous probability distributions by their density functions with respect to the Lebesgue measure.

2.2 Feature allocation

Intuitively a feature allocation model ascribes a set of features to a set of data points. At the core of these models is a combinatorial stochastic object, the feature allocation. To formally define a feature allocation we follow the description in Broderick et al. (2013). Let [N] = {1, …, N}; a feature allocation of [N] is defined to be a multi-set of non-empty subsets of [N]. Let f_N = {A_1, …, A_K}, where we refer to the elements A_k as blocks. Each block represents the assignment of data points to a feature. For example, consider the feature allocation f_3 = {{1}, {1, 2}, {2, 3}}. In this feature allocation the first feature is exhibited by data point 1, the second feature by data points 1 and 2, and the third feature by data points 2 and 3. In contrast to partitions, which are frequently used in clustering models, feature allocations do not require data points to be in mutually exclusive blocks, or in fact to be in any block. If we let K = |f_N| then we can map the feature allocation to a binary matrix Z ∈ {0, 1}^{N×K}. The rows of Z represent data points and the columns represent features. We use the notation z_n to denote the n-th row of Z, that is the vector indicating which features data point n uses. Note that the ordering of features is arbitrary, so the matrix Z is only defined up to a permutation of the columns; strictly speaking, the feature allocation prior distribution is defined on the equivalence class of matrices that are identical up to a permutation of their columns. An alternative way to define this equivalence class is as the set of matrices which are equivalent when put into left-ordered form (Griffiths and Ghahramani, 2011). In the sequel we will abuse notation and not make the distinction between a feature allocation f_N and its binary matrix representation Z.
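The mapping from a feature allocation to its binary matrix representation can be sketched as follows (a minimal illustration; `allocation_to_matrix` is our own helper name, and data points are 0-indexed in code):

```python
import numpy as np

def allocation_to_matrix(blocks, n_points):
    """Map a feature allocation (a list of blocks) to an N x K binary matrix Z.

    Each block is the set of data points (0-indexed) exhibiting one feature.
    The column order is arbitrary, reflecting that Z is only defined up to a
    permutation of its columns.
    """
    Z = np.zeros((n_points, len(blocks)), dtype=int)
    for k, block in enumerate(blocks):
        for n in block:
            Z[n, k] = 1
    return Z

# The example from the text: the first feature is exhibited by data point 1,
# the second by points 1 and 2, the third by points 2 and 3 (1-indexed).
Z = allocation_to_matrix([{0}, {0, 1}, {1, 2}], n_points=3)
print(Z)
```

Reading off the rows of `Z` recovers which features each data point uses.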

2.3 Feature allocation prior distributions

To specify a Bayesian feature allocation model we need to define a prior distribution for the feature allocation. In this work we consider the two most widely used prior distributions for feature allocations, the finite Beta-Bernoulli (FBB) distribution and the Indian Buffet Process (IBP). Below we give: the probability mass function of these distributions; the probability p(z_{nk} = 1) that a data point exhibits feature k; and the predictive distribution when adding a new data point. The predictive distribution is defined as p(z_{N+1} | Z). Let m_k = Σ_{n=1}^{N} z_{nk} be the number of data points exhibiting feature k and K_+ the number of features exhibited by at least one data point, then these quantities are as follows:

  • FBB with K features:

    p(Z) = ∏_{k=1}^{K} B(m_k + α/K, N − m_k + 1) / B(α/K, 1)
    p(z_{nk} = 1) = (α/K) / (α/K + 1)
    p(z_{(N+1)k} = 1 | Z) = (m_k + α/K) / (N + 1 + α/K)

  • Indian Buffet Process:

    p(f_N) = (α^{K_+} / K_+!) exp(−α H_N) ∏_{k=1}^{K_+} (m_k − 1)! (N − m_k)! / N!
    p(z_{(N+1)k} = 1 | Z) = m_k / (N + 1) for shared features, with κ_{N+1} ∼ Poisson(α / (N + 1)) new features

where B(·, ·) is the Beta function, H_N = Σ_{n=1}^{N} 1/n is the N-th harmonic number and κ_{N+1} is the number of singleton (unique) features exhibited only by data point N + 1. We note that this definition of the IBP prior differs slightly from the original one defined in Ghahramani and Griffiths (2006). This construction is due to Broderick et al. (2013) and results in an exchangeable prior, as the probability mass functions only depend on the number of features and the sizes of the blocks. As we will see later, this is useful for defining a Gibbs sampler for updating the feature allocation variables.
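The two predictive distributions can be sketched in a few lines; the FBB version below assumes the standard Beta(α/K, 1) parameterization of the feature weights (the paper's exact parameterization may differ), and the function names are our own:

```python
import numpy as np

def fbb_predictive(Z, alpha):
    """Probability that data point N + 1 exhibits each feature under the FBB
    prior, assuming Beta(alpha/K, 1) feature weights."""
    N, K = Z.shape
    m = Z.sum(axis=0)                   # number of points exhibiting each feature
    return (m + alpha / K) / (N + 1 + alpha / K)

def ibp_predictive(Z, alpha):
    """IBP predictive: existing feature k is exhibited with probability
    m_k / (N + 1), and the number of brand-new singleton features follows a
    Poisson(alpha / (N + 1)) distribution."""
    N, _ = Z.shape
    m = Z.sum(axis=0)
    return m / (N + 1), alpha / (N + 1)

Z = np.array([[1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]])
print(fbb_predictive(Z, alpha=1.0))
print(ibp_predictive(Z, alpha=1.0))
```

Note how popular features (large m_k) are more likely to be used by the next data point under both priors.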

2.4 Bayesian feature allocation models

To fully specify a Bayesian feature allocation model we need two additional elements. First, a set of parameters θ = (θ_1, …, θ_K) associated with the features. We will assume that the θ_k are drawn i.i.d. from a common distribution p(θ_k), so that the features are exchangeable. Second, we need to define a likelihood for our data which depends on the feature allocation matrix Z and the feature parameters θ. We also assume that the data points are exchangeable, so the likelihood takes the form ∏_{n=1}^{N} p(x_n | z_n, θ). In order for the model to remain exchangeable we require, for any permutation of the feature indices, σ, that p(x_n | z_n, θ) = p(x_n | (z_n)_σ, θ_σ). With these assumptions the full joint distribution is given by Equation 1.

p(X, Z, θ) = p(Z) [∏_{k=1}^{K} p(θ_k)] ∏_{n=1}^{N} p(x_n | z_n, θ)   (1)

In general the component distributions will also depend on additional hyper-parameters which may also have prior distributions. For notational clarity we have suppressed these terms and any dependencies on these hyper-parameters.

As a concrete example, consider the linear Gaussian feature allocation model. This model assumes the feature parameters follow a multivariate normal distribution, and the data follow a multivariate normal distribution whose mean is the sum of the parameters of the features that a data point exhibits. Note that the likelihood is invariant to permutations of the feature indexes due to the linear sum construction. This model is thus exchangeable in both data points and features.
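A simulator for this model fits in a few lines; the precision parameterization and the names `tau_theta` and `tau_x` are our own illustrative choices, not the paper's notation:

```python
import numpy as np

def simulate_linear_gaussian(Z, tau_theta=1.0, tau_x=25.0, n_dims=2, seed=0):
    """Simulate from a linear Gaussian feature allocation model: feature
    parameters theta_k ~ N(0, 1/tau_theta) i.i.d. per dimension, and each data
    point x_n ~ N(sum of the theta_k it uses, 1/tau_x)."""
    rng = np.random.default_rng(seed)
    N, K = Z.shape
    theta = rng.normal(0.0, tau_theta ** -0.5, size=(K, n_dims))
    # Z @ theta sums the parameters of the features each data point exhibits.
    X = Z @ theta + rng.normal(0.0, tau_x ** -0.5, size=(N, n_dims))
    return X, theta

Z = np.array([[1, 0], [1, 1], [0, 1]])
X, theta = simulate_linear_gaussian(Z)
print(X.shape)
```

The `Z @ theta` term makes the permutation invariance explicit: reordering the columns of Z and the rows of theta together leaves the mean unchanged.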

2.5 Gibbs updates

Since data points are exchangeable we can use the predictive distribution to derive a simple Gibbs sampler to update the entries of Z, by assuming we are observing the last data point to be assigned. Let Z_{-n} indicate the entries of Z minus the n-th row. Let m_{-n,k} = Σ_{i≠n} z_{ik} and K_+ be defined by replacing Z in the definitions of the previous section with Z_{-n}. The element wise Gibbs update takes the form given by Equation 2 for the FBB model, leading to Algorithm 1 for updating a row.

p(z_{nk} | z_{-n,k}, z_{n,-k}, θ, x_n) ∝ p(z_{nk} | z_{-n,k}) p(x_n | z_n, θ), where p(z_{nk} = 1 | z_{-n,k}) = (m_{-n,k} + α/K) / (N + α/K)   (2)

The update for the IBP prior is slightly more complex and is performed in two parts. Columns for features which are also exhibited by other data points are updated using the Gibbs update in Algorithm 1, as for the FBB. The columns for features exhibited only by the current data point, the singletons, are then updated with another move which leaves the target distribution invariant. The simplest of these is a Metropolis-Hastings update in which the number of singletons is proposed from a Poisson distribution with parameter α/N, and the corresponding feature parameters from their prior distributions. The methods we describe in this work apply only to the non-singleton updates, and can be used with any update for the singletons.
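Because the singletons are proposed from the prior and their count from the Poisson, the prior and proposal terms cancel in the Metropolis-Hastings ratio, leaving only a likelihood ratio. A sketch for a one-dimensional linear Gaussian likelihood (the interface and all names are our own assumptions):

```python
import math
import numpy as np

def update_singletons(x, shared_sum, theta_single, alpha, n_points,
                      tau_x, tau_theta, rng):
    """MH update of the singleton features of one data point. The number of
    singletons is proposed from Poisson(alpha / N) and their parameters from
    the N(0, 1/tau_theta) prior, so the acceptance probability reduces to a
    likelihood ratio."""
    proposed = rng.normal(0.0, tau_theta ** -0.5,
                          size=rng.poisson(alpha / n_points))

    def log_lik(singles):
        # Linear Gaussian likelihood: mean is the sum of all used features.
        mean = shared_sum + np.sum(singles)
        return -0.5 * tau_x * (x - mean) ** 2

    log_ratio = log_lik(proposed) - log_lik(theta_single)
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return list(proposed)
    return list(theta_single)

rng = np.random.default_rng(1)
new_singles = update_singletons(x=0.3, shared_sum=0.0, theta_single=[2.5],
                                alpha=1.0, n_points=10, tau_x=10.0,
                                tau_theta=1.0, rng=rng)
print(new_singles)
```

Any valid singleton move can be substituted here without affecting the non-singleton updates described below.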

1:function ElementWiseGibbsUpdate(x, z, Z_{-n}, θ)
2:     for k ∈ RandomPermutation(1, …, K) do ▷ Iterate over columns in random order.
3:         z_k ← 0
4:         p_0 ← p(z_k = 0 | z_{-n,k}) p(x | z, θ)
5:         z_k ← 1
6:         p_1 ← p(z_k = 1 | z_{-n,k}) p(x | z, θ)
7:         p ← p_1 / (p_0 + p_1)
8:         z_k ∼ Bernoulli(p)
9:     end for
10:     return z
11:end function
Algorithm 1 Sample a row of the feature allocation using the element wise Gibbs update.
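The element wise sweep can be sketched in Python for the FBB prior; `m_minus[k]` counts the other data points using feature k and `log_lik` is a user-supplied log likelihood (both our own interface choices, working in log space for numerical stability):

```python
import numpy as np

def element_wise_gibbs_update(x, z, m_minus, theta, alpha, n_points,
                              log_lik, rng):
    """One sweep of the element wise Gibbs update for a single row z, assuming
    an FBB prior with Beta(alpha/K, 1) feature weights."""
    K = len(z)
    for k in rng.permutation(K):            # iterate over columns in random order
        prior_1 = (m_minus[k] + alpha / K) / (n_points + alpha / K)
        log_p = np.empty(2)
        for val in (0, 1):
            z[k] = val
            log_prior = np.log(prior_1 if val == 1 else 1.0 - prior_1)
            log_p[val] = log_prior + log_lik(x, z, theta)
        p1 = 1.0 / (1.0 + np.exp(log_p[0] - log_p[1]))
        z[k] = int(rng.random() < p1)       # sample the entry from its conditional
    return z

theta = np.array([1.0, -1.0])
log_lik = lambda x, z, theta: -0.5 * (x - z @ theta) ** 2
rng = np.random.default_rng(0)
z = element_wise_gibbs_update(x=1.0, z=np.array([0, 1]),
                              m_minus=np.array([3, 1]), theta=theta,
                              alpha=1.0, n_points=10, log_lik=log_lik, rng=rng)
print(z)
```

Each entry requires two likelihood evaluations, so a full sweep costs O(K) likelihood calls.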

2.6 Row wise Gibbs updates

The element wise Gibbs update has been widely used. It only requires O(K) evaluations of the likelihood function to update a row. However, the resulting sampler can be extremely slow to mix due to correlations between the features. The form of the predictive distributions for the FBB and IBP priors suggests an alternative Gibbs update that could potentially lead to better mixing. Rather than sampling a single entry at a time, we instead update an entire row, z_n, of the feature allocation matrix. This can be done using the update defined by Equation 3, leading to Algorithm 2 for updating a row. Again, this update only applies to the non-singleton entries when using the IBP prior.

p(z_n | Z_{-n}, θ, x_n) ∝ [∏_{k=1}^{K} p(z_{nk} | z_{-n,k})] p(x_n | z_n, θ)   (3)

In order to sample from the distribution defined by Equation 3 we need to enumerate all 2^K possible binary vectors of length K and evaluate the likelihood function for each. This approach leads to a sampler with computational complexity O(2^K). For moderate values of K, particularly if we are using the parametric FBB prior, this is a practical sampler. However, the exponential scaling in K will render this approach infeasible for larger numbers of features. This is especially problematic when using the IBP prior, as K varies between iterations.

1:function RowWiseGibbsUpdate(x, Z_{-n}, θ)
2:     i ← 1 ▷ Counter for number of vectors
3:     V ← () ▷ List to store vectors
4:     for z ∈ {0, 1}^K do ▷ Iterate over all possible feature allocation vectors.
5:         w_i ← p(z | Z_{-n}) p(x | z, θ)
6:         V_i ← z ▷ Add to list of visited vectors
7:         i ← i + 1
8:     end for
9:     for i = 1, …, 2^K do
10:         w̃_i ← w_i / Σ_j w_j ▷ Normalize probabilities
11:     end for
12:     j ∼ Categorical(w̃_1, …, w̃_{2^K}) ▷ Sample vector index with probability w̃_j
13:     z ← V_j
14:     return z
15:end function
Algorithm 2 Sample a row of the feature allocation using the row wise Gibbs update.
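The exhaustive row update is short to sketch in Python; as above, `m_minus` and `log_lik` are our own interface choices and the prior predictive is the standard FBB form:

```python
import itertools
import numpy as np

def row_wise_gibbs_update(x, m_minus, theta, alpha, n_points, log_lik, rng):
    """Sample an entire row by enumerating all 2^K binary vectors, the exact
    O(2^K) row wise update under an FBB prior."""
    K = len(m_minus)
    p1 = (m_minus + alpha / K) / (n_points + alpha / K)  # per-feature predictive
    vectors = np.array(list(itertools.product((0, 1), repeat=K)))
    log_w = np.array([np.log(np.where(v == 1, p1, 1.0 - p1)).sum()
                      + log_lik(x, v, theta) for v in vectors])
    w = np.exp(log_w - log_w.max())
    w /= w.sum()                                         # normalize probabilities
    return vectors[rng.choice(len(vectors), p=w)]        # sample a vector index

theta = np.array([1.0, -1.0])
log_lik = lambda x, z, theta: -0.5 * (x - z @ theta) ** 2
z = row_wise_gibbs_update(x=1.0, m_minus=np.array([3, 1]), theta=theta,
                          alpha=1.0, n_points=10, log_lik=log_lik,
                          rng=np.random.default_rng(0))
print(z)
```

The `vectors` array has 2^K rows, which makes the exponential cost of this update concrete.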

2.7 Particle Gibbs updates

Figure 1: Illustration of the PG update procedure. Note we suppress the random ordering of features defined by σ for clarity. (top) Select a row for the update, shown in grey. (middle) Run a conditional particle filter and sample a new row. (bottom) Update the row with the sample from the particle filter, shown in grey. The stars (*) indicate values of the test path. We discuss how these values can be selected in Section 3.2.4.

We now describe how to sample from p(z_n | Z_{-n}, θ, X) with computational complexity linear in K by using the Particle Gibbs (PG) methodology (Andrieu et al., 2010). PG sampling is a form of Sequential Monte Carlo (SMC) sampling (Doucet and Johansen, 2009). Like all SMC algorithms, the PG approach proceeds by approximating a sequence of distributions using a set of interacting particles. Resampling is periodically used to prune particles which, informally, are exploring low probability regions. There are three key quantities that need to be defined when constructing an SMC sampler:

  1. The sequence of target distributions {γ_t}_{t=1}^{T} used to weigh the particles at each time step.

  2. The sequence of proposal distributions {q_t}_{t=1}^{T} used to extend particles between time steps.

  3. The resampling distribution r(a | w̃).

The key difference between PG and other SMC approaches is that we are updating a set of variables which have already been instantiated. We would like to do this in a way that leads to a valid kernel targeting the conditional distribution p(z_n | Z_{-n}, θ, X). To accomplish this we need to include a conditional path, that is a particle trajectory which follows the sequence of choices required to generate the value held before the update. This trajectory will always be included after the resampling steps. Thus the resampling step is conditional on including the particle representing this trajectory. Intuitively this forces the sampler to explore regions of the space around the existing value. To simplify the bookkeeping and algorithm implementation we always assume the first particle is the conditional path. The PG sampler is still valid when this is done, as shown by Chopin et al. (2015).

SMC algorithms are commonly used for models with a natural sequential structure, such as state space models. This structure in turn identifies a natural sequence of target distributions defined on an expanding state space. Our setup is non-standard in that no natural sequential structure is defined. To sample from p(z_n | Z_{-n}, θ, X) we define a sequence of distributions which updates one entry of the feature allocation vector at each time step. Thus if there are T features to update we define a sequence of T target distributions. For the FBB we take T to be the fixed value of K and update all elements. For the IBP, T is taken to be the number of features k such that m_{-n,k} > 0, and we only update the corresponding feature assignments.
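Determining which columns a PG sweep visits, and in what order, can be sketched as follows (a small helper based on our reading of the text; the name and interface are our own):

```python
import numpy as np

def pg_update_schedule(Z, n, rng):
    """Return the columns the PG sweep updates for row n, in random order.

    For the IBP only features exhibited by at least one other data point are
    updated; singletons are handled by a separate move. For the FBB all K
    columns satisfy the filter whenever every feature is used elsewhere.
    """
    m_minus = Z.sum(axis=0) - Z[n]        # feature counts excluding row n
    cols = np.flatnonzero(m_minus > 0)    # non-singleton features only
    return rng.permutation(cols)          # random feature ordering sigma

Z = np.array([[1, 1, 0, 1],
              [0, 1, 1, 0],
              [0, 0, 1, 0]])
schedule = pg_update_schedule(Z, n=0, rng=np.random.default_rng(0))
print(sorted(schedule.tolist()))
```

In the example, columns 0 and 3 are singletons of row 0 and are excluded, so T = 2 for this row.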

At time step t of the algorithm the particles take values z_{1:t} ∈ {0, 1}^t, that is we record the sequence of binary decisions up to step t. In order to evaluate the likelihood term we need to set the values of the feature vector which have not yet been updated at time t. To do this we introduce an auxiliary variable which we call the test path, denoted v. We discuss and compare possible strategies for selecting v later. An illustration of the method is given in Figure 1.

We randomly order the features before each update by a permutation σ, so that at time t we sample component σ(t) of the feature allocation vector. To obtain a complete feature vector for evaluating the likelihood we define the function u_t given by Equation 4, which returns a binary vector where the first t entries (in the permuted order) are set to the sampled values and the remaining entries are set to the test path. The entries are then reordered by the inverse permutation σ^{-1}.

u_t(z_{1:t}, v) = ((z_1, …, z_t, v_{t+1}, …, v_K))_{σ^{-1}}   (4)

For the IBP the singleton entries are fixed to one and deterministically inserted when evaluating the likelihood.
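The completion map of Equation 4 can be sketched as follows (our reading of the construction; the test path entries are indexed in the permuted order, and `complete_vector` is our own name):

```python
import numpy as np

def complete_vector(z_sampled, test_path, sigma):
    """Build a full-length feature vector: the t entries sampled so far, the
    remaining entries filled from the test path v (in permuted order), then
    the random feature ordering sigma is undone."""
    t = len(z_sampled)
    permuted = np.concatenate([z_sampled, test_path[t:]])
    z = np.empty_like(permuted)
    z[sigma] = permuted   # the value sampled at step s belongs to column sigma[s]
    return z

sigma = np.array([2, 0, 1])   # step 1 updates feature 2, step 2 feature 0, ...
z = complete_vector(np.array([1]), test_path=np.array([0, 0, 1]), sigma=sigma)
print(z)
```

This completed vector is what gets passed to the likelihood at every intermediate step of the particle filter.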

We use the sequence of target distributions defined in Equation 5. We have γ_T(z_{1:T}) ∝ p(z_n | Z_{-n}, θ, X), that is, the target density at the final iteration is proportional to the density of the distribution of interest. This is the key constraint required to define a valid sequence of target distributions.

γ_t(z_{1:t}) = [∏_{s=1}^{t} p(z_s | z_{-n,σ(s)})] p(x_n | u_t(z_{1:t}, v), θ)   (5)

The second component we need for our algorithm is a sequence of proposal distributions. Here we exploit the fact that our proposal space is {0, 1} and use the fully adapted proposal kernel defined in Equations 6 and 7. We use z_{1:t-1} ⊕ z to denote the concatenation of z to z_{1:t-1}.

q_1(z) = γ_1(z) / Σ_{z' ∈ {0,1}} γ_1(z')   (6)
q_t(z | z_{1:t-1}) = γ_t(z_{1:t-1} ⊕ z) / Σ_{z' ∈ {0,1}} γ_t(z_{1:t-1} ⊕ z')   (7)

Given our choice of proposal and target distributions, the incremental weight functions are defined by Equations 8 and 9. To reduce computational overhead, γ_{t-1}(z_{1:t-1}) can be cached to avoid re-evaluation of the likelihood term in the denominator of Equation 9.

w_1(z_1) = Σ_{z ∈ {0,1}} γ_1(z)   (8)
w_t(z_{1:t}) = [Σ_{z ∈ {0,1}} γ_t(z_{1:t-1} ⊕ z)] / γ_{t-1}(z_{1:t-1})   (9)

The final component we need to define our PG algorithm is a resampling distribution. For simplicity we use multinomial resampling; however, more sophisticated approaches such as stratified resampling could also be used. Our resampling distribution deterministically includes the conditional path, which we arbitrarily assign to particle index 1. The conditional multinomial resampling distribution is given by Equation 10, where a is the vector of ancestor indices, w̃ the vector of normalized particle weights and P the number of particles.

r(a | w̃) = 𝟙(a_1 = 1) ∏_{i=2}^{P} w̃_{a_i}   (10)
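The conditional resampling step described by Equation 10 can be sketched as follows (indices are 0-based in code; the function name is our own):

```python
import numpy as np

def conditional_multinomial_resample(weights, rng):
    """Conditional multinomial resampling: the conditional path occupies the
    first particle slot, so its ancestor index is fixed, and the remaining
    P - 1 ancestors are drawn i.i.d. from the normalized weights."""
    P = len(weights)
    w = np.asarray(weights, dtype=float)
    w /= w.sum()                                  # normalize the weights
    ancestors = np.empty(P, dtype=int)
    ancestors[0] = 0                              # always keep the conditional path
    ancestors[1:] = rng.choice(P, size=P - 1, p=w)
    return ancestors

a = conditional_multinomial_resample([0.1, 0.4, 0.5], np.random.default_rng(0))
print(a)
```

Fixing the first ancestor is exactly what makes the move a valid conditional SMC step rather than a plain particle filter.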
1:function ParticleGibbsUpdate(x, z, Z_{-n}, θ)
2:      Set conditional path
3:     for  do
4:          First particle of each generation matches conditional path
5:     end for
6:     for  do Initialize unconditional particles
7:         
8:         
9:     end for
10:     for  do Initialize incremental importance weights
11:         
12:     end for
13:     for  do Compute normalized weights
14:         
15:     end for
16:     for  do
17:         if  then Resample only if the relative ESS below threshold
18:               Conditional resampling
19:               Reset incremental weights to one
20:         else
21:               Resampling skipped set to identity map
22:         end if
23:         for  do Propose new feature usage for feature
24:              
25:              
26:         end for
27:         for  do
28:               Update incremental importance weights
29:         end for
30:         for  do
31:               Compute normalized weights
32:         end for
33:     end for
34:      Sample updated feature allocation
35:      Reorder sampled feature allocation vector by inverse of
36:     return
37:end function
Algorithm 3 Sample a row of the feature allocation using the particle Gibbs update.

2.8 Annealed target distributions

One potential pitfall of the target distribution is that, due to the correlation among features, it can be difficult to change a feature from its current value. This will be particularly acute if there is a need to move through a low probability configuration. A simple strategy to mitigate this is to consider a different family of target distributions, defined in Equation 11, which anneals the likelihood.

γ_t^{a}(z_{1:t}) = [∏_{s=1}^{t} p(z_s | z_{-n,σ(s)})] p(x_n | u_t(z_{1:t}, v), θ)^{(t/T)^{a}}   (11)

It can easily be checked that γ_T^{a}(z_{1:T}) ∝ p(z_n | Z_{-n}, θ, X), so this sequence of target densities does indeed target the correct distribution. Also note that the original sequence of densities is recovered if a = 0.
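One annealing schedule consistent with the description above raises the likelihood to the power (t/T)^a, which reaches 1 at the final step; this is our reading of the construction, sketched as:

```python
def annealing_exponent(t, n_steps, a=1.0):
    """Likelihood exponent for the annealed target at step t of n_steps: it
    rises to 1 at the final step, so the last annealed target coincides with
    the distribution of interest, and a = 0 recovers the un-annealed sequence."""
    return (t / n_steps) ** a

print([annealing_exponent(t, 4) for t in range(1, 5)])         # ramps up to 1.0
print([annealing_exponent(t, 4, a=0.0) for t in range(1, 5)])  # constant at 1.0
```

Larger annealing powers flatten the early targets more aggressively, making it easier to move away from the current configuration in the first steps of the sweep.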

2.9 Discrete particle filtering

SMC algorithms are known to be inefficient in cases where the target distribution is discrete. This is due to the computation and storage of redundant particles. Fearnhead and Clifford (2003) addressed this problem by designing an SMC approach tailored to discrete state spaces. The key difference is that their approach deterministically expands each particle to test all available extensions, which is possible due to the discrete nature of the space. In order to avoid storing an exponentially expanding system of particles, they introduced an approach that deterministically keeps particles with high weights while resampling from those with low weights. Their approach guarantees that no more than P|S| particles are created at each step, where S is the per-step discrete state space (here |S| = 2) and P is a user specified value. Whiteley et al. (2010) later showed that this approach could be adapted to the Particle Gibbs framework. We refer to this approach as the discrete particle filter (DPF). In practice we use a slightly different version which was proposed by Barembruch et al. (2009). This version sets the expected number of particles to P instead of fixing it at exactly P. We have found this implementation to be more stable numerically. The resampling procedure is outlined in Algorithm 4.
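The Fearnhead and Clifford scheme hinges on a weight threshold c solving Σ_i min(1, w_i / c) = P: particles with w_i ≥ c are kept deterministically and the rest are resampled. Finding that root can be sketched with a simple bisection (the solver choice is our own; Algorithm 4's findRoot may differ):

```python
import numpy as np

def find_dpf_threshold(w, target):
    """Solve sum_i min(1, w_i / c) = target for c. The left side is
    non-increasing in c, so bisection on [0, sum(w)] converges."""
    w = np.asarray(w, dtype=float)
    lo, hi = 0.0, w.sum()
    for _ in range(100):                        # bisect to machine precision
        c = 0.5 * (lo + hi)
        n = np.minimum(1.0, w / c).sum()        # expected particle count at c
        if n > target:
            lo = c                              # too many survivors: raise c
        else:
            hi = c
    return c

w = np.array([0.5, 0.3, 0.1, 0.1])
c = find_dpf_threshold(w, target=3)
print(round(c, 6))
```

With these weights the threshold is c = 0.2: the two largest weights are retained exactly and the two small ones are resampled with probability w_i / c each.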

There are no proposal densities in the DPF algorithm, so we obtain a different set of weight functions from the PG algorithm, given by Equations 12 and 13. As for the PG algorithm, it is useful to cache γ_{t-1}(z_{1:t-1}) to avoid re-evaluation of the likelihood term in the denominator. When using annealing the corresponding target densities are substituted in the weight functions. The full details of the DPF are given in Algorithm 5. Again, to simplify the bookkeeping, our proposed algorithm always assigns the conditional path to the first particle, and this is enforced during resampling.

w_1(z_1) = γ_1(z_1)   (12)
w_t(z_{1:t}) = γ_t(z_{1:t}) / γ_{t-1}(z_{1:t-1})   (13)
1:function ResampleDPF(w, , )
2:      findRoot() Find unique root to equation on right
3:      First index is conditional path
4:      Initialize number of retained particles
5:     if  then
6:         
7:     else
8:         
9:     end if
10:     for  do
11:         if  then Retain particles with large weights
12:              
13:              
14:              
15:         else Resample particles with small weights
16:              
17:              if  then
18:                  
19:                  
20:                  
21:              end if
22:         end if
23:     end for
24:     for  do
25:          Normalize new weights
26:     end for
27:     return , ,
28:end function
Algorithm 4 Conditional resampling for DPF.
1:function DiscreteParticleFilter(, , , , )
2:      Set conditional path
3:     for  do
4:          First particle of each generation matches conditional path
5:          Expand conditional path with alternate feature value for time
6:     end for
7:      Initialize number of particles
8:     for  do
9:          Initialize incremental importance weights
10:     end for
11:     for  do
12:          Compute normalized weights
13:     end for
14:     for  do
15:         if  then Check if there are too many particles
16:              , , ResampleDPF(, , ) Resample using Algorithm 4
17:         else
18:               Set ancestor indices to identity map
19:         end if
20:          Track number of particles
21:         for  do
22:              for  do
23:                  
24:                   Expand unconditional particles
25:              end for
26:         end for
27:          Update number of particles
28:         for  do
29:               Update incremental importance weights
30:         end for
31:         for  do
32:               Compute normalized weights
33:         end for
34:     end for
35:      Sample updated feature allocation
36:      Reorder sampled feature allocation vector by inverse of
37:     return
38:end function
Algorithm 5 Sample a row of the feature allocation using the discrete particle filter update.

3 Results

We first demonstrate the potential slow mixing of the standard Gibbs sampler on a toy dataset and illustrate how the row wise Gibbs updates can alleviate this problem. Next we explore how to tune the parameters of the PG and DPF samplers. We then compare the behaviour of the Gibbs sampler and our proposed methods on a number of synthetic datasets.

We have compared the performance of the Gibbs, Row Gibbs (RG), Particle Gibbs (PG) and Discrete Particle Filter (DPF) samplers using three models. The first model we tested was the Linear Gaussian (LG) model, which has been widely used in the IBP literature (Griffiths and Ghahramani, 2011). The second model we considered was the Latent Feature Relational Model (LFRM) proposed by Miller et al. (2009). The final model we consider is a modified version of the PyClone model used for inferring population structure from admixed data in cancer genomics (Roth et al., 2014). The original PyClone model clusters sets of mutations which appear in a similar proportion of cells. We have modified this model to use feature allocations to indicate which cell populations have each mutation. Full details of the models and the updates used for parameters are given in Section 5.1.

When comparing methods we applied the Friedman test to see if there were any significant differences in performance between the methods (p-value threshold 0.001). If the Friedman test was significant we then applied the post-hoc Nemenyi test with a Bonferroni correction to all pairs of methods to determine which showed significantly different performance from each other (p-value threshold 0.001) (Demšar, 2006). All statements of significance are with respect to this test. Because the samplers have different computational complexity per iteration, we report the results using wall clock time instead of iteration number. This ensures a fair comparison, as, for example, we can perform many more updates with the Gibbs sampler than with the PG sampler in a given time interval. We report the relative log density when comparing methods to better represent how far away from convergence the samplers are. Let L̂ be the log density of the data under the true parameters used for simulation and L the observed log density. The relative log density is then given by (L − L̂) / |L̂|.

Code implementing the samplers and models is available online at https://github.com/aroth85/pgfa. All experiments were done using version 0.2.2 of the software. Code for performing the experiments is available online at https://github.com/aroth85/pgfa_experiments.

3.1 Row updates improve mixing

Figure 2: Comparison of element wise Gibbs to row Gibbs sampler. a) Log joint probability of the samplers over time. b) Number of data points assigned to each feature over time. Lines for the Gibbs sampler are jittered away from 50 for visibility.

To illustrate the potential benefits of using row wise updates, we first consider a simple pedagogical example. We simulated data points from the linear Gaussian model with two features, setting the feature parameter values to 100 for both features, with half of the data points using the first feature and half using the second. For inference we used the FBB prior with K = 2. This prior distribution for the feature allocation heavily favours the configuration where all data points use one feature and the other is unused. Because both features have identical values, there is no difference in likelihood between a data point using one feature or the other. Thus, if a sampler is mixing efficiently it should quickly assign all data points to one feature and none to the other. We ran both the element wise Gibbs and row wise Gibbs samplers for 100 seconds, recording the value of the log joint probability and the number of data points using each feature at each iteration. We set all model parameters except the feature allocation to their true values and did not update them, in contrast to the remaining experiments, where the feature parameters are updated. We show the trace of the log joint probability in Figure 2 a). The element wise Gibbs sampler (blue) is clearly trapped in a local mode from initialization and cannot move away from the initial configuration. This is due to the need for the element wise Gibbs sampler to traverse a region of low probability to switch features: a data point must use both features or neither feature in one iteration before it can use only the other feature in the next iteration. Figure 2 b) supports this hypothesis, as we see that the number of data points using each feature never changes over the course of sampling. In contrast, the row wise Gibbs sampler (red) rapidly increases the joint probability (Figure 2 a) and moves all data points to a single feature (Figure 2 b).
This contrived example clearly illustrates the potential for slow mixing with element wise updates, and that row wise updates can solve the problem. We will see that this behaviour is a general phenomenon of the element wise Gibbs sampler, even when the initialization is not constructed to be adversarial as it is here.

3.2 Setting tuning parameters

The PG and DPF samplers have a number of tuning parameters which affect performance. We explored the impact these parameters have using synthetic data generated from the LG model. We generated four datasets and four sets of initial parameter values. For all combinations of datasets and initial parameters we performed five random restarts of the sampler. Thus we executed 80 chains for each method considered, all with the same data and parameter initializations. Data was simulated from the LG model using the FBB prior, with parameters chosen to generate datasets where we would expect the sampler to converge to the true parameter values used for simulation. We randomly assigned 10% of the data matrix to be missing and used these entries to compute the root mean square reconstruction error (RMSE). For each experiment we varied a single tuning parameter, setting the remaining parameters to default values, namely an annealing power of 1.0, 20 particles, a resampling threshold of 0.5, and a test path consisting of a vector of zeros.

3.2.1 Number of particles

The first parameter we explore is the number of particles. For standard SMC algorithms a large number of particles is typically used, as this parameter ultimately controls the quality of the Monte Carlo approximation. In contrast, the Particle Gibbs framework is less sensitive to the number of particles. This is primarily because many conditional SMC (cSMC) moves can be performed within a sampling run, in contrast to the one-shot approach of SMC. For our particular problem the cSMC runs also tend to be short, because we rarely expect very large numbers of features to be used in the model. As a result, our updates are also less sensitive to path degeneracy.

We benchmarked the PG and DPF algorithms using varying numbers of particles (Figures S5-S7). Both algorithms appear to be relatively insensitive to the number of particles used. Using 50 or 100 particles leads to significantly worse performance for both the PG (Tables S1 and S2) and DPF samplers (Tables S3 and S4) than using fewer particles after the algorithms have run for 10 seconds. However, after 1000 seconds there were no significant differences between runs with different numbers of particles for either method. One surprising feature is that runs using as few as two particles still work well. We caution that this observation may not hold for other models or larger numbers of features, where more particles may be required. We also note that it is possible to parallelize these samplers across particles, which would allow more particles to be used, though we did not investigate this.

3.2.2 Resampling threshold

We use an adaptive resampling scheme for the PG algorithm, whereby resampling only occurs if the relative effective sample size (ESS) falls below a specified threshold. The DPF algorithm does not require this tuning parameter as its resampling mechanism is deterministic. Figures S8 and S9 show the results of the benchmark experiment. The performance of the PG algorithm was insensitive to the value of this parameter, with the exception of a threshold of 1.0, which corresponds to always resampling. Always resampling performed significantly worse (Tables S5 and S6) than several other thresholds at all time points. Somewhat surprisingly, the sampler still performed well with a resampling threshold of 0.0, that is, never resampling.
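The adaptive scheme can be sketched as below. This is the standard relative-ESS trigger with multinomial resampling, which we assume for illustration; the paper does not specify its resampling variant here.

```python
import numpy as np

def relative_ess(log_weights):
    # Relative effective sample size in (0, 1] of normalized weights.
    w = np.exp(log_weights - np.max(log_weights))
    w /= w.sum()
    return 1.0 / (len(w) * np.sum(w ** 2))

def maybe_resample(particles, log_weights, threshold, rng):
    # Resample only when the relative ESS drops below the threshold.
    # A threshold near 1.0 resamples at (almost) every step; 0.0 never does.
    if relative_ess(log_weights) < threshold:
        w = np.exp(log_weights - np.max(log_weights))
        w /= w.sum()
        idx = rng.choice(len(particles), size=len(particles), p=w)
        particles = [particles[i] for i in idx]
        log_weights = np.zeros(len(particles))  # weights reset after resampling
    return particles, log_weights
```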

3.2.3 Annealing power

As discussed in the methods, we can use a sequence of target distributions which anneals the data likelihood. In principle this allows the method to avoid prematurely resampling away particles with low data likelihood at early stages. We explore the impact of the annealing parameter in Figures S10-S12. The PG sampler using no annealing, that is setting the power to zero, performed significantly worse (Tables S7 and S8). The actual value of the annealing power seemed to be less important, provided it was larger than zero. The DPF sampler was generally insensitive to this parameter. The only significant difference observed was between powers of 0.0 and 3.0 after 10 seconds (Tables S9 and S10), and this difference disappears at later times. This is likely because all possible paths from early time steps are retained by the DPF sampler, and are not resampled away.
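As a sketch of one such scheme (the exact tempering schedule is defined in the methods, not reproduced here, so this is an assumption for illustration): temper the likelihood with exponent (t/T)**power, so the exponent grows from near 0 to 1 over the K SMC steps, and a power of 0 keeps the full likelihood at every step, i.e. no annealing.

```python
def annealed_logweight(loglik_t, loglik_prev, t, T, power):
    # Incremental log weight at SMC step t of T when proposing from the
    # prior and raising the data likelihood to the power (t/T)**power.
    # power = 0 gives exponent 1 at every step (no annealing); larger
    # powers judge particles more gently at early steps.
    beta_t = (t / T) ** power
    beta_prev = ((t - 1) / T) ** power if t > 1 else 0.0
    return beta_t * loglik_t - beta_prev * loglik_prev
```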

3.2.4 Test path

In order to evaluate the data likelihood term in the target distributions, we must instantiate the values of the feature allocation vector that have not been updated yet. We consider several strategies for doing so:

  • Conditional: Use the value of the conditional path.

  • Ones: Set the value of all features to one.

  • Random: Draw the value of the feature vector uniformly at random.

  • Two stage: Run an unconditional SMC sampler using the conditional path to draw a test path.

  • Unconditional: Similar to two stage but using zeros as the test path for the first pass unconditional SMC.

  • Zeros: Set the value of all features to zero.

The Conditional and Two Stage strategies do not lead to valid Gibbs updates due to the dependency on the conditional path. However, we include them in this analysis as they could be used during a burnin phase. After burnin, another strategy which does lead to a valid Gibbs update could be used. The Two Stage and Unconditional strategies both use a pilot run of unconditional SMC. This increases run time, and introduces additional tuning parameters. For the purpose of this experiment, we set the tuning parameters of both the SMC and cSMC passes to the same values.
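The simpler strategies above (those not requiring a pilot SMC run) can be sketched as follows; the function name and signature are our own illustration, not the paper's code.

```python
import numpy as np

def make_test_path(strategy, K, conditional_path=None, rng=None):
    # Instantiate the not-yet-updated entries of the feature vector.
    # The Two stage and Unconditional strategies require a pilot
    # unconditional SMC run and are not sketched here.
    if strategy == "zeros":
        return np.zeros(K, dtype=int)
    if strategy == "ones":
        return np.ones(K, dtype=int)
    if strategy == "random":
        return rng.integers(0, 2, size=K)
    if strategy == "conditional":
        return np.asarray(conditional_path, dtype=int).copy()
    raise ValueError(f"unknown strategy: {strategy}")
```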

Figures S13-S15 show the results of the experiment. For the PG sampler the Ones and Random test paths performed significantly worse than the other approaches. At early time points the Conditional and Zeros strategies were the best, but at later time points the Two Stage and Unconditional approaches were not significantly worse (Tables S11 and S12). For the DPF algorithm the Conditional, Random, and Zeros methods significantly outperformed the other approaches after 10 seconds (Tables S13 and S14). Both the Conditional and Zeros methods significantly outperformed the Random method at this time. At later time points no methods had significantly different performance. This result suggests that the simple approach of using a test path of zeros is effective, though there may still be some benefit to using the Conditional strategy during burnin. This experiment also suggests that the PG sampler is sensitive to this parameter, whereas the DPF sampler is quite robust.

3.2.5 Summary

Based on these results we used the following parameter values for subsequent experiments.

  • Annealing power - 1.0

  • Number of particles - 20

  • Resample threshold - 0.5

  • Test path - Zeros

These were not necessarily the optimal parameters based on the experiments, but were reasonably close to optimal. Note that we use the Zeros test path strategy to ensure we have a valid Markov chain kernel targeting the correct distribution.

3.3 Method comparison

To compare the performance of our proposed approaches to the standard Gibbs sampler we generated synthetic data from three feature allocation models. For all comparisons we ran 80 chains for each sampler as in the tuning experiments. We simulated data with parameter values which should lead to easily identifiable solutions and thus we would expect the samplers to converge to a distribution concentrated on the parameters used for simulation.

3.3.1 Linear Gaussian model

Figure 3: Performance of different samplers using synthetic data from the LG model. The box plots represent the distribution of values from 80 random starts of each parameter setting. We show the values of the relative log density (left, higher is better) and root mean square error reconstruction of missing values (right, lower is better).

We generated datasets using two sets of model parameters. The first dataset was simulated using the FBB prior and and the second was simulated using the FBB model with . We fit the second dataset using both the FBB prior with and the IBP prior. For both datasets we simulated data points from the linear Gaussian model with , , and .

The results of these experiments are shown in Figure 3 and Figures S16-S18. For the first experiment with it was computationally feasible to use the RG sampler. Because we use 20 particles for the DPF algorithm, it is equivalent to the RG sampler in this case. The RG sampler serves as the gold standard for the DPF and PG methods in this experiment. For the dataset the RG sampler significantly outperforms the Gibbs sampler, supporting the results of our initial toy data experiment (Tables S15 and S16). As expected, the DPF and RG samplers do not perform significantly differently, and both outperform the other two approaches. The PG sampler does not significantly outperform the Gibbs sampler.

For the second dataset, fitting the FBB prior model (), the DPF sampler does not significantly outperform the Gibbs sampler after 100 seconds (Tables S17 and S18). However, for longer runs the performance advantage of the DPF sampler becomes significant. At the earliest time point the Gibbs sampler significantly outperforms the PG sampler, but the situation reverses at later time points.

The results are somewhat different for the third experiment fitting the IBP model. In this case we see that the Gibbs sampler outperformed the PG sampler at early time points (Tables S19 and S20). As the samplers were run longer, the PG sampler began to outperform the Gibbs sampler. The DPF sampler outperforms both approaches. One explanation for the better performance of the Gibbs sampler over the PG sampler is that the Gibbs sampler can propose more moves to alter the dimensionality of the model in the same time period. Thus, during the burnin phase the Gibbs sampler can more efficiently move the model to the correct number of features, which improves performance. However, the fact that the DPF sampler outperforms both suggests that the ability to perform efficient updates on the non-singleton columns dominates this effect.

3.3.2 Latent Feature Relational Model

We next explored performance using the LFRM model described by Miller et al. (2009). As in the LG experiment, we generated datasets using two sets of model parameters. We fit the second dataset using both the FBB prior with and the IBP prior. We again executed 80 runs for all samplers, using the same strategy as in previous experiments. We simulated data points with parameters and from the non-symmetric LFRM model. We randomly assigned 5% of the data matrix to be missing. In addition to the relative log density, we report the reconstruction error of the model for the entire data matrix, both observed and missing values.

The results of these experiments are shown in Figures S19-S22. The RG and DPF methods significantly outperformed the other two methods for the K=5 experiment in terms of relative log density (Tables S21 and S22). However, the difference in reconstruction error was not significant. There was no significant difference between samplers for the other two runs (Tables S23-S26).

3.3.3 PyClone model

Figure 4: Performance of different samplers using synthetic data from the PyClone model. The box plots represent the distribution of values from 80 random starts of each parameter setting. We show the values of the relative log density (right) and B-Cubed F-measure (left).

The final model we tested with was a modified version of the PyClone model described in Roth et al. (2014). The modifications are described in the supplement. We simulated three datasets using the FBB prior with , with data points. For the first dataset we set and , the second we set and and the third and . We did not fit the model using the IBP prior as we could not develop an efficient proposal for updating singleton entries. In addition to the relative log density we computed the B-Cubed F-Measure (Amigó et al., 2009). The B-Cubed metric is a measure of feature allocation accuracy analogous to the V-Measure metric (Rosenberg and Hirschberg, 2007) used to evaluate clustering algorithms. We focused on feature allocation accuracy as the features are interpretable quantities that we wish to infer in this application.
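As a sketch of the accuracy metric, the B-Cubed F-measure for hard clusterings is shown below; the extended version for overlapping feature allocations in Amigó et al. (2009) reduces to this when every item has exactly one feature, and we assume that simple case here for illustration.

```python
import numpy as np

def bcubed_fmeasure(pred, true):
    # B-Cubed precision/recall: for each item, compare the set of items
    # sharing its predicted label with the set sharing its true label,
    # then average over items and combine into an F-measure.
    pred = np.asarray(pred)
    true = np.asarray(true)
    precisions, recalls = [], []
    for i in range(len(pred)):
        same_pred = pred == pred[i]
        same_true = true == true[i]
        both = np.sum(same_pred & same_true)
        precisions.append(both / np.sum(same_pred))
        recalls.append(both / np.sum(same_true))
    p, r = np.mean(precisions), np.mean(recalls)
    return 2 * p * r / (p + r)
```

Note the metric is invariant to permutations of the labels, which is what makes it suitable for evaluating inferred allocations against the truth.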

The results of the experiments are shown in Figure 4 and Figures S23-S25. Note that we exclude the PG method from these figures as its performance was so poor as to obscure the scales of the plots. For the first dataset the RG and DPF samplers both outperformed the other approaches (Tables S27 and S28). The PG approach performed significantly worse than all other approaches, including the Gibbs sampler. The two other datasets presented similar trends, with the DPF sampler outperforming the other approaches and the PG sampler performing the worst (Tables S29-S32). The performance of the Gibbs sampler did not improve from 1000 to 10000 seconds. Both the PG and DPF samplers show improved performance as sampling is run for longer. This suggests that the Gibbs sampler is potentially trapped in the vicinity of a local optimum, which it cannot escape.

4 Discussion

In this work we have developed several methods for updating an entire row of a feature allocation matrix. Our results suggest that such samplers can significantly improve performance compared to the widely used single entry Gibbs sampler. Directly implementing row wise Gibbs updates is intractable for more than a small number of features due to the exponential number of feature allocations. We overcome this limitation by using the PG methodology to develop an algorithm which scales linearly in the number of features. When coupled with the DPF framework we obtain significantly better performance than the standard Gibbs sampler. The use of the DPF framework appears to be critical, as the standard PG sampler did not always perform well. In particular, the performance of the PG sampler when applied to the PyClone model was significantly worse than the standard Gibbs sampler. However, the DPF approach significantly outperformed both the Gibbs and PG methods. Furthermore, when applied to models such as the LFRM where the standard Gibbs sampler performs well, our approach does not perform significantly worse. This suggests that despite the increased computational complexity of the PG framework, there is little downside to employing this approach. Taken together our results suggest the DPF algorithm is a computationally efficient and generally applicable approach for performing Bayesian inference for feature allocation models. Our algorithm is applicable to both the parametric FBB and non-parametric IBP model.

We have focused on developing row wise updates for the feature allocation matrix. When applied to the parametric FBB model these updates can significantly improve performance. However, when applied to the non-parametric models using the IBP prior we did not see consistent improvement. We believe a major problem in the non-parametric regime is the updates for the singleton features. The most general approach of using MH updates with proposals from the prior seems to lead to very slow mixing. While this is an issue for the Gibbs sampler as well, the low computational cost of updating non-singleton entries allows that sampler to perform more singleton updates. We believe that the development of efficient schemes to update the columns in a single move will be particularly useful. This has already been explored to some extent in Fox et al. (2014), where split-merge moves are used as proposals for an MH update. It should be possible to further improve upon these split-merge style moves using the PG framework, in a similar way to what has been done in the Bayesian clustering literature (Bouchard-Côté et al., 2017). Such updates would complement the approach we have developed in this work.

We have not exploited the potential for performing parallel computation that is offered by the PG framework. In particular we could parallelize any loops over particles in Algorithm 3 which could potentially yield significant speed-ups. It has been noted by Whiteley et al. (2010) and Lindsten et al. (2014) that the use of backward or ancestor sampling can significantly reduce the effect of path degeneracy for SMC models. These approaches could naturally be combined with our method, and could allow for the use of fewer particles.

References

  • Amigó et al. (2009) Enrique Amigó, Julio Gonzalo, Javier Artiles, and Felisa Verdejo. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval, 12(4):461–486, 2009.
  • Andrieu et al. (2010) Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.
  • Barembruch et al. (2009) Steffen Barembruch, Aurélien Garivier, and Eric Moulines. On approximate maximum-likelihood methods for blind identification: How to cope with the curse of dimensionality. IEEE Transactions on Signal Processing, 57(11):4247–4259, 2009.
  • Bouchard-Côté et al. (2017) Alexandre Bouchard-Côté, Arnaud Doucet, and Andrew Roth. Particle Gibbs split-merge sampling for Bayesian inference in mixture models. The Journal of Machine Learning Research, 18(1):868–906, 2017.
  • Broderick et al. (2013) Tamara Broderick, Michael I Jordan, Jim Pitman, et al. Cluster and feature modeling from combinatorial stochastic processes. Statistical Science, 28(3):289–312, 2013.
  • Chopin et al. (2015) Nicolas Chopin, Sumeetpal S Singh, et al. On particle Gibbs sampling. Bernoulli, 21(3):1855–1883, 2015.
  • Demšar (2006) Janez Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7(Jan):1–30, 2006.
  • Doshi-Velez and Ghahramani (2009) Finale Doshi-Velez and Zoubin Ghahramani. Accelerated sampling for the Indian buffet process. In Proceedings of the 26th annual international conference on machine learning, pages 273–280. ACM, 2009.
  • Doucet and Johansen (2009) Arnaud Doucet and Adam M Johansen. A tutorial on particle filtering and smoothing: Fifteen years later. Handbook of nonlinear filtering, 12(656-704):3, 2009.
  • Fearnhead and Clifford (2003) Paul Fearnhead and Peter Clifford. On-line inference for hidden Markov models via particle filters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(4):887–899, 2003.
  • Fox et al. (2014) Emily B Fox, Michael C Hughes, Erik B Sudderth, Michael I Jordan, et al. Joint modeling of multiple time series via the beta process with application to motion capture segmentation. The Annals of Applied Statistics, 8(3):1281–1313, 2014.
  • Ghahramani and Griffiths (2006) Zoubin Ghahramani and Thomas L Griffiths. Infinite latent feature models and the Indian buffet process. In Advances in neural information processing systems, pages 475–482, 2006.
  • Griffiths and Ghahramani (2011) Thomas L Griffiths and Zoubin Ghahramani. The Indian buffet process: An introduction and review. Journal of Machine Learning Research, 12(Apr):1185–1224, 2011.
  • Lindsten et al. (2014) Fredrik Lindsten, Michael I Jordan, and Thomas B Schön. Particle Gibbs with ancestor sampling. The Journal of Machine Learning Research, 15(1):2145–2184, 2014.
  • Liu (2008) Jun S Liu. Monte Carlo strategies in scientific computing. Springer Science & Business Media, 2008.
  • Meeds et al. (2007) Edward Meeds, Zoubin Ghahramani, Radford M Neal, and Sam T Roweis. Modeling dyadic data with binary latent factors. In Advances in neural information processing systems, pages 977–984, 2007.
  • Miller et al. (2009) Kurt Miller, Michael I Jordan, and Thomas L Griffiths. Nonparametric latent feature models for link prediction. In Advances in neural information processing systems, pages 1276–1284, 2009.
  • Rosenberg and Hirschberg (2007) Andrew Rosenberg and Julia Hirschberg. V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL), pages 410–420, 2007.
  • Roth et al. (2014) Andrew Roth, Jaswinder Khattra, Damian Yap, Adrian Wan, Emma Laks, Justina Biele, Gavin Ha, Samuel Aparicio, Alexandre Bouchard-Côté, and Sohrab P Shah. PyClone: statistical inference of clonal population structure in cancer. Nature Methods, 11(4):396, 2014.
  • Whiteley et al. (2010) Nick Whiteley, Christophe Andrieu, and Arnaud Doucet. Efficient Bayesian inference for switching state-space models using discrete particle Markov chain Monte Carlo methods. arXiv preprint arXiv:1011.2437, 2010.
  • Wood and Griffiths (2007) Frank Wood and Thomas L Griffiths. Particle filtering for nonparametric Bayesian matrix factorization. In Advances in neural information processing systems, pages 1513–1520, 2007.

5 Appendix

5.1 Models

We describe the three models we used for performance comparisons. We use the notation to describe sampling from a feature allocation distribution, either the FBB or IBP priors. The number of features is implicitly determined by the number of columns of . We place a prior on the hyper-parameter and use a random walk Metropolis-Hastings kernel to update the variable.

When referring to the Normal distribution we use the mean/precision parametrization. When referring to the Gamma distribution we use the shape/rate parametrization.

5.1.1 Linear Gaussian

The linear Gaussian model has been widely used, particularly in the IBP literature (see Griffiths and Ghahramani (2011) for example). One reason for the model's popularity is that it is possible to marginalize the feature parameters, so a collapsed sampler can be developed. In this work we do not exploit this, and instead work with the uncollapsed model. The hierarchical model is as follows:

We use Gibbs kernels to update , and . When using the IBP prior we use a collapsed Metropolis-Hastings step to update the singletons (Doshi-Velez and Ghahramani, 2009).
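A sketch of simulating from the uncollapsed model, assuming the standard parametrization X = ZA + noise with precisions tau_a and tau_x and an independent Bernoulli(pi) stand-in for the FBB/IBP prior on Z (the exact symbols were lost in extraction, so treat the names as our own):

```python
import numpy as np

def simulate_lg(N, K, D, tau_a, tau_x, pi, rng):
    # Z: N x K binary allocation; A: K x D Gaussian feature parameters
    # with precision tau_a; X = Z @ A plus Gaussian noise with precision tau_x.
    Z = (rng.random((N, K)) < pi).astype(int)
    A = rng.normal(0.0, 1.0 / np.sqrt(tau_a), size=(K, D))
    X = Z @ A + rng.normal(0.0, 1.0 / np.sqrt(tau_x), size=(N, D))
    return Z, A, X
```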

5.1.2 Latent Feature Relational Model

The LFRM model was proposed by Miller et al. (2009). The observed data is a binary matrix which encodes interactions between entities. It could for example be used to model relationships on a social network. The model posits that an underlying set of features encoded by governs whether the entries in are on or off. The hierarchical model is as follows:

where . Note that the model can be symmetric so that or non-symmetric by letting these parameters vary independently. We use random walk Metropolis-Hastings kernels to update and . When using the IBP prior we use a Metropolis-Hastings kernel with proposals from the prior to update the singletons.
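The link probabilities in this model can be sketched as below, assuming the standard LFRM form P(y_ij = 1) = sigmoid(z_i^T W z_j) from Miller et al. (2009), where W[k, l] weights the interaction between features k and l; the symmetric variant ties W to its transpose.

```python
import numpy as np

def lfrm_link_probs(Z, W):
    # Z: N x K binary feature allocation; W: K x K interaction weights.
    # Returns the N x N matrix of link probabilities sigmoid(z_i^T W z_j).
    logits = Z @ W @ Z.T
    return 1.0 / (1.0 + np.exp(-logits))
```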

5.2 PyClone

The original PyClone model was proposed by Roth et al. (2014). The model assumes we observe data which represent the number of sequencing reads without and with the mutation in sample . We refer to as the sequencing depth. We refer to the proportion of cells with mutation in sample , , as the cellular prevalence. We can model the probability of observing reads with the mutation in the samples by a density where indicates other quantities which are not relevant to the discussion. In the original PyClone model is assumed to be sampled from a Dirichlet process so that mutations appearing at similar cellular prevalences are clustered. This corresponds to the biological assumption that mutations arise within sub-populations of cells, and that the cellular prevalence is the sum of the proportions of cells in the sub-populations containing the mutation. We can alter this model to explicitly identify which sub-populations have the mutation using a feature allocation model. Let be the proportion of cells from population in sample . We use the feature allocation vector for mutation to encode which sub-populations have the mutation. The cellular prevalence is then given by . Substituting this into the observation density gives the new model. The hierarchical model is as follows:

Updating was somewhat difficult for this model so we used a number of MCMC kernels which included random walk Metropolis-Hastings kernels on either individual values or blocks. We also used a Metropolis-Hastings kernel where the proposal was a random permutation of the values for a sample. The final kernel was the Multiple-Try-Metropolis kernel (Liu, 2008).
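The feature allocation form of the cellular prevalence can be sketched as below. The binomial density is a deliberately simplified stand-in for the PyClone observation model, which additionally accounts for copy number and related quantities; variable names are our own.

```python
import math

import numpy as np

def cellular_prevalence(z, eta):
    # phi[s] = sum_k z[k] * eta[k, s]: the fraction of cells in sample s
    # carrying the mutation, given which sub-populations (features) carry
    # it (z) and the sub-population proportions per sample (eta).
    return np.asarray(z) @ np.asarray(eta)

def binomial_loglik(b, d, phi, eps=1e-6):
    # Log probability of b mutant reads out of depth d, with success
    # probability clipped away from 0 and 1 for numerical safety.
    p = np.clip(phi, eps, 1.0 - eps)
    return (math.lgamma(d + 1) - math.lgamma(b + 1) - math.lgamma(d - b + 1)
            + b * np.log(p) + (d - b) * np.log(1.0 - p))
```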

5.3 Supplementary figures

Figure S5: Trace plots of PG sampler using different number of particles. Rows are datasets and columns are initial parameter settings. Error bars are averaged over five restarts of the sampler with different random seeds.
Figure S6: Trace plots of DPF sampler using different number of particles. Rows are datasets and columns are initial parameter settings. Error bars are averaged over five restarts of the sampler with different random seeds.
Figure S7: Performance of the PG (top) and DPF (bottom) samplers as function of the number of particles. The box plots represent the distribution of values from 80 random starts of each parameter setting. We show the values of the relative log density (right) and root mean square error reconstruction of missing values (left).
Figure S8: Trace plots of PG sampler using different resampling thresholds. Rows are datasets and columns are initial parameter settings. Error bars are averaged over five restarts of the sampler with different random seeds.
Figure S9: Performance of the PG samplers as function of the resampling threshold. The box plots represent the distribution of values from 80 random starts of each parameter setting. We show the values of the relative log density (right) and root mean square error reconstruction of missing values (left).
Figure S10: Trace plots of PG sampler using different annealing powers. Rows are datasets and columns are initial parameter settings. Error bars are averaged over five restarts of the sampler with different random seeds.
Figure S11: Trace plots of DPF sampler using different annealing powers. Rows are datasets and columns are initial parameter settings. Error bars are averaged over five restarts of the sampler with different random seeds.
Figure S12: Performance of the PG (top) and DPF (bottom) samplers as function of the annealing power of the intermediate target distribution. The box plots represent the distribution of values from 80 random starts of each parameter setting. We show the values of the relative log density (right) and root mean square error reconstruction of missing values (left).
Figure S13: Trace plots of PG sampler using different test paths. Rows are datasets and columns are initial parameter settings. Error bars are averaged over five restarts of the sampler with different random seeds.
Figure S14: Trace plots of DPF sampler using different test paths. Rows are datasets and columns are initial parameter settings. Error bars are averaged over five restarts of the sampler with different random seeds.
Figure S15: Performance of the PG (top) and DPF (bottom) samplers as function of the test path used to evaluate the data likelihood. The box plots represent the distribution of values from 80 random starts of each parameter setting. We show the values of the relative log density (right) and root mean square error reconstruction of missing values (left).
Figure S16: Trace plots of sampling algorithms with simulated data from the linear Gaussian model with K=5. See main text for other model parameters. Rows are datasets and columns are initial parameter settings. Error bars are averaged over five restarts of the sampler with different random seeds.
Figure S17: Trace plots of sampling algorithms with simulated data from the linear Gaussian model with K=20. See main text for other model parameters. Rows are datasets and columns are initial parameter settings. Error bars are averaged over five restarts of the sampler with different random seeds.
Figure S18: Trace plots of sampling algorithms with simulated data from the linear Gaussian model with K=20 using an IBP prior. See main text for other model parameters. Rows are datasets and columns are initial parameter settings. Error bars are averaged over five restarts of the sampler with different random seeds.
Figure S19: Trace plots of sampling algorithms with simulated data from the non-symmetric LFRM model with K=5. See main text for other model parameters. Rows are datasets and columns are initial parameter settings. Error bars are averaged over five restarts of the sampler with different random seeds.
Figure S20: Trace plots of sampling algorithms with simulated data from the non-symmetric LFRM model with K=20. See main text for other model parameters. Rows are datasets and columns are initial parameter settings. Error bars are averaged over five restarts of the sampler with different random seeds.
Figure S21: Trace plots of sampling algorithms with simulated data from the non-symmetric LFRM model with K=20 using an IBP prior. See main text for other model parameters. Rows are datasets and columns are initial parameter settings. Error bars are averaged over five restarts of the sampler with different random seeds.
Figure S22: Performance of different samplers using synthetic data from the LFRM model. The box plots represent the distribution of values from 80 random starts of each parameter setting. We show the values of the relative log density (right) and reconstruction error (left).
Figure S23: Trace plots of sampling algorithms with simulated data from the PyClone model with D=4 and K=4. See main text for other model parameters. Rows are datasets and columns are initial parameter settings. Error bars are averaged over five restarts of the sampler with different random seeds.
Figure S24: Trace plots of sampling algorithms with simulated data from the PyClone model with D=10 and K=8. See main text for other model parameters. Rows are datasets and columns are initial parameter settings. Error bars are averaged over five restarts of the sampler with different random seeds.
Figure S25: Trace plots of sampling algorithms with simulated data from the PyClone model with D=8 and K=12. See main text for other model parameters. Rows are datasets and columns are initial parameter settings. Error bars are averaged over five restarts of the sampler with different random seeds.

5.4 Supplementary tables

Metric Relative log density RMSE
Time
10.0 0.0000 0.0000
100.0 0.0000 0.0060
1000.0 0.0000 0.0393
Table S1: Comparison of the performance of PG algorithm using different number of particles. P-Values are computed using the Friedman test. Significant values at () are indicated in bold.
Relative log density RMSE
Mean difference P-Value Mean difference P-Value
Time Param 1 Param 2
10 10 100 2.7480 0.0000 -0.5587 0.0000
2 -0.2048 0.5277 0.0359 0.9929
20 0.3871 0.0245 -0.0349 0.9845
5 -0.1429 0.6517 -0.0092 1.0000
50 1.4779 0.0000 -0.2271 0.0001
100 2 -2.9527 0.0000 0.5946 0.0000
20 -2.3609 0.0000 0.5238 0.0000
5 -2.8909 0.0000 0.5495 0.0000
50 -1.2701 0.0477 0.3316 0.0001
2 20 0.5918 0.0000 -0.0707 0.7447
5 0.0619 1.0000 -0.0450 0.9981
50 1.6826 0.0000 -0.2630 0.0000
20 5 -0.5300 0.0000 0.0257 0.9640
50 1.0908 0.0000 -0.1922 0.0031
5 50 1.6208 0.0000 -0.2179 0.0000
100 10 100 0.1149 0.0477 0.0470 NS
2 -0.0446 0.3834 0.0728 NS
20 0.0268 0.9879 0.0589 NS
5 -0.0005 1.0000 0.0379 NS
50 0.0207 0.9845 0.1246 NS
100 2 -0.1595 0.0000 0.0258 NS
20 -0.0881 0.2976 0.0119 NS
5 -0.1154 0.0218 -0.0092 NS
50 -0.0942 0.3180 0.0776 NS
2 20 0.0713 0.0720 -0.0139 NS
5 0.0441 0.5526 -0.0350 NS
50 0.0653 0.0651 0.0518 NS
20 5 -0.0272 0.9485 -0.0211 NS
50 -0.0061 1.0000 0.0657 NS
5 50 0.0212 0.9393 0.0867 NS
1000 10 100 0.0185 0.5776 -0.0345 NS
2 -0.0104 1.0000 -0.0136 NS
20 0.0113 0.5776 -0.0082 NS
5 0.0154 0.9998 0.0311 NS
50 0.0055 1.0000 -0.0106 NS
100 2 -0.0289 0.6025 0.0209 NS
20 -0.0071 1.0000 0.0263 NS
5 -0.0030 0.8071 0.0656 NS
50 -0.0129 0.6517 0.0239 NS
2 20 0.0218 0.6025 0.0054 NS
5 0.0259 0.9999 0.0447 NS
50 0.0160 1.0000 0.0030 NS
20 5 0.0041 0.8071 0.0393 NS
50 -0.0058 0.6517 -0.0024 NS
5 50 -0.0099 1.0000 -0.0417 NS
Table S2: Comparison of the performance of PG algorithm using different number of particles. P-Values are computed using the Friedman test. Significant values at () are indicated in bold.
Metric Relative log density RMSE
Time
10.0 0.0000 0.0000
100.0 0.0000 0.0000
1000.0 0.0000 0.0000
Table S3: Comparison of the performance of DPF algorithm using different number of particles. P-Values are computed using the Friedman test. Significant values at () are indicated in bold.
                              Relative log density          RMSE
Time   Param 1  Param 2   Mean difference  P-Value   Mean difference  P-Value
10     10       100          2.4108        0.0000     -0.4485         0.0000
10     10       2            0.0585        0.4064     -0.0628         0.5028
10     10       20           0.0847        1.0000      0.0521         0.9051
10     10       5           -0.0195        1.0000      0.0348         0.9290
10     10       50           1.1853        0.0000     -0.1525         0.0017
10     100      2           -2.3523        0.0000      0.3856         0.0000
10     100      20          -2.3261        0.0000      0.5006         0.0000
10     100      5           -2.4303        0.0000      0.4833         0.0000
10     100      50          -1.2254        0.0477      0.2960         0.0001
10     2        20           0.0262        0.5776      0.1149         0.0385
10     2        5           -0.0780        0.2976      0.0977         0.0477
10     2        50           1.1269        0.0000     -0.0897         0.3834
10     20       5           -0.1042        0.9995     -0.0173         1.0000
10     20       50           1.1007        0.0000     -0.2046         0.0000
10     5        50           1.2048        0.0000     -0.1873         0.0000
100    10       100          0.0153        0.7872      0.0301         0.9703
100    10       2            0.0090        0.9947     -0.1183         0.1919
100    10       20          -0.0077        1.0000     -0.0148         0.9998
100    10       5            0.0003        1.0000     -0.0283         1.0000
100    10       50           0.0028        1.0000      0.0044         1.0000
100    100      2           -0.0062        0.9879     -0.1484         0.0152
100    100      20          -0.0229        0.7664     -0.0449         0.8609
100    100      5           -0.0149        0.8768     -0.0584         0.9758
100    100      50          -0.0125        0.8440     -0.0256         0.9879
100    2        20          -0.0167        0.9929      0.1035         0.3834
100    2        5           -0.0087        0.9991      0.0900         0.1772
100    2        50          -0.0063        0.9981      0.1227         0.1379
100    20       5            0.0080        1.0000     -0.0135         0.9997
100    20       50           0.0104        1.0000      0.0193         0.9987
100    5        50           0.0024        1.0000      0.0327         1.0000
1000   10       100          0.0307        0.6025     -0.0039         1.0000
1000   10       2           -0.0022        0.9995     -0.0688         0.5776
1000   10       20           0.0036        1.0000     -0.0359         0.9845
1000   10       5           -0.0074        0.9051     -0.0555         0.4539
1000   10       50           0.0080        1.0000     -0.0355         0.6025
1000   100      2           -0.0329        0.3180     -0.0649         0.4299
1000   100      20          -0.0271        0.5776     -0.0319         0.9485
1000   100      5           -0.0381        0.0588     -0.0516         0.3180
1000   100      50          -0.0227        0.6757     -0.0316         0.4539
1000   2        20           0.0058        0.9997      0.0330         0.9640
1000   2        5           -0.0052        0.9906      0.0133         1.0000
1000   2        50           0.0102        0.9981      0.0333         1.0000
1000   20       5           -0.0110        0.9176     -0.0197         0.9176
1000   20       50           0.0044        1.0000      0.0003         0.9703
1000   5        50           0.0153        0.8609      0.0200         1.0000
Table S4: Comparison of the performance of the DPF algorithm using different numbers of particles. P-values are computed using the Friedman test. Significant values at () are indicated in bold.
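The pairwise tables above enumerate every unordered pair of candidate settings (Param 1, Param 2) and report the mean difference of a metric across datasets. The sketch below illustrates that bookkeeping on synthetic scores; the particle counts and the number of datasets are illustrative assumptions, not the paper's experimental setup.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
particle_counts = [2, 5, 10, 20, 50, 100]
# Synthetic per-dataset scores for each particle count (20 datasets each);
# the loc term just makes larger particle counts score slightly higher.
scores = {p: rng.normal(loc=np.log(p), scale=0.5, size=20)
          for p in particle_counts}

# One table row per unordered pair of settings, as in Table S4.
rows = []
for p1, p2 in itertools.combinations(particle_counts, 2):
    diff = scores[p1].mean() - scores[p2].mean()
    rows.append((p1, p2, round(diff, 4)))

for p1, p2, diff in rows:
    print(f"particles {p1:>3} vs {p2:>3}: mean difference = {diff:+.4f}")
```

With six settings this produces 15 pairwise rows per time point, matching the row count of each Time block in Tables S2 and S4.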
Time     Relative log density    RMSE
10.0     0.0000                  0.0000
100.0    0.0000                  0.0581
1000.0   0.0000                  0.0001
Table S5: Comparison of the performance of the PG algorithm using different resampling thresholds. P-values are computed using the Friedman test. Significant values at () are indicated in bold.
                              Relative log density          RMSE
Time   Param 1  Param 2   Mean difference  P-Value   Mean difference  P-Value
10     0.0      0.25        -0.0913        0.9846     -0.0222         0.9740
10     0.0      0.5         -0.0146        1.0000     -0.0581         0.3502
10     0.0      0.75        -0.0197        1.0000     -0.0552         0.7304
10     0.0      1.0          0.5900        0.0000     -0.1732         0.0000
10     0.25     0.5          0.0767        0.9885     -0.0359         0.8245
10     0.25     0.75         0.0716        0.9671     -0.0330         0.9885
10     0.25     1.0          0.6813        0.0000     -0.1510         0.0006
10     0.5      0.75        -0.0051        1.0000      0.0029         0.9916
10     0.5      1.0          0.6047        0.0000     -0.1152         0.0467
10     0.75     1.0          0.6097        0.0000     -0.1180         0.0070
100    0.0      0.25         0.0173        0.7797     -0.0516         NS
100    0.0      0.5          0.0330        0.5100     -0.0577         NS
100    0.0      0.75         0.0300        0.8643     -0.0361         NS
100    0.0      1.0          0.1700        0.0000     -0.1385         NS
100    0.25     0.5          0.0158        0.9983     -0.0062         NS
100    0.25     0.75         0.0127        1.0000      0.0155         NS
100    0.25     1.0          0.1528        0.0001     -0.0869         NS
100    0.5      0.75        -0.0030        0.9916      0.0217         NS
100    0.5      1.0          0.1370        0.0008     -0.0807         NS
100    0.75     1.0          0.1400        0.0001     -0.1024         NS
1000   0.0      0.25         0.0507        0.0414     -0.0787         0.2196
1000   0.0      0.5          0.0312        0.4010     -0.0618         0.7797
1000   0.0      0.75         0.0466        0.0664     -0.0985         0.0527
1000   0.0      1.0          0.1380        0.0000     -0.1780         0.0017
1000   0.25     0.5         -0.0195        0.9134      0.0168         0.9390
1000   0.25     0.75        -0.0041        1.0000     -0.0198         0.9916
1000   0.25     1.0          0.0873        0.0020     -0.0993         0.5947
1000   0.5      0.75         0.0154        0.9590     -0.0366         0.6505
1000   0.5      1.0          0.1067        0.0000     -0.1161         0.1139
1000   0.75     1.0          0.0913        0.0010     -0.0795         0.9134
Table S6: Comparison of the performance of the PG algorithm using different resampling thresholds. P-values are computed using the Friedman test. Significant values at () are indicated in bold.
Time     Relative log density    RMSE
10.0     0.0000                  0.0000
100.0    0.0000                  0.0000
1000.0   0.0000                  0.0000
Table S7: Comparison of the performance of the PG algorithm using different annealing powers. P-values are computed using the Friedman test. Significant values at () are indicated in bold.
                              Relative log density          RMSE
Time   Param 1  Param 2   Mean difference  P-Value   Mean difference  P-Value
10     0.0      1.0         -0.8443        0.0000      0.1910         0.0051
10     0.0      2.0         -0.9463        0.0000      0.2878         0.0000
10     0.0      3.0         -0.8237        0.0000      0.2481         0.0000
10     1.0      2.0         -0.1019        0.0703      0.0968         0.0264
10     1.0      3.0          0.0206        0.9565      0.0571         0.1991
10     2.0      3.0          0.1226        0.3172     -0.0397         0.9307
100    0.0      1.0         -0.2557        0.0000      0.1933         0.0003
100    0.0      2.0         -0.3468        0.0000      0.2685         0.0000
100    0.0      3.0         -0.3332        0.0000      0.2875         0.0000
100    1.0      2.0         -0.0911        0.0020      0.0752         0.5298
100    1.0      3.0         -0.0775        0.0617      0.0941         0.1448
100    2.0      3.0          0.0136        0.8319      0.0189         0.9446
1000   0.0      1.0         -0.1944        0.0000      0.1463         0.0703
1000   0.0      2.0         -0.2415        0.0000      0.2825         0.0000
1000   0.0      3.0         -0.2380        0.0000      0.3128         0.0000
1000   1.0      2.0         -0.0471        0.1615      0.1362         0.0141
1000   1.0      3.0         -0.0436        0.7219      0.1666         0.0006
1000   2.0      3.0          0.0034        0.8555      0.0303         0.9148
Table S8: Comparison of the performance of the PG algorithm using different annealing powers. P-values are computed using the Friedman test. Significant values at () are indicated in bold.
Time     Relative log density    RMSE
10.0     0.0000                  0.0028
100.0    0.0000                  0.0000
1000.0   0.0000                  0.0000
Table S9: Comparison of the performance of the DPF algorithm using different annealing powers. P-values are computed using the Friedman test. Significant values at () are indicated in bold.
                              Relative log density          RMSE
Time   Param 1  Param 2   Mean difference  P-Value   Mean difference  P-Value
10     0.0      1.0         -0.1168        0.0227      0.0676         NS
10     0.0      2.0         -0.1317        0.1023      0.0702         NS
10     0.0      3.0         -0.1392        0.0002      0.0842         NS
10     1.0      2.0         -0.0149        0.9820      0.0026         NS
10     1.0      3.0         -0.0224        0.7219      0.0166         NS
10     2.0      3.0         -0.0075        0.3735      0.0140         NS
100    0.0      1.0         -0.0108        0.9996      0.0429         0.9667
100    0.0      2.0         -0.0034        0.9946      0.0387         0.7514
100    0.0      3.0         -0.0149        0.9874      0.0622         0.4032
100    1.0      2.0          0.0074        0.9751     -0.0043         0.9820
100    1.0      3.0         -0.0041        0.9982      0.0193         0.8066
100    2.0      3.0         -0.0114        0.8970      0.0236         0.9820
1000   0.0      1.0         -0.0076        0.9999      0.0200         0.9999
1000   0.0      2.0         -0.0168        0.6912     -0.0073         0.8319
1000   0.0      3.0         -0.0191        0.9446      0.0342         0.9915
1000   1.0      2.0         -0.0093        0.7797     -0.0272         0.7514
1000   1.0      3.0         -0.0116        0.9751      0.0142         0.9982
1000   2.0      3.0         -0.0023        0.9820      0.0415         0.5624
Table S10: Comparison of the performance of the DPF algorithm using different annealing powers. P-values are computed using the Friedman test. Significant values at () are indicated in bold.
Time     Relative log density    RMSE
10.0     0.0000                  0.0000
100.0    0.0000                  0.0000
1000.0   0.0000                  0.0000
Table S11: Comparison of the performance of the PG algorithm using different test paths. P-values are computed using the Friedman test. Significant values at (