## 1 Introduction

Protein mimicry with synthetic polymers has been actively pursued for decades to meet material demands in the environmental, energy, and life sciences. Recently, Panganiban et al. (2018) and Jiang et al. (2020) reported a promising protein-mimic polymer system on the basis of random heterogeneous polymers (RHP), which comprise monomers with distinct chemical properties. Through wet-lab experiments, they found that long-term screening to optimize the monomer choices and monomer ratios can lead to RHPs with protein-like functions. RHP systems provide an attractive platform for developing polymer-based biofunctional materials.

In Jiang et al. (2020), several datasets of four-monomer RHPs were synthesized that behave like natural protein-based proton channels. The four methacrylate-based monomers were selected for chemical diversity: methyl methacrylate (MMA), 2-ethylhexyl methacrylate (EHMA), oligo(ethylene glycol) methacrylate (OEGMA), and 3-sulfopropyl methacrylate potassium salt (SPMA). MMA and EHMA are the hydrophobic monomers, while OEGMA and SPMA are the hydrophilic monomers. RAFT polymerization was used to control the ratio of the four monomers in an RHP dataset, in order to tailor the overall hydrophobicity. Different RHP datasets with varying monomer ratios were generated, and some of them were found capable of inserting into lipid bilayers and promoting transmembrane proton transport. However, RHPs differ from the natural proteins that make up such channels. A functional natural protein has certain specific amino acid sequence(s), whereas the functional performance of each RHP dataset is determined as a whole, and the chains in an RHP dataset, which are synthesized under a fixed monomer ratio, differ substantially from one another. The sequence heterogeneity in RHPs makes it challenging to directly link an individual monomer sequence to the overall performance of a polymer dataset. Studying the sequence heterogeneity under various monomer composition ratios would shed light on RHPs' functional adaptability, enabling their uniform behavior in various environments. More importantly, it offers promising potential for designing RHPs in a predictable manner.

As RHPs' local hydrophobicity levels relate to short-range interactions, each RHP chain can be studied as a concatenation of segments of various cumulative hydrophobicity. It is believed that varied segment lengths and ending positions substantially affect RHPs' chemical properties. Therefore, segmentation can reveal valuable latent structure information on RHP heterogeneity and functionality when individual chain experimentation is unavailable. The differences in cumulative hydrophobicity among segments introduce the segment heterogeneity. To analyze such heterogeneity, the segments are assigned to three states: the hydrophilic state, S1, for segments rich in the hydrophilic monomer OEGMA and the negatively charged SPMA; the hydrophobic state, S3, for those comprising mostly the hydrophobic monomers MMA and EHMA; and the intermediate state, S2, for those that cannot be simply assigned as hydrophobic or hydrophilic (amphiphilic segments). The S3 segments have the ability to span the bilayer region and provide the proton transport pathway. The S1 segments prefer the water environment. The S2 segments are believed to have short residence times in both the bilayer and water regions.
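As a toy illustration of this three-state assignment (not the model-based inference developed later in the paper), a segment could be labeled by its fraction of hydrophobic monomers; the 0.8/0.2 thresholds below are hypothetical:

```python
def classify_segment(segment):
    """Label a segment S1 (hydrophilic), S2 (intermediate), or S3 (hydrophobic)
    by its fraction of hydrophobic monomers (illustrative thresholds only)."""
    hydrophobic = {"MMA", "EHMA"}
    frac = sum(m in hydrophobic for m in segment) / len(segment)
    if frac >= 0.8:
        return "S3"
    if frac <= 0.2:
        return "S1"
    return "S2"
```

In the actual GHSMM-RHP model, segment states are latent variables inferred jointly with segment boundaries rather than computed by a fixed rule.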

Figure 1 displays a schematic illustration of RHP chains in a lipid bilayer. Note, however, that information on segmentation and spatial structure is not available in practice; we only have access to the linear sequence of four monomers in each RHP dataset.

Determining how RHPs’ performance relates to segment heterogeneity and sequence heterogeneity, rather than to a specific realization of polymer sequence, provides a promising strategy for developing biofunctional materials. The two types of heterogeneity correspond to two key factors in RHP design: (1) the segmentation of local hydrophobicity that regulates short-range interactions, and (2) the choice of overall monomer composition ratio that modulates the solubility. So far, domain knowledge is extremely limited in the field. There is also a lack of statistical study that could bridge the experimental restrictions and RHPs’ predictable performance. To exploit the statistical design of RHP and to aid its synthesis, we utilized the probabilistic graphical model framework and developed a generalized hidden semi-Markov model (GHSMM-RHP). Based on this model, we predicted the segments of varying hydrophobicity within RHP chains to address the first key factor above. To address the second factor, we statistically quantified the heterogeneity of chain distributions among different RHP datasets. GHSMM-RHP helps to extract the hidden phenomena that are otherwise hard to quantify, and utilizes efficient approximate inference in place of expensive exact computation. That is, we aim to address a joint estimation-prediction problem in RHP systems under the GHSMM-RHP model.

Specifically, GHSMM-RHP adopts the residential time design, which has fewer parameters and lower time complexity than the generic hidden semi-Markov model (Yu, 2010). GHSMM-RHP incorporates an additional layer of nodes to handle segment constraints such as the monomer positions, as outlined in Section 3. Based on the model, we developed stochastic variational methods in closed form under both frequentist and Bayesian settings. To keep the monomer dependencies within an RHP chain, we derived message-passing formulas on the graphical model of conjugate exponential families (Maathuis et al., 2018). GHSMM-RHP differs from existing efforts in its novel design that incorporates specific constraints for large RHP systems, and it provides a scalable graphical model with efficient algorithms that meets the inference demands of RHP applications.

In addition, we studied the GHSMM-RHP model through a simulation analysis, and compared estimates from the frequentist and Bayesian stochastic variational methods. In particular, we implemented stochastic variational expectation-maximization (SVEM) and stochastic variational Bayesian (SVB), and empirically addressed their estimation/prediction accuracy and computational (in)stability against hyper-parameters. We found that both methods achieved low statistical error, while SVEM exhibited greater stability against random initializations. Our work makes the first attempt to bring the two stochastic variational methods, which were studied separately before, to the same nonconvex regime. It also contributes to the current literature on the computational error of stochastic variational methods.

We further evaluated the GHSMM-RHP model using real datasets. Jiang et al. (2020) synthesized seven RHP datasets with varied compositional ratios and tested their performance. Among them, RHP with a compositional ratio of 50(MMA) : 20(EHMA) : 25(OEGMA) : 5(SPMA) was the most functionally active, which we refer to as RHP50/20. We used RHP50/20 as a reference and compared other RHPs to RHP50/20 based on the GHSMM-RHP model. The RHP datasets identified by our model as having high similarity to RHP50/20 all produced high proton transport efficiency in the lab experiment, which suggests that the GHSMM-RHP model has the potential to aid in predictable RHP performance. By bridging the gap between costly laboratory assessments and less expensive computational models, we hope to provide guidance for designing improved RHP syntheses.

The rest of the paper is organized as follows. In Section 2 we review the background of methods related to the GHSMM-RHP model. In Section 3 we describe the model that incorporates the constraints of RHP systems. In Section 4 different stochastic variational methods are derived. We implemented the algorithms and compared the results in Section 5. Section 6 concludes the work and discusses avenues for further research.

## 2 Method Background Review

### 2.1 Relevant Work on the GHSMM-RHP Model

The GHSMM-RHP model is inspired by previous attempts by Jiang et al. (2020) to use a hidden Markov model (HMM) to analyze certain restricted RHP datasets. HMMs are widely used to describe a sequence of observable events generated by a sequence of internal hidden states. They have a long, successful history in biological sequence analysis, such as gene prediction (Munch and Krogh, 2006) and modeling DNA sequencing errors (Lottaz et al., 2003).

We noticed two drawbacks of using an HMM in RHP applications. First, the hidden state length in an HMM implicitly follows a geometric distribution. However, the segment length of RHP chains has clear lower and upper cutoffs, which need to be configurable in the model to test key polymerization assumptions (Jiang et al., 2020). As suggested by Rabiner (1989), more flexible state durations need to be incorporated to improve parameter estimation and segment prediction. To overcome this issue we adopted a semi-Markov model of the underlying hidden states (Yu, 2010). Such a duration model has a more scalable graphical architecture for capturing segment heterogeneity. The other downside involves the time complexity of the inference algorithms. The widely used Baum-Welch algorithm (Baum et al., 1970) takes $O(TNKS^2)$ time under a classic HMM, and $O(TNKS(S+D))$ under the residential time model (a semi-Markov model described in Yu and Kobayashi (2003)), where $T$ is the number of iterations, $N$ is the total number of sequences, $K$ is the number of monomers in one RHP chain, $S$ is the number of hidden states, and $D$ is the span of the hidden state length distribution. The algorithms become time-consuming when training large libraries of RHP datasets. This computational issue can be alleviated by implementing stochastic variational methods, in which stochastic optimization processes only a small portion of the entire dataset in each iteration (Robbins and Monro, 1951; Bottou, 2010).

The basic idea of the variational methods used in GHSMM-RHP is to turn statistical estimation into a simpler lower-bound optimization problem with tractable approximating densities. One defines a family $\mathcal{Q}$ of distributions over the latent variables (i.e., the hidden state sequences underlying the RHP chain), each considered as a candidate approximation to the true posterior or conditional probabilities. $\mathcal{Q}$ should be large enough to reduce the approximation error, but also simple enough to allow for efficient optimization algorithms. Among the related work, Ghahramani and Beal (2001) and Winn et al. (2005) connected junction tree algorithms on graphical models with variational Bayesian posterior distributions. MacKay (1997) and Beal (2003) were among the first efforts to implement variational Bayes for classic HMMs. Hoffman et al. (2013) introduced stochastic optimization into variational Bayes, and extensions to other Bayesian time series models were reported in Johnson and Willsky (2014). Foti et al. (2014) used stochastic variational Bayes for HMMs but in a different setting: a single long sequence broken into mini-batches, rather than multiple relatively short sequences.

### 2.2 Relevant Work on Stochastic Variational Methods

Variational methods, such as variational expectation-maximization (VEM) and variational Bayes (VB, sometimes termed variational inference), provide alternative solutions to maximum likelihood estimation and Bayesian posterior approximation, respectively (Wainwright and Jordan, 2008; Blei et al., 2017). Being faster than convex relaxations and Markov chain Monte Carlo, as well as adaptable to different application models, variational methods have prospered in various large-scale approximate inference problems such as computational biology (Carbonetto et al., 2012), natural language processing (Yogatama et al., 2014), and astronomy (Regier et al., 2015). The objective function of the classic proposals (Neal and Hinton, 1998; Jordan et al., 1999) is:

$$
\mathcal{L}(q, \theta) \;=\; \mathbb{E}_{q}\big[\log p(y, z \mid \theta)\big] \;-\; \mathbb{E}_{q}\big[\log q(z)\big] \;\le\; \log p(y \mid \theta). \tag{1}
$$

Here, the likelihood function $p(y, z \mid \theta)$ can be replaced by the marginal $p(y, z)$ in the Bayesian setting, where the latent variables $z$ include the parameters $\theta$. One can choose $q$ over some proper variational distribution family $\mathcal{Q}$. Based on Equation 1, some variants include those that are scaled up by stochastic approximations (Cappé and Moulines, 2009; Hoffman et al., 2013), and those that are distinct in variational objective formulations (e.g., $\alpha$-VB (Yang et al., 2020)) or distribution families (e.g., auto-encoding VB (Kingma and Welling, 2013) and boosting VB (Guo et al., 2016)).

Most theoretical justifications for variational methods, which are still in active development, focus on the statistical properties of the global optimizer (either point estimates or posterior approximations). Examples include the consistency and asymptotic normality in stochastic block models (Bickel et al., 2013), mixture models (Westling and McCormick, 2019), or more general parametric models with mean-field assumptions (Wang and Blei, 2019); and the non-asymptotic convergence rates for latent Gaussian models (Sheth and Khardon, 2017), approximating tempered posteriors (Alquier et al., 2020), point estimates under mean-field assumptions (Pati et al., 2018), or high-dimensional and nonparametric inference (Zhang et al., 2020b).

Although the global optimum for variational methods can be achieved under some suitable settings (Awasthi and Risteski, 2015), the computational error involving the convergence behavior of iterative updates has been less studied. In fact, solutions to variational objective functions often suffer from multiple local optima, making optimization sensitive to initializations and other hyper-parameters. For example, unfavorable scenarios were constructed in the stochastic block model with futile random initializations (Mukherjee and Sarkar, 2018), and in the topic model with uninformative objective functions (Ghorbani et al., 2019). They point out cases that fail in practice due to either optimization algorithms or problem formulations.

A few studies have investigated scenarios in which the global optimum cannot be attained feasibly. For example, Zhang et al. (2020a) analyzed the computational rate of convergence for mean-field variational Bayes in stochastic block models, and Balakrishnan et al. (2017) studied EM-type iterates in broader frequentist settings. However, those results are either model-specific or conditional on proper hyper-parameters, and the additional tuning effort with pilot algorithms may not be warranted. Improving the computational error of variational methods under general initializations and model choices remains an open problem. Moreover, when applying variational methods to a broad range of model types, poor approximations can originate from the instability of stochastic optimization algorithms with many hyper-parameters, even if the variational family contains the true posterior (Dhaka et al., 2020).

Results on the accuracy and stability of stochastic variational methods remain inconclusive for a wide variety of models. Misperception of their properties will hinder reproducible scientific research and reliable decision making. In this paper, we make an attempt to empirically study the computational performance of stochastic variational methods under the GHSMM-RHP model, and to compare the performance of stochastic variational methods under the Bayesian (i.e., SVB) and frequentist (i.e., SVEM) frameworks, which were previously studied separately. The observations and conclusions developed in our study from a comparative perspective on SVB and SVEM not only provide insight into the choice between these two methods, but also contribute to the existing literature on the computational error of stochastic variational methods.

## 3 Generalized Hidden Semi-Markov Model for RHP

As described previously, each RHP dataset contains thousands of sequences randomly generated under a fixed monomer ratio. Note that we only have access to the one-dimensional primary structure, that is, the linear sequence of four monomers. We start by discussing a few sequence constraints, which are the same as those stated in Jiang et al. (2020). These constraints describe the underlying biology of functional RHPs: (1) We define S3 segments to be composed of hydrophobic monomers (i.e., MMA, EHMA) and no more than one OEGMA. If an OEGMA appears in an S3 segment, it must be at least two monomers away from both ends of the segment. (2) S2 and S3 segments are separated by hydrophilic S1 segments. (3) The negatively charged hydrophilic monomer SPMA appears only in S1.

The hydrophobic monomers in S3 segments enable the insertion of the chain into lipid bilayers, and the appearance of a single OEGMA does not compromise the overall activity if it is in the middle of an S3 segment. The length of the S3 segments depends on various factors such as chain conformation and membrane flexibility, but it should be sufficiently long to span the core of the lipid bilayer. The S1 segments have a high preference for water (high polarity of the monomers), and appear between S2 and S3 segments.

We adapted the right-censored hidden semi-Markov model in Yu (2010) to model the RHP chain while incorporating the above constraints. Here, the hidden state duration is a random variable following a categorical distribution over a predefined range. The unobserved states correspond to the three types of segments: the hydrophilic state S1, the intermediate state S2, and the hydrophobic state S3. We denote a random RHP chain as $Y = (Y_1, \dots, Y_K)$. At each position $k$, $Y_k$ takes values from the four monomers, i.e., {MMA, OEGMA, EHMA, SPMA}. $(S_k, D_k, O_k)$ are the latent variables: $S_k$ is the hidden segment state, $D_k$ is the remaining duration of the current hidden state, and $O_k$ is the location of an emitted OEGMA. Their distributions are described below.

$S_1$: The model starts with the initial state variable $S_1$, taking value in the state space {S1, S2, S3}. $S_1$ follows a categorical distribution with parameters $\pi$:

$$
P(S_1 = i) = \pi_i, \qquad i \in \{\text{S1}, \text{S2}, \text{S3}\}. \tag{2}
$$

$(D_1, O_1)$: The latent variables $(D_1, O_1)$ depend on $S_1$. (1) If $S_1 = \text{S3}$, then the duration $D_1$ takes values between $D_{\min}$ and $D_{\max}$, the predefined shortest and longest segment lengths. The OEGMA location $O_1$ takes values between 3 and $D_1 - 2$, meaning the OEGMA shall be at least two monomers away from both ends of the segment. $O_1$ may also be 0, meaning there is no OEGMA in the current segment. This support is denoted as $\mathcal{X}_{\text{S3}}$. The conditional probability $P(D_1, O_1 \mid S_1 = \text{S3})$ follows a categorical distribution with parameters $p_{\text{S3}}$. (2) If $S_1 = \text{S1}$ or $S_1 = \text{S2}$, then the duration $D_1$ takes values between 1 and $D_{\max}$. There is no constraint on the OEGMA location for S1 and S2 segments, so $O_1$ can take any value without affecting the likelihood function; we set it to 0 for convenience. This support is denoted as $\mathcal{X}_{\text{S1}}$, and the conditional probability follows a categorical distribution with parameters $p_{\text{S1}}$; $\mathcal{X}_{\text{S2}}$ and $p_{\text{S2}}$ are defined similarly.

In summary, the conditional probability $P(D_1, O_1 \mid S_1)$ follows a categorical distribution with parameters $p = (p_{\text{S1}}, p_{\text{S2}}, p_{\text{S3}})$. The three sub-vectors correspond to the three supports, respectively. This gives:

$$
P(D_1 = d, O_1 = o \mid S_1 = i) = p_i(d, o), \qquad (d, o) \in \mathcal{X}_i. \tag{3}
$$
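The constrained support for S3 segments can be enumerated directly; a minimal sketch (function and variable names are ours):

```python
def s3_support(d_min, d_max):
    """All admissible (duration, OEGMA location) pairs for an S3 segment:
    duration in [d_min, d_max]; OEGMA at 0 (absent) or at positions 3..duration-2."""
    pairs = []
    for d in range(d_min, d_max + 1):
        pairs.append((d, 0))          # segment with no OEGMA
        for o in range(3, d - 1):     # positions 3, ..., d - 2
            pairs.append((d, o))
    return pairs
```

For example, `s3_support(7, 7)` contains the four pairs (7, 0), (7, 3), (7, 4), and (7, 5).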

$(S_k, D_k, O_k)$, $k \ge 2$: For positions $k = 2, \dots, K$, the latent variables $(S_k, D_k, O_k)$ evolve as a time-homogeneous Markov chain, taking values in the same joint space as the one described for $(S_1, D_1, O_1)$. The graphical model diagram is illustrated in Figure 2. The transition distribution with parameters $a$ is defined as $a_{ij} = P(S_{k+1} = j \mid S_k = i, D_k = 1)$.

As the position proceeds we note the following. (1) If $D_k > 1$, there is no transition of the hidden segment state, i.e., $S_{k+1} = S_k$. The remaining duration of the hidden segment state deterministically decreases by 1 until it reaches 1, i.e., $D_{k+1} = D_k - 1$, and the OEGMA location is carried over, $O_{k+1} = O_k$. (2) If $D_k = 1$, the RHP transitions to a different hidden segment state $j$ with probability $a_{ij}$, followed by a new pair $(D_{k+1}, O_{k+1})$ generated according to the categorical probability parameter $p_j$. We disallow transitions between S2 and S3 due to biological constraints, i.e., $a_{\text{S2},\text{S3}} = a_{\text{S3},\text{S2}} = 0$. Thus, we have:

$$
P\big(S_{k+1}=j, D_{k+1}=d', O_{k+1}=o' \mid S_k=i, D_k=d, O_k=o\big) =
\begin{cases}
\mathbb{1}\{j = i,\ d' = d - 1,\ o' = o\}, & d > 1,\\[4pt]
a_{ij}\, p_j(d', o'), & d = 1.
\end{cases}
\tag{4}
$$
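The two transition cases above can be sketched as a single update on the latent triple; `a` and `p` are dictionaries of categorical probabilities in our own encoding, not the paper's notation:

```python
import random

def latent_step(s, d, o, a, p, rng=random):
    """One step of the latent chain: deterministic countdown while d > 1,
    otherwise a state transition followed by a fresh (duration, OEGMA) draw."""
    if d > 1:
        return s, d - 1, o
    s_new = rng.choices(list(a[s]), weights=list(a[s].values()))[0]
    d_new, o_new = rng.choices(list(p[s_new]), weights=list(p[s_new].values()))[0]
    return s_new, d_new, o_new
```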

$Y_k$: The observed random variable $Y_k$ is conditionally independent of all the other variables given $(S_k, D_k, O_k)$. $Y_k$ takes values in the space of the four monomers {MMA, OEGMA, EHMA, SPMA}. The emission distribution follows a categorical distribution with parameters $B = (B_1, B_2, B_3, B_4)$. The four sub-vectors of $B$ are used conditional on different values of $(S_k, D_k, O_k)$, as shown in Table 1.

Emission distribution | Support of $Y_k$
---|---
$B_1$ | {MMA, OEGMA, EHMA, SPMA}
$B_2$ | {MMA, OEGMA, EHMA}
$B_3$ | {MMA, EHMA}
$B_4$ | {OEGMA}

Specifically, $B_1$ is for the categorical distribution over {MMA, OEGMA, EHMA, SPMA} when $S_k = \text{S1}$. $B_2$ is for the categorical distribution over {MMA, OEGMA, EHMA} when $S_k = \text{S2}$, because we assume the SPMA monomer appears only in S1 segments. $B_3$ is for the categorical distribution over {MMA, EHMA} if $S_k = \text{S3}$ and $D_k \neq O_k$ or $O_k = 0$: it is used when the current position is within an S3 segment but no OEGMA is emitted, either because $D_k$ hasn't reached the designated OEGMA location or because the current segment is made up of purely MMA and EHMA. $B_4$ is for the categorical distribution over {OEGMA} if $S_k = \text{S3}$ and $D_k = O_k > 0$; it degenerates to a deterministic value. It is used when the duration variable decreases to the designated OEGMA location $O_k$, indicating that the only OEGMA should be emitted in the current segment. Thus we have:

$$
P(Y_k = y \mid S_k = i, D_k = d, O_k = o) =
\begin{cases}
B_1(y), & i = \text{S1},\\
B_2(y), & i = \text{S2},\\
B_3(y), & i = \text{S3},\ d \neq o \text{ or } o = 0,\\
B_4(y) = \mathbb{1}\{y = \text{OEGMA}\}, & i = \text{S3},\ d = o > 0.
\end{cases}
\tag{5}
$$
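The choice among the emission sub-vectors can be written as a lookup; `B` below holds three categorical distributions keyed by state (our own encoding), with the degenerate OEGMA case handled explicitly:

```python
def emission_dist(s, d, o, B):
    """Return the categorical emission distribution for latent triple (s, d, o)."""
    if s == "S3" and o > 0 and d == o:
        return {"OEGMA": 1.0}   # B4: the designated OEGMA is emitted here
    return B[s]                  # B1 (S1), B2 (S2), or B3 (S3 interior)
```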

This probabilistic model incorporates the aforementioned constraints on RHP systems, and allows for explicit modeling of the hidden states' durations. The complete likelihood function for one RHP chain is:

$$
\begin{aligned}
P(Y, S, D, O \mid \theta)
&= P(S_1)\,P(D_1, O_1 \mid S_1)\prod_{k=1}^{K-1} P(S_{k+1}, D_{k+1}, O_{k+1} \mid S_k, D_k, O_k)\prod_{k=1}^{K} P(Y_k \mid S_k, D_k, O_k)\\
&= \exp\big\{\langle \eta(\theta),\, t(Y, S, D, O)\rangle\big\}.
\end{aligned}
\tag{6}
$$

In the last line of Equation 6, we plug in Equations 2 to 5 and simplify Equation 6 further to an exponential family. Here, $\theta = (\pi, p, a, B)$. $\eta(\theta) = \log\theta$ is the canonical parameter vector, where $\log$ is applied element-wise. We denote $\eta_\pi = \log\pi$, with $\eta_p$, $\eta_a$, $\eta_B$ similarly defined. $t(Y, S, D, O)$ is the vector of complete-data sufficient statistics. All sequences are i.i.d. given the parameter $\theta$, but for clarity we have suppressed indices for i.i.d. replicas of RHP sequences so far. Denote $y^{(n)}$ as the observations of the $n$-th sequence, $n = 1, \dots, N$. Denote $y^{(\mathcal{B})}$ as all RHP sequences in the index set $\mathcal{B}$, i.e., $y^{(\mathcal{B})} = \{y^{(n)} : n \in \mathcal{B}\}$. $s^{(\mathcal{B})}$, $d^{(\mathcal{B})}$, and $o^{(\mathcal{B})}$ are defined similarly.
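Putting Equations 2 to 5 together, the complete log-likelihood of one fully observed chain can be accumulated term by term; the dictionary-based parameterization below is a hypothetical sketch of Equation 6, not the paper's implementation:

```python
import math

def complete_loglik(y, s, d, o, pi, p, a, B):
    """log P(Y, S, D, O | theta) for one chain, following Eqs. (2)-(5).
    pi, a: dicts of categorical probabilities; p: (duration, OEGMA) probabilities;
    B: per-state emission distributions (the S3 entry covers the no-OEGMA case)."""
    ll = math.log(pi[s[0]]) + math.log(p[s[0]][(d[0], o[0])])   # Eqs. (2)-(3)
    for k in range(len(y) - 1):                                  # Eq. (4)
        if d[k] == 1:  # segment boundary: state transition + new (duration, OEGMA)
            ll += math.log(a[s[k]][s[k + 1]]) + math.log(p[s[k + 1]][(d[k + 1], o[k + 1])])
    for k in range(len(y)):                                      # Eq. (5)
        if s[k] == "S3" and o[k] > 0 and d[k] == o[k]:
            ll += 0.0 if y[k] == "OEGMA" else float("-inf")      # degenerate emission
        else:
            ll += math.log(B[s[k]][y[k]])
    return ll
```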

## 4 Stochastic Variational Methods for the GHSMM-RHP

Variational methods have been used for approximate inference under different scenarios when exact solutions are intractable or too expensive to achieve (Bishop, 2006; Wainwright and Jordan, 2008). To develop more efficient algorithms in modern large-scale learning problems, stochastic optimizations are introduced to bypass the need to iterate through the entire datasets in each iteration (Robbins and Monro, 1951). Combining variational methods and stochastic optimizations helps alleviate the computational complexity of estimations in the GHSMM-RHP model.

In particular, in the Bayesian setting where parameters are viewed as random variables, we coupled stochastic natural gradient descent (Amari, 1998) with variational Bayes following the work of Hoffman et al. (2013). In the frequentist setting where parameters are viewed as fixed but unknown, we coupled the stochastic Frank-Wolfe algorithm (Reddi et al., 2016) with variational expectation-maximization.

### 4.1 Stochastic Variational Bayesian

In this section, we describe the stochastic variational Bayesian (SVB) algorithm for our GHSMM-RHP model. For the complete likelihood function in Equation 6, we consider the conditionally conjugate exponential family (Bernardo and Smith, 2009) with Dirichlet priors. Here, the prior parameters $\alpha_\pi$, $\alpha_p$, $\alpha_a$, $\alpha_B$ are for $\pi$, $p$, $a$, $B$, respectively. These add to the full model with:

$$
\pi \sim \mathrm{Dir}(\alpha_\pi), \qquad p_i \sim \mathrm{Dir}(\alpha_p), \qquad a_i \sim \mathrm{Dir}(\alpha_a), \qquad B_j \sim \mathrm{Dir}(\alpha_B), \tag{7}
$$

where $i$ ranges over the hidden states and the sub-vectors $B_j$ refer to the same segments as described in Table 1. The complete conditional distribution of the local latent variables can be written from Equation 6 as:

$$
P(S, D, O \mid Y, \theta) = \exp\big\{\langle \eta(\theta),\, t(Y, S, D, O)\rangle - A(\eta(\theta))\big\}. \tag{8}
$$

Here, the cumulant function $A(\eta(\theta)) = \log \sum_{s, d, o} \exp\{\langle \eta(\theta), t(Y, s, d, o)\rangle\} = \log P(Y \mid \theta)$. Define its conjugate function as $A^*(\mu) = \sup_{\eta}\{\langle \mu, \eta\rangle - A(\eta)\}$. Note the analogous term for all $N$ sequences is $\sum_{n=1}^{N} A(\eta(\theta); y^{(n)})$.

We then specify a family of variational distributions $\mathcal{Q}$, aiming to approximate the true posterior distribution over all latent variables in terms of KL divergence. We make the mean-field assumption in the variational family $\mathcal{Q}$: independence between $\theta$ and $(S, D, O)$, while structures are kept within $(S, D, O)$, namely $q(\theta, s, d, o) = q(\theta)\, q(s, d, o)$. It can be argued that the optimal variational distributions take the same forms of exponential families as in Equations 7 and 8 (Bishop, 2006). That is, $q(\theta)$ is a product of Dirichlet densities with variational parameters $\hat\alpha$, and $q(s, d, o) = \exp\{\langle \hat\eta, t(y, s, d, o)\rangle - A(\hat\eta)\}$ with local variational parameters $\hat\eta$. Here, different RHP sequences share the same local variational parameters $\hat\eta$. The sequence-specific information is decoupled into the sufficient statistics $t(y^{(n)}, s^{(n)}, d^{(n)}, o^{(n)})$. Using $\mathbb{E}_q$ to denote the corresponding expectations, the evidence lower bound function (ELBO) is:

$$
\mathrm{ELBO}(\hat\alpha, \hat\eta) = \mathbb{E}_{q}\big[\log p(y^{(1:N)}, s^{(1:N)}, d^{(1:N)}, o^{(1:N)}, \theta)\big] - \mathbb{E}_{q}\big[\log q(\theta)\, q(s^{(1:N)}, d^{(1:N)}, o^{(1:N)})\big], \tag{9}
$$

which reduces to inner products $\langle\cdot,\cdot\rangle$ between variational parameters and expected sufficient statistics once the specific forms of the variational distributions are substituted. Note here the cumulant function $A(\eta(\theta)) < \infty$ with probability one if $\theta$ follows Dirichlet distributions. Maximizing the ELBO function is equivalent to finding the best approximation in $\mathcal{Q}$ to the true posterior distribution. One way to do this is through stochastic natural gradient descent (Hoffman et al., 2013), iteratively updating the local variational distribution $q^{(t)}(s, d, o)$ and the global variational density $q^{(t)}(\theta)$. Here $t$ is the number of iterations.

E-step: solving $\max_{\hat\eta} \mathrm{ELBO}(\hat\alpha^{(t)}, \hat\eta)$. Convexity in $\hat\eta$ yields the update $\hat\eta^{(t+1)} = \mathbb{E}_{q(\theta \mid \hat\alpha^{(t)})}[\eta(\theta)]$.

M-step: solving $\max_{\hat\alpha} \mathrm{ELBO}(\hat\alpha, \hat\eta^{(t+1)})$. Using stochastic natural gradient descent yields the update:

$$
\hat\alpha^{(t+1)} = (1 - \rho_t)\,\hat\alpha^{(t)} + \rho_t\Big(\alpha + \frac{N}{|\mathcal{B}_t|}\sum_{n \in \mathcal{B}_t}\mathbb{E}_{q^{(t+1)}}\big[t(y^{(n)}, s^{(n)}, d^{(n)}, o^{(n)})\big]\Big), \tag{10}
$$

where $\mathcal{B}_t$ is the randomly chosen mini-batch and $\rho_t$ is the step size at the $t$-th iteration. To calculate the expected sufficient statistics, we implemented a message-passing recursion which is a special version of the junction tree algorithm (Lauritzen and Spiegelhalter, 1988). More details on the update formulas can be found in Supplementary Materials section B.
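The shape of this M-step — a convex combination of the current variational parameter and a rescaled mini-batch estimate of the full-data optimum — can be sketched as follows (NumPy, with our own variable names):

```python
import numpy as np

def svb_global_update(alpha_hat, alpha0, batch_stats, N, rho):
    """One stochastic natural-gradient step on Dirichlet variational parameters.
    batch_stats: expected sufficient statistics for each sequence in the mini-batch."""
    noisy_optimum = alpha0 + N * np.mean(batch_stats, axis=0)  # rescale batch to full data
    return (1.0 - rho) * alpha_hat + rho * noisy_optimum        # natural-gradient step
```

With step size `rho = 1`, the update jumps directly to the noisy mini-batch optimum, recovering the coordinate-ascent update on a full batch.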

### 4.2 Stochastic Variational Expectation Maximization

Under the frequentist setting, the complete likelihood in Equation 6, the complete conditional distributions in Equation 8, and the local variational distributions remain the same as those in SVB. Thus, stochastic variational expectation-maximization (SVEM) for the GHSMM-RHP model carries similar ELBO and parameter updates to the Bayesian approach:

$$
\mathrm{ELBO}(\theta, \hat\eta) = \sum_{n=1}^{N}\Big(\mathbb{E}_{q}\big[\log p(y^{(n)}, s^{(n)}, d^{(n)}, o^{(n)} \mid \theta)\big] - \mathbb{E}_{q}\big[\log q(s^{(n)}, d^{(n)}, o^{(n)})\big]\Big). \tag{11}
$$

E-step: solving $\max_{\hat\eta} \mathrm{ELBO}(\theta^{(t)}, \hat\eta)$, yielding the update $\hat\eta^{(t+1)} = \eta(\theta^{(t)})$.

M-step: solving $\max_{\theta} \mathrm{ELBO}(\theta, \hat\eta^{(t+1)})$. Using the stochastic Frank-Wolfe algorithm yields the update:

$$
\theta^{(t+1)} = (1 - \rho_t)\,\theta^{(t)} + \rho_t\, e_{j_t}, \qquad j_t = \arg\max_{j}\ \Big\langle e_j,\ \nabla_\theta\, \mathrm{ELBO}\big(\theta^{(t)}, \hat\eta^{(t+1)}; y^{(\mathcal{B}_t)}\big)\Big\rangle, \tag{12}
$$

where $e_j$ is the basis vector with 1 at the $j$-th element and 0 otherwise, $\mathcal{B}_t$ is the randomly chosen mini-batch, and $\rho_t$ is the step size at the $t$-th iteration.
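A Frank-Wolfe step on a probability simplex reduces to moving toward the best vertex, which keeps the iterate on the simplex without any projection; a minimal sketch:

```python
import numpy as np

def frank_wolfe_simplex_step(theta, grad, rho):
    """theta: current point on the simplex; grad: (mini-batch) gradient of the objective.
    The linear subproblem over the simplex is solved by the vertex with the largest
    gradient coordinate; the step is a convex combination toward that vertex."""
    vertex = np.zeros_like(theta)
    vertex[int(np.argmax(grad))] = 1.0
    return (1.0 - rho) * theta + rho * vertex
```

Because each update is a convex combination of simplex points, the constraint that the elements sum to one is preserved automatically.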

There are some notable differences from the SVB algorithm. In SVB, the cumulant function $A(\hat\eta^{(t+1)})$ has a closed form, and the M-step involves evaluating expected sufficient statistics with respect to the local variational distribution $q(s, d, o)$. In SVEM, the cumulant function must be evaluated for parameters that consist of sub-normalized probability vectors (e.g., the sum of the elements in the updated parameter vector may not be one). Besides the message-passing recursion used in SVB, the M-step in SVEM requires a new collecting-distributing pass to calculate the unconditional expectation regarding the complete likelihood. More details on message-passing can be found in Supplementary Materials section A. Additionally, SVEM involves constrained optimization on probability simplexes in order to guarantee the interpretability of the parameter estimates (i.e., the sum of the elements in each probability vector remains one). In SVB, the parameter constraints are automatically satisfied after every update.

## 5 Results

We implemented SVB and SVEM, and compared their performance using a comprehensive simulation study and a real data study from Jiang et al. (2020).

### 5.1 Simulation Study

#### 5.1.1 Setup

Data Generation This study contains 20 simulated heteropolymer (SHP) datasets labeled SHP1, …, SHP20. They were generated following Equation 6. The study combines ten sets of predetermined parameters and two different values of $D_{\min}$, the shortest length allowed for S3 segments. Table 2 illustrates the parameters for SHP1 and SHP2. In the simulation, each set of parameters was combined with the two values of $D_{\min}$ to generate a pair of "twin" SHP datasets; e.g., SHP1 and SHP2 form a twin pair, SHP3 and SHP4 form a twin pair, et cetera. That is, each twin pair of SHPs shares the same initial probabilities $\pi^*$, transition probabilities $a^*$, and emission probabilities $B^*$, but differs in the duration probabilities $p^*$, since their support relies on $D_{\min}$. Thus, twin SHPs are more similar to each other than to the rest of the datasets. This design ensures that the simulated SHP systems have a wide range of heterogeneity, with some being more similar and others being very different.

Simulated data | Shared parameters | Duration parameters
---|---|---
SHP1 | $\pi^*, a^*, B^*$ | $p^*$ with $D_{\min} = 7$
SHP2 | $\pi^*, a^*, B^*$ | $p^*$ with $D_{\min} = 9$

In more detail, the elements in every set of parameters were randomly sampled. Recall that the supports of the duration parameters $p$, namely $\mathcal{X}_{\text{S3}}$ and $\mathcal{X}_{\text{S1}}$, $\mathcal{X}_{\text{S2}}$, rely on $D_{\min}$ and $D_{\max}$ (hence the notation in Table 2). $D_{\min}$ was set to 7 or 9, and $D_{\max}$ was fixed to 25. These values are based on the current knowledge of RHP systems. For each parameter setup, we generated one SHP dataset comprising 5000 chains, each at a length of 130 monomers. In each SHP dataset, 500 chains were held out and used as the test set, while the other 4500 chains were used as the training set.
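The generative process used for the SHP datasets can be sketched end to end; the parameter dictionaries below are toy placeholders, not the values used in the paper:

```python
import random

def sample_chain(K, pi, p, a, B, rng):
    """Sample one chain of length K from the generative process of Eqs. (2)-(5)."""
    s = rng.choices(list(pi), weights=list(pi.values()))[0]
    d, o = rng.choices(list(p[s]), weights=list(p[s].values()))[0]
    chain = []
    for _ in range(K):
        if s == "S3" and o > 0 and d == o:
            chain.append("OEGMA")   # the designated OEGMA position in an S3 segment
        else:
            dist = B[s]
            chain.append(rng.choices(list(dist), weights=list(dist.values()))[0])
        if d > 1:
            d -= 1                   # countdown within the current segment
        else:                        # segment boundary: new state, duration, OEGMA
            s = rng.choices(list(a[s]), weights=list(a[s].values()))[0]
            d, o = rng.choices(list(p[s]), weights=list(p[s].values()))[0]
    return chain
```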

Training Settings For each of the 20 simulated datasets, both the SVB and SVEM algorithms were applied with various hyper-parameters. Specifically, we used 10 random initializations in each algorithm. For every initialization, we tried 15 learning rates: in both SVB and SVEM the step size took the form $\rho_t = (t + \text{decay})^{-\text{rate}}$, with 15 combinations of the decay and rate values. They become the vanilla stochastic (conditional) gradient method when the decay equals 0 and the rate equals 1. During training we also explored varying values of the batch size $|\mathcal{B}_t|$ (in Equations 10 and 12) and the shortest segment length $D_{\min}$. However, we found that neither algorithm was sensitive to these two hyper-parameters, and we do not report them here for brevity. The values of $|\mathcal{B}_t|$ and $D_{\min}$ were fixed to 48 and 7, respectively.

In summary, we had 150 training settings. Under each of the 10 initializations, we tried 15 learning rates and reported the one that gave the highest value of the ELBO objective function.

In the following subsections, we empirically studied the two stochastic variational methods and showcased the computational instability rooted in SVB. To the best of our knowledge, there has been little effort in the literature to compare their performance or to theoretically characterize their convergence behavior. Our comparison between SVB and SVEM in RHP applications may shed light on the understanding of Bayesian versus frequentist stochastic variational methods.

#### 5.1.2 Performance Evaluation

Point Estimation.
Through the GHSMM-RHP model, we obtained two point estimates for $\theta$: we implemented SVB and used the posterior mean as the estimate under the Bayesian framework, and we implemented SVEM and used the approximate maximum likelihood estimate under the frequentist framework. Figure 3 shows the statistical error of the two estimates on different SHPs. For simplicity, only one out of every pair of twin SHPs is presented, because we observed that twin datasets, such as SHP1 and SHP2, had almost identical results. This indicates that the variational objectives of our model (i.e., Equations 9 and 11) are not sensitive to the parameter $D_{\min}$.¹

¹ This relates to the formulation of the objective function itself (for more discussion, see for example Giordano et al. (2018); Wainwright and Jordan (2008, Chapter 7)), and is beyond the scope of the current work, which focuses on algorithmic issues.

Figure 3 shows that SVEM consistently produces low statistical error that is comparable to the best achieved by SVB. However, the SVB algorithm appears more sensitive to the local optima issue and may require "better" initializations (e.g., closer to the truth) than SVEM, even though the global optimizers of variational objectives have arguably good statistical properties (Wang and Blei, 2019). That is, to obtain a near-optimal solution with SVB, more tuning effort may be needed, such as trying more random initializations or using a proper pilot algorithm. Additionally, we observed that SVB took fewer iterations to converge in this study. Note that the running time per iteration is of the same order for both methods, because roughly equal numbers of message passes are required in the E-step.

Segment Prediction. We followed the Viterbi algorithm (Viterbi, 1967) and predicted the hidden segment states on each test set of the 20 SHPs, using parameter estimates from SVB and SVEM. Figure 4 summarizes the findings. As in Figure 3, only one out of every pair of “twin” SHPs is presented because “twin” data sets have similar results.

Panel (a) measures the overall prediction accuracy for segment states S1, S2, and S3, based on the 0-1 loss function: it simply calculates the proportion of monomers whose states are correctly predicted when compared to the true labels. Panel (b) presents the Jaccard index concerning solely the accuracy of the hydrophobic S3 segments: it calculates the ratio of two numbers, the number of correctly predicted S3 monomers and the size of the union of the ground-truth and predicted S3 monomers. Compared to S1 and S2, S3 appears less frequently and is more functionally interesting because it has the potential to form a transport tunnel. Black cross dots mark the test accuracy under the true parameters of each SHP.

Figure 4 shows that SVB and SVEM produced comparable accuracy when trained with proper initializations. Note that SHP11 and SHP15 present a lower Jaccard index for S3 segments (shown in panel (b)) when compared to the overall accuracy (shown in panel (a)). We observed that there are fewer hydrophobic segments in SHP11 and SHP15 than in the other SHPs. So, even though predictions are relatively poor for SHP11 and SHP15, the overall accuracy illustrated in panel (a) is not affected much, due to the dominance of S1 and S2 in the chains.

Figure 4 also shows that predictions (using the Viterbi algorithm) based on SVB estimates remain less robust to random initializations, while SVEM produces consistently better predictions as assessed by both the overall accuracy and the Jaccard index. However, on some datasets (e.g., SHP3), the segment prediction based on SVB estimates remains stable (shown in Figure 4) despite SVB's lack of robustness in terms of statistical error (shown in Figure 3). As can be seen, the accuracy based on the SVB and SVEM estimates is comparable to the accuracy based on the true parameters.

#### 5.1.3 Applications

Designing bio-functional polymers relies on understanding the heterogeneity of RHP chains under various overall monomer distributions. Through GHSMM-RHP, we can statistically quantify the relationships among different RHP datasets using the posterior predictive probability in the Bayesian scheme, $p(y_{\text{new}} \mid y^{(1:N)}) = \int p(y_{\text{new}} \mid \theta)\, p(\theta \mid y^{(1:N)})\, d\theta$, which is the average likelihood of the new sequences based on the distribution of $\theta$ learned from the observed data. Greater similarity indicates that new sequences may share more similar chemical properties with the training sequences. Several lines of empirical research have shown that predictive inference based on variational Bayes is similar to fully Bayesian predictive inference based on Markov chain Monte Carlo (MCMC) (Blei et al., 2006; Braun and McAuliffe, 2010). Nott et al. (2012) also used variational approximations to the posterior predictive probability for cross-validation-based model selection.

There are several ways to approximate the intractable posterior predictive probability. One popular method is to replace $p(\theta \mid y^{(1:N)})$ with the variational posterior $q(\theta)$ and construct a Monte Carlo approximation. The time complexity is high, however, since it requires as many rounds of message-passing as there are Monte Carlo samples. Beal (2003) proposed a simple lower bound whose computation requires only one round of message-passing, as shown in Supplementary Materials section A. However, it is based on a bound of the approximation to the posterior predictive probability, which may raise concerns in some situations. The third method is to concentrate the variational distribution $q(\theta)$ at its posterior mean $\bar\theta$, a point mass, such that $p(y_{\text{new}} \mid y^{(1:N)}) \approx p(y_{\text{new}} \mid \bar\theta)$. The computation requires only one round of message-passing, the same as in the second method.

Our simulations showed that the above three methods gave similar numerical values for the posterior predictive probability. We chose the third method as it avoids the sample-based time complexity of the first method, and serves as a relia