Introduction
The hidden Markov model (HMM) is a probabilistic formalism for reasoning about sequences of random variables, which naturally arise in temporal (e.g., in natural language) and spatial settings (e.g., in images and biological sequences). HMMs are especially valued for their efficient inference, broad applicability, and rich extensibility. In the traditional context, an HMM enables one to reason about a sequence of unobserved state variables that evolve under a Markov assumption, based only on a sequence of observations corresponding to the hidden state sequence
[Bishop2006]. Here, we extend this model to incorporate additional side information. Specifically, we assume that in addition to the sequential observations, we have access to a corresponding vector of real-valued “response variables” that provide summary information about the hidden state sequence (if this vector is empty, we recover a traditional HMM). We are interested in the following dual task: simultaneously learning the
hidden state sequence from the sequential observations and summary response variables (sequence annotation), as well as the mapping function that produces real-valued response variables from the hidden state sequence (sequence regression). Although applications exist in many domains, this extension of the traditional HMM was originally motivated by a large collection of biological applications. For example, promoter DNA sequences can be complemented by promoter activity to learn a model that annotates the DNA sequence and simultaneously predicts the activity level of putative promoters. In this paper, we focus on another question: understanding protein-DNA binding preferences, which is also important for unraveling gene regulation. We are particularly interested in learning how transcription factors (TFs) exhibit sequence binding preferences (motifs), in our case modeled as position weight matrices (PWMs)
[Berg and von Hippel1987]. In situations where bound genomic DNA sequences are available, HMMs have been applied to jointly learn binding locations and motifs in an unsupervised manner [Bailey and Elkan1994, Lawrence et al.1993]. However, newer protein binding microarray (PBM) technology allows us to learn motifs from in vitro sequence data [Mukherjee et al.2004]. In a typical PBM, about 40,000 synthetic DNA sequences (probes) are placed on a microarray, and the binding affinity of a TF for each probe is reported in the form of a real-valued fluorescence intensity. Thus, motif inference on PBM data requires jointly modeling DNA sequences and real-valued measurements, which poses new challenges for model design. Many algorithms have been developed for analyzing PBM data. Approaches based on K-mers (K-grams of DNA sequence) feed occurrences of each K-mer as features into more sophisticated models [Weirauch et al.2013, Berger et al.2006]. RankMotif++ [Chen, Hughes, and Morris2007]
maximizes the likelihood of binarized preference observations. Other approaches attempt to jointly learn a probabilistic model. For example, BEEML-PBM [Zhao and Stormo2011] estimates parameters based on a biophysical energy model. One advantage of probabilistic models over K-mer models is that binding preferences can be directly described via PWMs. DeepBind [Alipanahi et al.2015] uses a deep convolutional neural network to improve accuracy. A systematic survey of several deterministic and probabilistic models was provided by Weirauch et al.
[Weirauch et al.2013]. In a PBM scenario, the observed responses reflect the average intensity level over time and probe instances, i.e., over a cluster of probes with the same DNA sequence. Probabilistic frameworks such as HMMs can represent the probability distribution over all possible binding configurations, and thus may have advantages over deterministic methods.
In this paper, we develop inference methods, collectively called RegHMM, that efficiently estimate both the hidden state sequence and the various parameters of the regression-coupled HMM. The highlights are:

A probabilistic generative model that jointly considers sequential observations and multiple real-valued responses.

Customized link functions, summary functions, and position-wise transition probabilities, providing a rich and flexible functional mapping.

Fast and efficient inference with approximate expectation-maximization via Viterbi path integration.

Our methodology enables prediction at least as accurate as any other approach to date, while learning a much more interpretable model.
Related Work
Many methods have considered the task of mapping sequential observations to categorical or real-valued responses. Discriminative HMMs use Bayesian model comparison via cross-entropy for sequence classification. Previous algorithms [Collins2002, Ben-Yishai and Burshtein2004, Sak et al.2014] investigated discriminative training of HMMs or recurrent neural networks (RNNs) using maximum mutual information (MMI). [Srivastava et al.2007] uses a profile HMM to classify biological sequences. However, many of these methods cannot straightforwardly be extended to regression problems. Kernel methods [Leslie, Eskin, and Noble2002] have also been devised to map sequential data to categorical or real-valued responses; however, these kernel methods usually employ predefined kernels, and the learned latent features encoding the original sequences are less interpretable than those coming from HMMs. Our approach, in contrast, can be applied to both classification and regression tasks and directly learns the hidden features explaining the sequences.
Previous research has also examined the use of continuous side information to complement sequential observations when learning latent features. Input-output HMMs [Bengio and Frasconi1996] and TRBMs [Zeiler et al.2011]
consider the task of supervised learning where the observed variables influence both latent and output variables. However, these approaches assume that input observations for each time point are known. GCHMMs
[Fan et al.2015] consider an HMM whose transition probabilities are associated with side information. CRBMs [Taylor and Hinton2009] incorporate side information into the factorization of the weights in conditional RBMs. The RUPA method [Noto and Craven2008] learns an HMM with additional responses via augmented visit aggregation. However, its point-estimate approximation over expected visits ignores the true distribution of visits; as a result, convergence problems may arise, especially when the distribution of visits has multiple modes. In contrast, our method uses Viterbi path integration to directly approximate the true posterior of visits. In summary, we treat the responses both as extra information for learning a latent representation of the sequential observations and as variables to predict. Considering this dual task simultaneously enforces interpretability while maintaining predictive power.
Learning a coupled regression HMM model
Learning task
We propose the RegHMM (regression-coupled HMM), which can be understood as a combination of an HMM and regression. As shown in Figure 1, suppose we have N sequential observations of length T; we denote x = {x_{n,t}} as the sequential observations and z = {z_{n,t}} as the hidden states. An HMM is defined by parameters θ = (E, A, π), where E is the emission matrix, A is the transition matrix, and π is the vector of initial probabilities. Further, the hidden variables z of the HMM are employed as covariates for regression tasks, based on the assumption that the real-valued responses y_{n,m} are generated from summary statistics s_{n,m} that encode the information of hidden state paths, where n = 1, …, N indexes sequences and m = 1, …, M indexes regression tasks.
As a generative process, the mth response for the nth sequential observation, y_{n,m}, is drawn from the hidden states encoded by the hidden variables z_{n,1:T} ∈ {1, …, K}^T, where K is the total number of discrete states. This generative process of responses can be formally represented as
    y_{n,m} ~ N( g_m( f_m(z_{n,1:T}) ), σ_m² ),
where f_m first summarizes all hidden states z_{n,1:T} via a summary function, such as counting the visits to certain states. These hidden state summaries, s_{n,m} = f_m(z_{n,1:T}), are then transformed through a link function g_m, yielding the response means μ_{n,m} = g_m(s_{n,m}). Finally, y_{n,m} is drawn from a Gaussian distribution with mean μ_{n,m}. The standard deviation σ_m is shared across samples and learned from the data. The link function can take either a linear form, g(s) = a + b·s, or a generalized hyperbolic tangent (tanh) form, g(s) = a + b·tanh(c(s − d)), where a and b characterize the intercept and slope of g, while c and d characterize its shape and center position. We let φ_m = {a_m, b_m, c_m, d_m} denote all parameters defining the link function for the mth task.
The learning aim is to jointly estimate the regression parameters φ describing the link functions, as well as the HMM parameters θ.
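As a concrete illustration, the generative step for a single response can be sketched as follows. The state encoding, parameter values, and function names below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def tanh_link(s, a, b, c, d):
    # Generalized tanh link g(s) = a + b * tanh(c * (s - d)):
    # a, b give the intercept/slope; c, d give the shape and center position.
    return a + b * np.tanh(c * (s - d))

def sample_response(z, bound_states, a, b, c, d, sigma):
    # Summary function f: count visits to the bound (motif) states along the path.
    s = int(np.isin(z, bound_states).sum())
    mu = tanh_link(s, a, b, c, d)  # link maps the summary to the response mean
    return rng.normal(mu, sigma), s

# Toy hidden path over states {0: background; 1, 2: motif positions}.
z = np.array([0, 0, 1, 2, 0, 1, 2, 0])
y, s = sample_response(z, bound_states=[1, 2], a=0.0, b=2.0, c=0.5, d=1.0, sigma=0.1)
```

Here the path visits four bound positions, so the response is drawn around g(4) = 2·tanh(1.5).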
Analyzing PBM data
In this section, we consider an application of RegHMM to PBM data analysis. In particular, we desire to infer the emission matrix, representing the binding preference, from probe sequences and real-valued responses, where each sequential observation corresponds to a single response. Here and throughout the rest of this paper, we only discuss the case of a single response task (M = 1); thus, for simplicity, the task subscript is omitted. Further details on multiple regression tasks are provided in the supplements (http://people.duke.edu/~yz196/pdf/AAAI_suppl.zip).
In this scenario, we have a background (non-binding) hidden state, denoted B, and motif hidden states indicating being bound by the assayed transcription factor: sense states M_1, …, M_L and antisense states M'_1, …, M'_L, where L denotes the motif length. Hidden variables are enforced to traverse through all subsequent motif states once they enter a motif start state, M_1 or M'_1, as shown in Figure 2. Corresponding sense and antisense states have complementary emissions (the antisense emission for a base equals the sense emission for its complement at the mirrored motif position). The transition probabilities from the background state to the motif start states characterize the prior information on the frequency of binding. In a real scenario, this frequency may vary along the probe, because the free end of the probe is more susceptible to being bound [Orenstein and Shamir2014]. To model this positional bias, we introduce a position-dependent transition; specifically,
    p_t(B → M_1) = p_t(B → M'_1) = ρ_t,    p_t(B → B) = 1 − 2ρ_t,
where t indicates the position along the sequence. Note that in the PBM setting, the elements of the transition matrix other than those leaving the background state are deterministic (either 0 or 1). Thus, only the transitions from the background state to itself and to the two motif start states can vary along the sequence.
To align with the fact that signal intensities reflect the number of TFs bound, we explicitly choose the summary function f to count the total occurrences of a certain set of states S on each sequence, i.e.,
    f(z_{1:T}) = Σ_{t=1}^{T} 1[z_t ∈ S],
where 1[·] is the indicator function and S is the collection of states of interest. Three possible settings for S are examined: counting the motif “entering” states (the first motif position on either strand), counting the motif “exiting” states (the last motif position on either strand), and counting the “total bound” states (all motif positions on both strands).
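The three summary-function choices can be made concrete with a small sketch; the integer state encoding below is a hypothetical choice for a motif of length 3, not the paper's:

```python
# Hypothetical state encoding for a motif of length L = 3:
# 0 = background; 1-3 = sense motif positions; 4-6 = antisense motif positions.
ENTERING = {1, 4}               # first motif position on either strand
EXITING = {3, 6}                # last motif position on either strand
ALL_BOUND = {1, 2, 3, 4, 5, 6}  # every bound state

def summarize(z, states_of_interest):
    # f(z) = sum_t 1[z_t in S]: count positions whose state lies in S.
    return sum(1 for zt in z if zt in states_of_interest)

# A path passing through one sense and one antisense motif instance.
z = [0, 1, 2, 3, 0, 0, 4, 5, 6, 0]
counts = (summarize(z, ENTERING), summarize(z, EXITING), summarize(z, ALL_BOUND))
```

For this path the three settings give 2, 2, and 6, respectively: two motif instances, each contributing L = 3 bound positions.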
The link function g can be either linear or hyperbolic tangent. The motivation for a nonlinear link comes from the observation that probes can “saturate” with increasing binding instances, leading to a nonlinear association between the number of bound TFs and the signal-intensity responses [Berger and Bulyk2009]. Empirically, the nonlinear hyperbolic tangent link outperforms a linear one. When a nonlinear transformation via tanh is used, the mapping from hidden space to responses can be understood as a sigmoid belief network (SBN) with a single continuous output and constrained weights. Comparisons between different summary and link functions are provided in the supplements.
EM inference for coupled regression HMM model
Suppose we work on PBM data, where our goal is to use expectation-maximization (EM) to maximize the joint likelihood p(x, y | θ, φ) with respect to the HMM parameters θ and the regression parameters φ. A review of, and comparison with, the standard Baum-Welch algorithm, which maximizes p(x | θ) alone, can be found in the supplements.
In the E-step, we estimate the conditional distribution of the hidden state sequence for the nth sequence, p(z_n | x_n, y_n, θ′, φ′), and then calculate the Q function, i.e., the expected value of the complete log-likelihood, as
    Q(θ, φ; θ′, φ′) = Σ_n E_{z_n | x_n, y_n, θ′, φ′}[ log p(x_n, z_n, y_n | θ, φ) ]    (1)
                    = Σ_n E[ log p(x_n, z_n | θ) ] + Σ_n E[ log p(y_n | z_n, φ) ] ≜ Q_HMM + Q_reg,    (2)
where θ′, φ′ denote the model parameters estimated in the previous step. The expectation in (1) can be factorized as in (2) provided that x and y are conditionally independent given z. In the M-step, Q_HMM and Q_reg from (2) are then maximized.
We first examine Q_HMM. A key observation is that this expectation can be reformulated as expectations with respect to the posterior marginals p(z_{n,t} | x_n, y_n) and p(z_{n,t}, z_{n,t+1} | x_n, y_n). Thereby, the HMM parameters can be updated with procedures similar to Baum-Welch. The only difference lies in the fact that in standard Baum-Welch, p(z | x) can be computed quickly and exactly via the forward-backward algorithm, whereas in our case the conditional distribution of the hidden variables depends on both x and y, preventing us from directly applying forward-backward filtering. As discussed in the section above, the transition probabilities can vary along the sequence; this is easily handled via a position-wise update of the transition parameters.
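For intuition, the standard forward-backward recursion generalizes directly to position-dependent transition matrices. The sketch below (an illustrative, unscaled implementation suitable only for short toy sequences) computes the posterior state marginals p(z_t | x):

```python
import numpy as np

def forward_backward(pi, A_t, E, x):
    """Posterior marginals p(z_t | x) for an HMM whose transition matrix
    A_t[t] may differ at every position (position-wise transitions).
    Unscaled alpha/beta, so only suitable for short toy sequences."""
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K))
    beta = np.zeros((T, K))
    alpha[0] = pi * E[:, x[0]]
    for t in range(1, T):                        # forward pass
        alpha[t] = (alpha[t - 1] @ A_t[t - 1]) * E[:, x[t]]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):               # backward pass
        beta[t] = A_t[t] @ (E[:, x[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

# Toy 2-state, 2-symbol HMM whose background-to-motif probability varies with t.
pi = np.array([0.6, 0.4])
E = np.array([[0.9, 0.1],
              [0.2, 0.8]])
A_t = np.stack([np.array([[1.0 - 0.1 * t, 0.1 * t],
                          [0.3, 0.7]]) for t in range(4)])
gamma = forward_backward(pi, A_t, E, [0, 1, 1, 0])
```

In RegHMM this recursion alone is insufficient, since the posterior must also condition on y; it is the building block that the Viterbi path integration below augments.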
Denote x_n as the nth sequence. For the nth input,
    p(z_n | x_n, y_n) ∝ p(y_n | z_n) p(z_n | x_n),
where p(z_n | x_n) and p(y_n | z_n) can be cheaply obtained. However, the conditional distributions over the hidden summaries, p(s_n | x_n), are not straightforward. A naive approach would utilize a Gaussian distribution to approximate the true posterior, where the mean and variance of the Gaussian can be estimated analytically using propagation tricks similar to forward-backward calculations, as detailed in the supplements. Unfortunately, this approach only works when the distribution p(s_n | x_n) is unimodal, which is generally not the case.
Instead, we appeal to a Viterbi path integration approximation. The approximated distribution is obtained by integrating over a truncated hidden state sequence space, denoted here as V_n, where the number of included Viterbi paths is dynamically determined to cover 99% of the total probability mass; thus
    p(s_n | x_n) ≈ Σ_{z ∈ V_n} 1[f(z) = s_n] p(z | x_n) / Σ_{z ∈ V_n} p(z | x_n),
where the z ∈ V_n are the Viterbi paths conditioned on x_n, and 1[·] is an indicator function, since the summaries are deterministic given a hidden state sequence. The calculation of the posterior marginal is performed for each position and each possible state the hidden variable can take; a naive approach repeats the top-path computation for every such (position, state) pair, multiplying its cost by the sequence length and the number of states. To reduce the computational cost, we adopt the dynamic programming approach described in Algorithm 1, which incurs no additional asymptotic cost. Briefly, this procedure first propagates through the sequence to calculate a forward and a backward tensor while keeping track of the top paths, and then computes the conditional probabilities for all (position, state) pairs altogether. In Algorithm 1, the symbol ⊗ represents the Kronecker product, and V denotes the number of top paths. In the worst case, Viterbi path integration may require a large number of top Viterbi paths, which could be computationally prohibitive; however, in our experiments on PBM data, around 100 Viterbi paths usually covered 99% of the probability mass for most probe sequences.
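To make the approximation concrete, the sketch below computes p(s | x) exactly on a toy HMM by enumerating every hidden path; in RegHMM, the exhaustive sum would be replaced by the truncated set of top Viterbi paths covering roughly 99% of the mass. All names and parameter values here are illustrative:

```python
import itertools
import numpy as np

def summary_posterior(pi, A, E, x, motif_states):
    """p(s | x) for the count summary s = f(z), by brute-force enumeration of
    complete hidden paths. Exhaustive enumeration stands in for the truncated
    top-K Viterbi path set used by RegHMM on realistic sequence lengths."""
    T, K = len(x), len(pi)
    post = {}
    for z in itertools.product(range(K), repeat=T):
        p = pi[z[0]] * E[z[0], x[0]]
        for t in range(1, T):
            p *= A[z[t - 1], z[t]] * E[z[t], x[t]]
        s = sum(1 for zt in z if zt in motif_states)  # summary is deterministic given z
        post[s] = post.get(s, 0.0) + p
    total = sum(post.values())
    return {s: p / total for s, p in post.items()}

pi = np.array([0.7, 0.3])
A = np.array([[0.8, 0.2],
              [0.4, 0.6]])
E = np.array([[0.9, 0.1],
              [0.2, 0.8]])
dist = summary_posterior(pi, A, E, x=[0, 1, 0], motif_states={1})
```

The resulting distribution over counts can be multimodal, which is exactly why a single-Gaussian approximation of p(s | x) is inadequate.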
So far, we have considered maximizing Q_HMM with Viterbi path integration. We can rewrite the regression term Q_reg as
    Q_reg = Σ_n E_{s_n | x_n, y_n}[ log N(y_n; g(s_n), σ²) ].    (3)
Maximizing Q_reg can be considered a least-squares regression task with random-variable covariates s_n. If the link function between s and y is chosen to be linear, an analytic solution based on the expected summary sufficient statistics (ESSS), E[s_n] and E[s_n²], can be computed. For a tanh link function, coordinate gradient descent (CGD) can be utilized to update the intercept and slope parameters analytically using the ESSS, while the shape and position parameters can be updated using a standard trust-region algorithm [Sorensen1982]. The expectation in (3) is with respect to p(s_n | x_n, y_n), which can be calculated using the same Viterbi path integration approach described above.
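Under a linear link, the M-step above reduces to ordinary least squares with E[s_n] and E[s_n²] substituted for the unobserved covariates. A minimal sketch (the function name and interface are our own):

```python
import numpy as np

def linear_link_mstep(Es, Es2, y):
    """Closed-form M-step for a linear link y ~ N(a + b*s, sigma^2), using the
    expected summary sufficient statistics E[s_n] (Es) and E[s_n^2] (Es2)."""
    N = len(y)
    S1, S2 = Es.sum(), Es2.sum()
    # Normal equations with E[s] and E[s^2] in place of s and s^2.
    b = (np.dot(Es, y) - S1 * y.sum() / N) / (S2 - S1 ** 2 / N)
    a = (y.sum() - b * S1) / N
    # Expected residual variance E[(y - a - b s)^2], again via Es and Es2.
    sigma2 = np.mean(y ** 2 - 2 * y * (a + b * Es)
                     + a ** 2 + 2 * a * b * Es + b ** 2 * Es2)
    return a, b, sigma2
```

Note that E[s²] cannot be replaced by E[s]²: the variance of the summary under the posterior enters both the slope estimate and the residual variance.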
Full details of the EM inference are provided in the supplements. MCMC approaches, such as Gibbs sampling with embedded Metropolis-Hastings steps, could be applied to this model; however, directly sampling the hidden states may get trapped in local modes, rendering mixing arbitrarily slow, whereas the forward-backward calculations inherent to EM may alleviate this problem. Backpropagation may not be appropriate for two reasons. First, in our model the hidden variables generate both the observations and the responses, a structure for which backpropagation is not straightforward. Second, we wish to estimate a distribution over hidden state sequences rather than a point estimate. Variational inference methods such as mean field or variational autoencoders [Kingma and Welling2013, Fabius and van Amersfoort2014] might be possible, but at the cost of accuracy and implementation complexity. Benefiting from caching and dynamic programming, we achieve a per-iteration time complexity that is essentially the same as in Baum-Welch.
Experiments
Simulation studies
We conducted a simulation study in a PBM scenario, where we are interested in recovering the correct emission matrix and predicting responses for new observations. As a baseline, we considered a two-stage approach, in which the HMM parameters are first estimated with the Baum-Welch algorithm, followed by a separate regression step to estimate the link parameters. In our synthetic PBM data, 20,000 sequences and corresponding signals were generated from the generative model; 10,000 instances were used for training, while the remaining 10,000 were used for evaluation. A motif of length 5 with both sense and antisense representations was used. To make the task more challenging, the transition probability from the background state to the motif states was set relatively small, so that motif states were not enriched. We examined 6 different setups combining summary functions (all states, entering states, exiting states) and link functions (linear, tanh). For clarity, we only show the results of one setting (all states + linear), as the other 5 settings gave similar results; detailed results are provided in the supplements.
Table 1 shows that our method outperformed the two-stage baseline in three respects. First, the coupled model recovered parameters close to the truth, whereas the two-stage estimates diverged significantly. The PWM implied by the emission matrix is shown in Figure 3; RegHMM captured the emission and transition probabilities well even when the sequential observations were less informative. Note that we were only interested in the background-to-motif transitions. Second, the mean squared prediction error for held-out responses (prediction MSE) on test sequences was lower. Finally, as a generative model, the likelihood of unseen data (test log-likelihood) was higher. These results indicate that our model can bridge the information between sequential observations and real-valued responses, and leverage all of it to parse the sequences correctly; with a good understanding of the data, the algorithm can then predict reasonably well. Intuitively, the additional real-valued responses reinforce learning on the HMM side in two ways: by reweighting the contributions of different sequences, and by highlighting individual positions within each sequence, according to the consistency between predicted and observed responses. Allowing transitions to vary along the sequence yielded a slight improvement in performance (see supplements).
Table 1: Parameter recovery and predictive performance on synthetic PBM data.

                 Oracle    RegHMM    Two-stage
  (param. 1)     5         5.00      4.90
  (param. 2)     4         4.10      20.61
  (param. 3)     0         0.19      1.16
  (param. 4)     0
  Pred. MSE      0.962     0.969     1.355
  Test LL
We also simulated a case where the observed sequences cannot be distinguished from sequences generated by a null model, i.e., one in which the background state only transitions to itself. Remarkably, RegHMM recovered the correct motif pattern even when the HMM structure was seemingly absent, whereas the two-stage model failed (see supplements).
Real data analysis
We evaluated our algorithm on DREAM5, a publicly available dataset of PBM experiments used for a methods competition [Weirauch et al.2013]. The DREAM5 dataset consists of experimental data for 66 transcription factors under two array designs. The overall task is to use the probe sequences of one array type for training and to predict the real signal-intensity responses on the other type. Apart from the prediction task, the inferred motif patterns are also of interest.
We compared RegHMM with BEEML [Zhao and Stormo2011], Seed and Wobble [Berger et al.2006], RankMOTIF++ [Chen, Hughes, and Morris2007], FeatureREDUCE [Weirauch et al.2013], and DeepBind [Alipanahi et al.2015]. As described in Weirauch et al., all sequences were tailored to 35 bases to retain only unique sequences, and the original signal intensities were first transformed using a logarithmic function [Weirauch et al.2013]. For RegHMM, we considered both linear and tanh link functions and two summary functions: counting all motif states and counting motif entering states. We also compared models with and without position-wise transitions. The motif length for all scenarios was set to 6, and the emission matrix was initialized from the most frequent 6-mer.
Following Weirauch et al., the comparison was conducted using two metrics. The first is the Pearson correlation coefficient between predicted and observed responses, which assesses the predictive ability of each model on unseen queries. The second is the area under the receiver operating characteristic curve (AUROC), since the experimental data can be subject to noise and outliers. To align with the comparison done by Weirauch et al., positives were defined as the probes whose intensities are 4 standard deviations above the mean, with an enforced minimum of 50.
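These two metrics can be sketched as follows; the rank-based AUROC and the 4-standard-deviation positive rule follow the description above, while the function name and implementation details are our own:

```python
import numpy as np

def pearson_and_auroc(y_true, y_pred):
    """Pearson r, plus AUROC where 'positives' are probes more than 4
    standard deviations above the mean intensity (minimum of 50 enforced)."""
    r = np.corrcoef(y_true, y_pred)[0, 1]
    pos = y_true > y_true.mean() + 4 * y_true.std()
    if pos.sum() < 50:                         # enforce at least 50 positives
        pos = np.zeros(y_true.size, dtype=bool)
        pos[np.argsort(y_true)[-50:]] = True
    # Rank-statistic AUROC (Mann-Whitney U); ties ignored for brevity.
    ranks = y_pred.argsort().argsort() + 1.0
    n_pos, n_neg = pos.sum(), (~pos).sum()
    auc = (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return r, auc
```

A perfect predictor scores r = 1 and AUROC = 1; the AUROC is robust to monotone distortions of the predictions, which motivates reporting it alongside the correlation.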
As shown in Figure 4, RegHMM achieved average Pearson correlation and AUROC comparable to the other methods over the 66 experiments. We emphasize that our model directly infers a motif pattern, whereas K-mer based models parse sequences into features that do not explicitly explain the data; thus, our model predicts “reasonably” by simultaneously attempting to interpret the data. Interestingly, for several transcription factors (such as TF60 and TF66) reported to be hard to predict by other methods, our model achieved good predictions (Figure 5). Unexpectedly, the link functions learned for some transcription factors are monotonically decreasing, indicating that the number of motif instances is negatively correlated with the signal-intensity response, which may explain why other models that assume only positive associations fail. Possible explanations are that the discovered motif partly captures sequence biases of the PBM experiments, or that the assayed transcription factor is inclined to avoid binding certain motif patterns.
Another observation was that when we trained two models separately, one for each of the two distinct array designs, the inferred regression parameters showed slight discrepancies, but the learned HMM parameters corresponded reasonably well, suggesting good consistency between arrays. Moreover, we found that the transition probability from the background to the motif states (the positional bias defined in previous sections) decreased with positional index (see supplements), which is consistent with the knowledge that the free ends of probes favor binding.
We compared different link functions and summary functions. As shown in the supplements, using a tanh link function marginally outperformed a linear one. The choice of summary function was more TFspecific. Nevertheless, for most TFs, a summary function considering all motif states led to better performance.
Discussion
Motivated by a TF motif discovery challenge, we examined the problem of learning a regression-coupled HMM. We proposed RegHMM, a probabilistic generative model that learns hidden labelings of observed sequences by leveraging real-valued responses via a flexible set of link and summary functions. An approximate EM algorithm using Viterbi path integration is employed for fast and accurate inference. Experiments on simulated data showed that, by incorporating the additional responses, RegHMM can accurately label sequences and predict future observations in situations where inference from the sequential observations alone fails. The application to a real-world dataset showed that RegHMM can explicitly characterize protein-binding preferences while maintaining predictive power. The results suggest potential biological insights for further validation.
Our algorithm was designed to complement sequential data with real-valued summaries. However, using a softmax or Poisson link, our model could naturally be extended to handle categorical or count responses. Applications may include motif finding from in vivo data, where multiple tracks of count responses are available. In addition, task-specific summary functions can be engineered to reflect domain knowledge; for example, indicator functions of motif orderings may associate with certain signal responses.
Acknowledgments
We thank Ning Shen from Duke University for helping process the data. This research was supported in part by NSF, ARO, DARPA, DOE, NGA and ONR.
References

[Alipanahi et al.2015] Alipanahi, B.; Delong, A.; Weirauch, M. T.; and Frey, B. J. 2015. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology 33(8):831–838.
[Bailey and Elkan1994] Bailey, T. L., and Elkan, C. 1994. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Technical report, University of California at San Diego.
[Ben-Yishai and Burshtein2004] Ben-Yishai, A., and Burshtein, D. 2004. A discriminative training algorithm for hidden Markov models. IEEE Transactions on Speech and Audio Processing 12(3):204–217.
[Bengio and Frasconi1996] Bengio, Y., and Frasconi, P. 1996. Input-output HMMs for sequence processing. Neural Networks.
[Berg and von Hippel1987] Berg, O. G., and von Hippel, P. H. 1987. Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters. Journal of Molecular Biology 193(4):723–750.
[Berger and Bulyk2009] Berger, M. F., and Bulyk, M. L. 2009. Universal protein-binding microarrays for the comprehensive characterization of the DNA-binding specificities of transcription factors. Nature Protocols 4(3):393–411.
[Berger et al.2006] Berger, M. F.; Philippakis, A. A.; Qureshi, A. M.; He, F. S.; Estep, P. W.; and Bulyk, M. L. 2006. Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nature Biotechnology 24(11):1429–1435.
 [Bishop2006] Bishop, C. M. 2006. Pattern Recognition and Machine Learning. Springer Verlag.
[Chen, Hughes, and Morris2007] Chen, X.; Hughes, T. R.; and Morris, Q. 2007. RankMotif++: a motif-search algorithm that accounts for relative ranks of K-mers in binding transcription factors. Bioinformatics 23(13):i72–79.
[Collins2002] Collins, M. 2002. Discriminative training methods for hidden Markov models. In Proceedings of the ACL-02 Conference, 1–8. Morristown, NJ: Association for Computational Linguistics.
[Fabius and van Amersfoort2014] Fabius, O., and van Amersfoort, J. R. 2014. Variational Recurrent Auto-Encoders. arXiv.org.
[Fan et al.2015] Fan, K.; Eisenberg, M.; Walsh, A.; Aiello, A.; and Heller, K. A. 2015. Hierarchical graph-coupled HMMs for heterogeneous personalized health data. In KDD, 239–248.
[Kingma and Welling2013] Kingma, D. P., and Welling, M. 2013. Auto-Encoding Variational Bayes. arXiv.org.
 [Lawrence et al.1993] Lawrence, C.; Altschul, S.; Boguski, M.; Liu, J.; Neuwald, A.; and Wootton, J. 1993. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262(5131):208–214.
 [Leslie, Eskin, and Noble2002] Leslie, C. S.; Eskin, E.; and Noble, W. S. 2002. The Spectrum Kernel: A String Kernel for SVM Protein Classification. Pacific Symposium on Biocomputing 566–575.
[Mukherjee et al.2004] Mukherjee, S.; Berger, M. F.; Jona, G.; Wang, X. S.; Muzzey, D.; Snyder, M.; Young, R. A.; and Bulyk, M. L. 2004. Rapid analysis of the DNA-binding specificities of transcription factors with DNA microarrays. Nature Genetics 36(12):1331–1339.
[Noto and Craven2008] Noto, K., and Craven, M. 2008. Learning hidden Markov models for regression using path aggregation. In UAI, 444–451.
 [Orenstein and Shamir2014] Orenstein, Y., and Shamir, R. 2014. A comparative analysis of transcription factor binding models learned from PBM, HTSELEX and ChIP data. Nucleic Acids Research 42(8):e63–e63.

[Sak et al.2014] Sak, H.; Vinyals, O.; Heigold, G.; Senior, A. W.; McDermott, E.; Monga, R.; and Mao, M. Z. 2014. Sequence discriminative distributed training of long short-term memory recurrent neural networks. In INTERSPEECH, 1209–1213.
[Sorensen1982] Sorensen, D. C. 1982. Newton's method with a model trust region modification. SIAM Journal on Numerical Analysis 19(2):409–426.
[Srivastava et al.2007] Srivastava, P. K.; Desai, D. K.; Nandi, S.; and Lynn, A. M. 2007. HMM-ModE – improved classification using profile hidden Markov models by optimising the discrimination threshold and modifying emission probabilities with negative training sequences. BMC Bioinformatics 8(1):104.

[Taylor and Hinton2009] Taylor, G. W., and Hinton, G. E. 2009. Factored conditional restricted Boltzmann machines for modeling motion style. In ICML.
[Weirauch et al.2013] Weirauch, M. T.; Cote, A.; Norel, R.; Annala, M.; Zhao, Y.; Riley, T. R.; Saez-Rodriguez, J.; Cokelaer, T.; Vedenko, A.; Talukder, S.; DREAM5 Consortium; Bussemaker, H. J.; Morris, Q. D.; Bulyk, M. L.; Stolovitzky, G.; and Hughes, T. R. 2013. Evaluation of methods for modeling transcription factor sequence specificity. Nature Biotechnology 31(2):126–134.
[Zeiler et al.2011] Zeiler, M. D.; Taylor, G. W.; Sigal, L.; Matthews, I.; and Fergus, R. 2011. Facial expression transfer with input-output temporal restricted Boltzmann machines. In NIPS, 1629–1637.
 [Zhao and Stormo2011] Zhao, Y., and Stormo, G. D. 2011. Quantitative analysis demonstrates most transcription factors require only simple models of specificity. Nature Biotechnology 29(6):480–483.