A Bayesian Hidden Semi-Markov Model with Covariate-Dependent State Duration Parameters for High-Frequency Environmental Data

09/21/2021, by Shirley Rojas-Salazar, et al.

Environmental time series data observed at high frequencies can be studied with approaches such as hidden Markov and semi-Markov models (HMM and HSMM). HSMMs extend the HMM by explicitly modeling the time spent in each state. In a discrete-time HSMM, the duration in each state can be modeled with a zero-truncated Poisson distribution, where the duration parameter may be state-specific but constant in time. We extend the HSMM by allowing the state-specific duration parameters to vary in time and model them as a function of known covariates observed over a period of time leading up to a state transition. In addition, we propose a data subsampling approach given that high-frequency data can violate the conditional independence assumption of the HSMM. We apply the model to high-frequency data collected by an instrumented buoy in Lake Mendota. We model the phycocyanin concentration, which is used in aquatic systems to estimate the relative abundance of blue-green algae, and identify important time-varying effects associated with the duration in each state.



1 Introduction

Environmental time series data are measured at different frequencies but have been increasingly obtained at high temporal resolutions. These high-frequency data can be studied using a wide range of analyses. For example, li_multi-site_2021 presented a stochastic precipitation generator and applied it to high-frequency (30-second) rainfall data. lin_wind_2020 used high-frequency (1-second) supervisory control and data acquisition (SCADA) data to predict wind power with a deep neural network, incorporating the isolation forest method to identify anomalous data points.

High-frequency time series data in lakes provide another interesting example, and have been analyzed with a variety of models and statistical approaches. carpenter_stochastic_2020 studied the dynamics of cyanobacteria in Lake Mendota using a drift-diffusion-jump model. The model was applied to phycocyanin concentrations measured every minute for the years 2008 through 2018. They found that, for each of the years studied, the concentration of phycocyanin can be summarized with two stable states. Another example is the study by coloso_difficulty_2011, who examined the drivers of lake ecosystem metabolism. They fitted a multiple linear regression to high-frequency data from two temperate lakes. For each of the dependent variables (gross primary production, respiration, and net ecosystem production), different important drivers were identified, including temperature, wind speed, and photosynthetically active radiation.

Hidden Markov and hidden semi-Markov models provide an alternative approach for analyzing high-frequency environmental data. A hidden Markov model (HMM) consists of a sequence of unobserved discrete states and a set of observable random variables that are assumed conditionally independent given the state at each observed time point. The transition from one state to another depends on a transition probability, which is defined conditionally on the current state, and where the probability of self-transitioning (i.e., remaining in the same state) is non-zero. This non-zero probability implies that the time spent in each state follows a geometric distribution (yu_hidden_2010). However, this distributional assumption may not be realistic for some processes, making it necessary to additionally model the state duration. This extension defines the hidden semi-Markov model (HSMM) (yu_hidden_2016).

Environmental data have been modeled with HMMs and HSMMs. For example, rousseeuw_hybrid_2015 applied a hybrid HMM to model phytoplankton dynamics using data measured every 20 minutes at a marine station. They incorporated a spectral clustering method into their HMM modeling in order to build a fully unsupervised HMM.

stoner_advanced_2020 developed an HMM to analyze sub-daily rainfall data, and, through the use of simulations, were able to show that their model can capture characteristics such as long dry periods or seasonal variation. They also applied the model to a real dataset of hourly rainfall in Exeter, UK. Similar types of data have been analyzed with semi-Markov models and HSMMs. For example, king_semi-markov_2016 presented an extension of the Arnason-Schwarz model in which they define a semi-Markov model for the state process, and applied the model to capture-recapture data of house finches. sansom_spatial_2008 studied the spatial and temporal variation of rainfall with an HSMM using a high temporal resolution rainfall dataset collected in New Zealand.

When the duration in an HSMM is modeled with a Poisson distribution, the duration parameter, which can be different for each hidden state, is assumed to be constant in time. This assumption, however, might not be reasonable in all cases. If we consider, for example, hourly rainfall data observed over the course of a year, and we model it with two different states representing wet and dry episodes, we would expect the length of time of these episodes to be different depending on the time of year due to seasonal rainfall patterns (e.g., monsoon season).

To capture this temporal variation in the duration in each state, we extend the HSMM by modeling the duration parameters as a function of time-varying covariates. This enables the identification of factors associated with the time spent in the different states. For example, when there is a state transition, the duration parameter for the new state could be modeled as a function of covariates observed in the period leading up to the transition, or the value of the covariate at the moment right before the switch. The functional relationship between covariates and the parameters of the duration distribution can be state-specific, and modeling these relationships can provide important inference with regard to their extent and direction. Importantly, inference is not obtained at the high-frequency level at which the data are collected; rather, it is obtained in terms of the duration intervals.

In both HMMs and HSMMs, observations are described with an emission distribution and are assumed to be conditionally independent given the state, meaning they are independent of previous states and observations (pohle_selecting_2017; yu_hidden_2016). However, high-frequency data are more likely to be correlated. The violation of the conditional independence assumption can have dramatic impacts on statistical inference. pohle_selecting_2017 presented a simulation study to determine the effects of assumption violations on the selection of the number of states in HMMs, and determined that when the conditional independence assumption is violated, and the Akaike and Bayesian information criteria are used to select between models, the number of states will be overestimated.

Markov-switching regression models (MSR) and neural networks (NN) are two approaches that have been used when the conditional independence assumption is not met. In MSR, the observations are modeled as a function of covariates or with an autoregressive model (langrock_markov-switching_2017). In the context of NNs, ravuri_how_2016 applied a deep neural network HMM (DNN-HMM) and showed that deeper NNs compensate for the conditional independence assumption violation more than shallow NNs. In addition, dai_recurrent_2017 considered a recurrent hidden semi-Markov model (R-HSMM) that incorporates a recurrent neural network (RNN) in the observation model of an HSMM to accommodate more complex dependencies in the observation sequence.

The disadvantage of approaches such as NNs is that they are computationally expensive to implement. Thus, it is useful to consider a more computationally tractable approach for mitigating the conditional dependence. One approach that has not been considered in this context is data subsampling. Data subsampling is typically used to reduce computational cost or to determine the sampling distribution of a statistic. For instance, it has been used to increase efficiency with large datasets in Firefly Monte Carlo (FlyMC) (maclaurin_firefly_2015) and in subsampling Markov chain Monte Carlo (MCMC) (quiroz_speeding_2019). Both of these approaches present an MCMC sampling algorithm that considers only a subset of the data at each iteration. Experiments conducted to evaluate the performance of FlyMC found it to be more efficient than standard MCMC sampling in many instances. Similarly, quiroz_speeding_2019 showed that subsampling MCMC is often more efficient than standard MCMC and other competing subsampling algorithms. These studies suggest that, for an HSMM, random data subsampling can be introduced as part of the MCMC algorithm in an attempt to reduce or eliminate the conditional dependence in the data.

The goal of this paper is to develop an HSMM with time-varying duration parameters that are dependent on covariates for studying high-frequency environmental data. Specifically, we model high-frequency concentrations of phycocyanin, an estimate of the presence or relative abundance of cyanobacteria (blue-green algae), in Lake Mendota, Wisconsin. Understanding the temporal variation in cyanobacteria concentration in an urban lake is important to public health, since blue-green algae can produce toxins that are linked to illness in humans and animals. We use covariates to explain the variation in the duration in each state and obtain inference on important characteristics. Previous approaches using HMMs or HSMMs have included covariates in the observation model or in the specification of transition probabilities (e.g., koki_forecasting_2020; economou_mcmc_2014; titman_semi-markov_2010), but the inclusion of covariates in the model for state durations has not been considered. Additionally, we propose a data subsampling approach to mitigate the violation of the conditional independence assumption that is common in high-frequency data.

In Section 2 we present the phycocyanin data used in this analysis. Section 3 provides a description of the HSMM, which defines the duration distribution parameter as a function of covariates, as well as the details of a simulation study to investigate the effects of dependence in the observation sequence. The results of the model applied to the phycocyanin dataset are presented in Section 4. Lastly, Section 5 provides a discussion and conclusion, as well as future extensions.

2 High-frequency Lake Mendota data

The high-frequency environmental data in this application correspond to measurements from Lake Mendota, Wisconsin, in 2018. The data can be found in the North Temperate Lakes Long-Term Ecological Research program database (lead_pi_north_2020). Sensors in an instrumented buoy located in the lake recorded measurements every minute. In 2018, the buoy recorded observations from April 11 to November 15. The dataset includes several other variables, such as weather conditions (air temperature, relative humidity, wind speed, and wind direction) and lake characteristics (chlorophyll, photosynthetically active radiation, dissolved oxygen, etc.).

Phycocyanin is a pigment of cyanobacteria, and provides an estimate or diagnostic of their presence and concentration (carpenter_stochastic_2020). Given that high concentrations of cyanobacteria are a major public health concern, particularly in urban lakes such as Lake Mendota, it is essential to understand the temporal variation in concentration levels as well as possible environmental drivers of this variation. For each year of their study, carpenter_stochastic_2020 modeled the phycocyanin concentrations and identified two regimes of low and high phycocyanin, as well as abrupt transitions between the states. The objective of our analysis is to extend their work by describing the latent states of cyanobacteria as captured by phycocyanin concentration, the duration in each of those states, and the covariates associated with that duration.

Following the methods in carpenter_stochastic_2020, we consider the standardized levels of phycocyanin as our observation sequence. We first compute the maximum hourly measurement, which results in a total of 5232 observations (Figure 1). These maximum values of phycocyanin are measured in relative fluorescent units (RFU). They are standardized by being log transformed, centered, and scaled, and are then fitted using a dynamic linear model. See the supplementary information in carpenter_stochastic_2020 for more details.

In a study done in Lake Mendota, soranno_factors_1997 found that weather variables can impact the dynamics of algae at finer time scales. Considering this, the variables we examine as possible covariates to capture the time variation in the state durations are hourly average air temperature, wind speed, relative humidity, and photosynthetically active radiation (PAR). During the time period April 11 to November 15, 2018, the air temperature, measured in °C, ranged from -7 to 33, with a mean of 17. The average and maximum wind speed were 3.8 and 15 m/s, respectively. Relative humidity ranged from 21% to 100% with an average of 74%. Radiation was measured with a surface sensor in µmol m⁻² s⁻¹, and it ranged from 0 to nearly 2000.

3 Hidden semi-Markov model with covariate-dependent duration parameters

We begin by specifying the HMM and important notation. Then, we extend the HMM to the HSMM, develop the state-specific duration model as a function of covariates, and describe the methods for Bayesian inference. Lastly, we propose a simulation to investigate the implications of the violation of the conditional independence assumption.

3.1 Hidden Markov model and notation

An HMM includes two stochastic processes: one represents a Markov chain of states that are hidden, and the other generates a sequence of observations that are influenced by the unobservable states (rabiner_tutorial_1989; yu_hidden_2010). In a discrete-time HMM, the sequence of observations from time $1$ to $T$ can be denoted as $y_{1:T} = (y_1, \ldots, y_T)$. The corresponding sequence of unobserved states is denoted as $s_{1:T} = (s_1, \ldots, s_T)$, where $s_t \in \{1, \ldots, M\}$, $t = 1, \ldots, T$, and $M$ is the total number of unique states. The state at time $t = 1$ has a distribution defined by the initial probabilities $\pi_i = P(s_1 = i)$, $i = 1, \ldots, M$. The transition to the next state, $s_{t+1}$, is conditional on state $s_t$ according to the Markov property. In general, the transition probability matrix provides the probabilities of transitioning from one state to another when the state space is discrete and constant in time. The matrix $P$ has entries $p_{ij} = P(s_{t+1} = j \mid s_t = i)$, with $i, j \in \{1, \ldots, M\}$, where $p_{ij} \geq 0$ and $\sum_{j=1}^{M} p_{ij} = 1$.

Figure 1: Phycocyanin standardized levels in Lake Mendota, 2018. Panel A. Full period (mid April to mid November). Panels B and C show 5-day periods in June and November, respectively.

Observations are emitted by each of the states in the hidden sequence (Figure 2A) following a state-dependent probability distribution $f(y_t \mid s_t = i, \theta_i)$. Assuming the observation distribution is Gaussian, the parameters $\theta_i = (\mu_i, \sigma^2_i)$ correspond to the mean and variance for each state: $y_t \mid s_t = i \sim N(\mu_i, \sigma^2_i)$. The joint likelihood of the observations can be written as:

$$f(y_{1:T} \mid s_{1:T}, \theta) = \prod_{t=1}^{T} f(y_t \mid s_t, \theta_{s_t}),$$

and the likelihood of the Markov chain is:

$$f(s_{1:T} \mid \pi, P) = \pi_{s_1} \prod_{t=2}^{T} p_{s_{t-1} s_t}.$$

The complete likelihood of the Markov model is the joint likelihood of observations and states: $f(y_{1:T}, s_{1:T}) = f(y_{1:T} \mid s_{1:T}, \theta)\, f(s_{1:T} \mid \pi, P)$. In summary, an HMM with $M$ states and $T$ observations has a set of model parameters that includes the emission distribution parameters $\mu_i$ and $\sigma^2_i$, the initial distribution probabilities $\pi$, and the transition probability matrix $P$.
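As a concrete illustration of this generative process, the following sketch simulates a Gaussian-emission HMM (hypothetical NumPy code, not from the paper; the function and parameter names are ours):

```python
import numpy as np

def simulate_hmm(T, pi, P, mu, sigma2, rng):
    """Simulate T steps of a Gaussian-emission HMM.

    pi: initial state probabilities; P: transition matrix with non-zero
    diagonal (so sojourn times are implicitly geometric); mu/sigma2:
    state-specific emission means and variances.
    """
    M = len(pi)
    states = np.empty(T, dtype=int)
    states[0] = rng.choice(M, p=pi)
    for t in range(1, T):
        states[t] = rng.choice(M, p=P[states[t - 1]])
    # Observations are conditionally independent given the states.
    obs = rng.normal(mu[states], np.sqrt(sigma2[states]))
    return states, obs

rng = np.random.default_rng(1)
pi = np.array([0.5, 0.5])
P = np.array([[0.9, 0.1], [0.2, 0.8]])  # non-zero self-transition probabilities
mu, sigma2 = np.array([-2.0, 2.0]), np.array([0.5, 0.5])
states, obs = simulate_hmm(500, pi, P, mu, sigma2, rng)
```

Note that because each row of `P` keeps probability mass on the diagonal, the simulated run lengths in a state follow a geometric distribution, which is exactly the restriction the HSMM relaxes.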








Figure 2: State and observation sequences. Panel A. HMM: One observation is emitted by each state in the sequence. Panel B. HSMM: Several observations are emitted by each state; the number is determined by the duration in that state.

3.2 Hidden semi-Markov model

Figure 2B illustrates the HSMM, where instead of assuming there is only one observation per state, a sequence of observations is emitted. The number of observations depends on the amount of time spent in the state. Following the notation in economou_mcmc_2014, let $d_q$ represent the length of time that the sequence remains in a state before transitioning. These durations are labeled in Figure 2B as $d_1, \ldots, d_Q$, where $Q$ is the number of intervals or segments. For $q = 1, \ldots, Q$ we define $u_q = \sum_{r=1}^{q} d_r$ to be the cumulative duration in segments $1$ through $q$. Lastly, we define $g(d \mid \lambda_i)$ as the duration distribution for each state $i$, $i = 1, \ldots, M$, with parameter $\lambda_i$.

Similar to the Markov model, the likelihood of the semi-Markov model has two main components: the likelihood of the observations conditional on the states and the likelihood of the semi-Markov chain of states. The joint likelihood of the observations can be specified analogously to the HMM case, but is written incorporating the segment-specific notation:

$$f(y_{1:T} \mid s, d, \theta) = \prod_{q=1}^{Q} f(\mathbf{y}_q \mid s_q, \theta_{s_q}),$$

where $\mathbf{y}_q = (y_{u_{q-1}+1}, \ldots, y_{u_q})$ corresponds to the vector of all the observations in time interval $q$. The likelihood of the state sequence includes the distribution of the first state, the transition probabilities for the state switches, as well as the information from the duration times:

$$f(s, d \mid \pi, P, \lambda) = \pi_{s_1}\, g(d_1 \mid \lambda_{s_1}) \prod_{q=2}^{Q} p_{s_{q-1} s_q}\, g(d_q \mid \lambda_{s_q}).$$

Thus, the joint distribution of data, states, and durations of the hidden semi-Markov model can be written as:

$$f(y_{1:T}, s, d) = \pi_{s_1}\, g(d_1 \mid \lambda_{s_1})\, f(\mathbf{y}_1 \mid s_1, \theta_{s_1}) \prod_{q=2}^{Q} p_{s_{q-1} s_q}\, g(d_q \mid \lambda_{s_q})\, f(\mathbf{y}_q \mid s_q, \theta_{s_q}).$$

Note that we have added the duration distribution parameters of each state to the list of parameters of the HMM. Specifically, the set of model parameters of the HSMM includes $\mu_i$, $\sigma^2_i$, $\pi$, $P$, and $\lambda_i$, for $i = 1, \ldots, M$.
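The segment-wise structure above can be illustrated with a short generative sketch (our own illustrative code, not the paper's; the rejection sampler `rtpois` and the zero-diagonal transition matrix are assumptions consistent with a zero-truncated Poisson HSMM):

```python
import numpy as np

def rtpois(lam, rng):
    """Draw from a zero-truncated Poisson by rejection sampling."""
    d = rng.poisson(lam)
    while d == 0:
        d = rng.poisson(lam)
    return d

def simulate_hsmm(T, pi, P, lam, mu, sigma2, rng):
    """Segment-wise HSMM simulation: each visited state i emits a block of
    d ~ ZTPoisson(lam[i]) Gaussian observations. P has zero diagonal, since
    durations (not self-transitions) govern how long a state persists.
    The last segment may overshoot T; observations are trimmed to T."""
    states, durations, obs = [], [], []
    s = rng.choice(len(pi), p=pi)
    while sum(durations) < T:
        d = rtpois(lam[s], rng)
        states.append(s)
        durations.append(d)
        obs.append(rng.normal(mu[s], np.sqrt(sigma2[s]), size=d))
        s = rng.choice(len(pi), p=P[s])  # transition to a *different* state
    return np.array(states), np.array(durations), np.concatenate(obs)[:T]

rng = np.random.default_rng(7)
pi = np.array([0.5, 0.5])
P = np.array([[0.0, 1.0], [1.0, 0.0]])  # zero diagonal: no self-transitions
lam = np.array([30.0, 10.0])
mu, sigma2 = np.array([-2.0, 2.0]), np.array([0.5, 0.5])
states, durations, obs = simulate_hsmm(1000, pi, P, lam, mu, sigma2, rng)
```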

3.3 Use of covariates to model duration

Previous approaches have specified non-homogeneous HMMs and HSMMs by modeling the parameters of the emission distribution or the transition probabilities using covariates. We propose introducing non-homogeneity in the HSMM duration by letting the parameters of the state duration distribution vary in time as a function of covariates. If we let the duration distribution be a zero-truncated Poisson, we can define the duration parameter of the $q$th interval as a function of the covariate measurements observed prior to the transition at time $u_{q-1}$. Notice that this specification enables the duration parameter to be both state-specific and time-varying.

Let $X$ be a $T \times K$ covariate matrix with rows corresponding to times $1$ to $T$, where $K$ is the number of covariates. Let $\beta_i$ be a $(K+1)$-dimensional coefficient vector for state $i$ (accounting for an intercept in the model). Then the duration parameter for interval $q$, which we denote as $\lambda_{q, s_q}$, is a function of the covariate values observed up to time point $u_{q-1}$ (the first $u_{q-1}$ rows of $X$), and state-specific coefficients $\beta_{s_q}$. Here, $\lambda_{q, s_q}$ can take any functional form of the covariates as long as $\lambda_{q, s_q} > 0$. For example, we can write the function as:

$$\lambda_{q, s_q} = h\big(z(X, u_{q-1}), \beta_{s_q}\big) = \exp\big\{ z(X, u_{q-1})^\top \beta_{s_q} \big\},$$

where $h$ is a specified function that ensures $\lambda_{q, s_q} > 0$, and $z(X, u_{q-1})$ can be any function of the covariates observed from time 1 to the time previous to the transition, $u_{q-1}$. The joint distribution, which now includes the state-specific duration parameter function, can be written as:

$$f(y_{1:T}, s, d \mid X) = \pi_{s_1}\, g\big(d_1 \mid \lambda_{1, s_1}\big)\, f(\mathbf{y}_1 \mid s_1, \theta_{s_1}) \prod_{q=2}^{Q} p_{s_{q-1} s_q}\, g\big(d_q \mid \lambda_{q, s_q}\big)\, f(\mathbf{y}_q \mid s_q, \theta_{s_q}),$$

where $x_0$ denotes the initial values for the covariates used for the first segment, and $\beta$ is the matrix of $\beta$-coefficients with $M$ rows and $K + 1$ columns (the number of covariates plus an intercept):

$$\beta = \begin{pmatrix} \beta_1^\top \\ \vdots \\ \beta_M^\top \end{pmatrix}.$$

That is, $\beta_i$ is the row of $\beta$ that corresponds to state $i$. For example, when the state in interval $q$ is 1, then $\lambda_{q, 1} = h\big(z(X, u_{q-1}), \beta_1\big)$.
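Under the log-link example above, computing a duration parameter might look like the following sketch (illustrative code only; the choice of $z$ as a window mean of the covariates is one of many valid summaries, and the names are ours):

```python
import numpy as np

def duration_rate(X, u_prev, beta_state):
    """lambda_{q,s_q} = exp{ z(X, u_prev)' beta_state }, with z taken here
    as the covariate means over the rows leading up to the transition
    (one possible choice of the summary function z)."""
    z = X[:u_prev].mean(axis=0)          # summarize the first u_prev rows of X
    z = np.concatenate(([1.0], z))       # prepend an intercept term
    return np.exp(z @ beta_state)        # log link guarantees positivity

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))            # T x K matrix of standardized covariates
beta_state = np.array([3.5, -0.2, -0.2, 0.0, 0.3])  # intercept + K slopes
lam = duration_rate(X, u_prev=120, beta_state=beta_state)
```

Because the covariates enter through an exponential link, any real-valued coefficients yield a strictly positive Poisson rate, which is the only constraint the model places on $h$.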

3.4 Estimation of model parameters

Model inference can be obtained in a Bayesian framework using Markov chain Monte Carlo (MCMC) and a Metropolis-within-Gibbs sampling algorithm (see Appendix A for the detailed sampling algorithm). To complete the model specification, we assign diffuse priors to the model parameters. The means of the emission distribution are assigned independent Normal priors, and the variances are assigned inverse-Gamma priors. The initial probabilities, as well as each of the rows of the transition matrix, are assigned Dirichlet priors. The coefficient parameters in the model for the state-specific durations are assumed to be independent with Normal priors.

The posterior distribution of the states, durations, and remaining parameters of the HSMM can be summarized as:

$$p(s, d, \mu, \sigma^2, \pi, P, \beta \mid y_{1:T}, X) \propto f(y_{1:T}, s, d \mid X, \mu, \sigma^2, \pi, P, \beta)\, p(\mu)\, p(\sigma^2)\, p(\pi)\, p(P)\, p(\beta).$$

The state means and variances, initial probabilities, and transition probabilities can be sampled from their full conditionals using Gibbs updates, whereas a Metropolis step is needed for the duration distribution coefficients.

economou_mcmc_2014 provide an MCMC implementation of the HSMM using a forward algorithm to estimate the parameters, which alleviates the need to sample the state sequence in the process. However, our model requires sampling the states in order to obtain inference on the parameters of the duration distributions. The state sequence in an HSMM can be sampled with the Gibbs sampler presented in johnson_bayesian_2012, and we use it to sample the states in each iteration of our MCMC. Additionally, the MCMC algorithm has to be adjusted to account for the conditional dependency in the observations. This modification is the inclusion of a subsampling approach that is discussed in the next section.

3.5 Violation of the conditional independence assumption

We consider a simulation study to investigate the effect of violating the conditional independence assumption on the estimation of the emission distribution parameters. We simulate dependent data using an AR(1) model under several scenarios. Then, we fit an HSMM to each generated dataset using a data subsampling approach, where at each iteration of the MCMC algorithm an independent, random subset of the data is selected according to a specified sampling rate.

The different scenarios included in the simulation study are defined according to sample size, AR(1) autocorrelation parameter, and number of states. Three different numbers of states are considered (2, 3, and 4), while the possible values for the autocorrelation parameter are 0.25, 0.50, and 0.75, as well as a case with no autocorrelation. Although the number of simulated observations varies across realizations, we consider two cases consisting of approximately 2500 and 5000 observations. Overall, 24 scenarios were considered, each with 100 realizations.

The procedure for simulating the data is as follows. First, the initial state, $s_1$, is sampled from its distribution. Second, the first duration, $d_1$, is generated from the corresponding zero-truncated Poisson distribution. The observations emitted in that segment are generated from a Normal distribution with the specified autocorrelation. Then, the second state, $s_2$, is sampled conditional on $s_1$ according to the transition probability matrix $P$. After that, the second duration is sampled, as well as the observations emitted by that second state. The process continues until the specified sample size is reached.

Each of the simulated realizations is modeled with an HSMM assuming independence in the observations. For ease of computation, the states and other parameters are fixed and only the emission distribution parameters are estimated.
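The within-state observations with a specified autocorrelation can be generated as stationary AR(1) noise around the state mean, for example (an illustrative sketch under our own parameterization, not the authors' simulation code):

```python
import numpy as np

def ar1_segment(d, mu, sigma2, phi, rng):
    """Generate d observations for one state visit: AR(1) fluctuations
    around the state mean mu, with marginal variance sigma2 and lag-1
    autocorrelation phi."""
    innov_sd = np.sqrt(sigma2 * (1.0 - phi**2))  # keeps marginal variance fixed
    y = np.empty(d)
    y[0] = rng.normal(mu, np.sqrt(sigma2))       # draw from the stationary law
    for t in range(1, d):
        y[t] = mu + phi * (y[t - 1] - mu) + rng.normal(0.0, innov_sd)
    return y

rng = np.random.default_rng(3)
seg = ar1_segment(5000, mu=1.0, sigma2=0.25, phi=0.75, rng=rng)
# Empirical lag-1 autocorrelation should be close to phi
r1 = np.corrcoef(seg[:-1], seg[1:])[0, 1]
```

Scaling the innovation standard deviation by $\sqrt{1 - \phi^2}$ keeps the marginal variance at $\sigma^2$ for every value of $\phi$, so the scenarios differ only in their dependence structure, not in the emission distribution itself.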

We investigate the benefit of the subsampling approach under various sample sizes, autocorrelation parameters, and numbers of states. In this approach, each iteration of the MCMC algorithm uses a different random subset of data according to a pre-specified percentage. The sampling rates considered are 100%, 90%, 80%, …, 10%. The 90% credible intervals (CI) for the mean and variance parameters for all 100 datasets across all scenarios and sampling rates are then calculated. We assess the subsampling approach by comparing the empirical coverage, and determine the preferred data sampling rate as the one that results in the nominal coverage.
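A minimal sketch of how a subsampled update for a state mean could look (a conjugate Normal draw on a fresh random subset each iteration; illustrative code under the simplifying assumption that state assignments and the variance are held fixed, as in the simulation study):

```python
import numpy as np

def subsampled_mean_update(y, rate, sigma2, prior_mean, prior_var, rng):
    """One conjugate Gibbs draw of a state mean using a random subset of
    the observations emitted by that state. A new subset is drawn on
    every call, mirroring a new subsample at each MCMC iteration."""
    n_sub = max(1, int(rate * len(y)))
    idx = rng.choice(len(y), size=n_sub, replace=False)
    ys = y[idx]
    # Standard Normal-Normal conjugate update on the subsampled data.
    post_var = 1.0 / (1.0 / prior_var + n_sub / sigma2)
    post_mean = post_var * (prior_mean / prior_var + ys.sum() / sigma2)
    return rng.normal(post_mean, np.sqrt(post_var))

rng = np.random.default_rng(11)
y = rng.normal(2.0, 1.0, size=2000)      # observations emitted by one state
draws = np.array([subsampled_mean_update(y, 0.3, 1.0, 0.0, 100.0, rng)
                  for _ in range(500)])  # 30% sampling rate per iteration
```

Because only a fraction of the (autocorrelated) observations enters each update, the effective information per iteration is reduced, which is the mechanism by which subsampling restores closer-to-nominal interval coverage.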

The results of the simulation study are provided in Appendix B. Overall, the correlation in the data affects the estimation of the emission distribution parameters, but the effect can be reduced by subsampling during the model fitting procedure. We use this result for the phycocyanin example in Section 4.

4 Application to Lake Mendota phycocyanin data

The HSMM specified in Equation 5 was used to model the hourly maximum standardized levels of phycocyanin in Lake Mendota for the period April 11 to November 15, 2018. The duration (hours) in each state is modeled using a zero-truncated Poisson distribution, where the duration parameter is defined as a function of four covariates: temperature, wind speed, relative humidity, and PAR. Including the intercept term, this results in five coefficient parameters for each state. Following the notation in Equation 4, the duration parameter function in this application is defined as:

$$\lambda_{q, s_q} = \exp\big\{ \beta_{s_q, 0} + \beta_{s_q, 1}\, z_{1, q} + \beta_{s_q, 2}\, z_{2, q} + \beta_{s_q, 3}\, z_{3, q} + \beta_{s_q, 4}\, z_{4, q} \big\},$$

where $z_{1, q}, \ldots, z_{4, q}$ denote the values of the four covariates observed prior to the transition into interval $q$.
To determine the optimal subsampling rate for our analysis, we first ran the algorithm without subsampling and determined an approximate value for the size of the segments. The durations varied by state and through time, but the average duration ranged from 21 to 55. We divided the data into segments to resemble the groups of data emitted by a state and calculated the mean autocorrelation across all groups. Considering group sizes of 21 to 55 resulted in autocorrelation values that ranged from 0.653 to 0.797, which we then compared to the values in Table B.2. This comparison suggests a subsampling rate of approximately 30% for estimating the emission distribution parameters.
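The grouped autocorrelation calculation can be sketched as follows (illustrative code; for simplicity the segments here are equal-length consecutive blocks rather than the sampled state segments):

```python
import numpy as np

def mean_segment_autocorr(y, seg_len):
    """Split y into consecutive segments of length seg_len and average the
    lag-1 autocorrelation across segments, mimicking the comparison of
    grouped emitted data against the simulation-study table."""
    n_seg = len(y) // seg_len
    acs = []
    for k in range(n_seg):
        seg = y[k * seg_len:(k + 1) * seg_len]
        acs.append(np.corrcoef(seg[:-1], seg[1:])[0, 1])
    return float(np.mean(acs))

# Demonstration on a synthetic AR(1) series of the same length as the data
rng = np.random.default_rng(5)
phi = 0.7
e = rng.normal(size=5232)
y = np.empty(5232)
y[0] = e[0]
for t in range(1, 5232):
    y[t] = phi * y[t - 1] + e[t]
ac = mean_segment_autocorr(y, seg_len=40)  # group size within the 21-55 range
```

Note that lag-1 autocorrelation estimated within short segments is biased slightly downward, so the averaged value will sit a bit below the true $\phi$; the comparison against tabulated simulation values absorbs this.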

The number of states in HMMs and HSMMs has to be chosen a priori and is usually determined through information criteria or expert knowledge (liu_bayesian_2020). However, information criteria can lead to selecting a higher number of states because they under-penalize model complexity, as mentioned in celeux_selecting_2008 and in the discussion of spiegelhalter_bayesian_2002. In our case, the deviance information criterion (DIC) always selected the model with the largest number of states, motivating the need to consider other measures for model selection. Specifically, we considered a convergence diagnostic as an alternative for choosing the appropriate number of states. We obtained multiple independent parameter chains for models with 2, 3, 4, and 5 states and assessed convergence with the potential scale reduction factor (PSRF) and its multivariate equivalent (MPSRF) discussed in brooks_general_1998. The MPSRF values for the 2-, 3-, 4-, and 5-state models were 1.83, 1.08, 1.19, and 6.84, respectively. In addition, in the 3-state case the upper bound of the PSRF for each model parameter was less than 1.2. Thus, we selected the 3-state model for further analysis.
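The univariate PSRF for a single parameter can be computed in a few lines (a plain sketch of the diagnostic, not the authors' implementation; in practice the MPSRF and packaged implementations such as the coda R package would be used):

```python
import numpy as np

def psrf(chains):
    """Potential scale reduction factor for one parameter, given an
    (n_chains, n_iter) array of post-burn-in samples. Values near 1
    indicate the chains have mixed."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(2)
mixed = rng.normal(size=(4, 2000))           # four well-mixed chains
r = psrf(mixed)
```

For chains drawn from the same distribution, as here, the factor is close to 1; chains stuck in different modes (as can happen when the number of states is misspecified) inflate the between-chain variance and push the factor well above the usual 1.2 threshold.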

The MCMC algorithm was run for 24000 iterations. The first 4000 iterations were obtained using an adaptive random walk Metropolis algorithm for the duration distribution coefficients. These iterations were used to select the proposal variances for the random walk and then discarded. The remaining 20000 iterations were obtained based on these fixed proposal variances and these samples were used for parameter inference.

The posterior means and 95% credible intervals for the emission distribution parameters in each of the states are presented in Table 1. There is a clear distinction between the mean phycocyanin in each state since none of the credible intervals overlap. The three states represent low, medium, and high cyanobacteria states. The variability in the middle state is notably higher, and the wide range of this state can be seen in Figure 3. These low, medium, and high states of cyanobacteria can be associated with the regimes found in carpenter_stochastic_2020. Recall that they identified two stable states, which are comparable to our low and high states (S1 and S3), while the shifts between these two regimes correspond to our middle state (S2).

State Mean Variance
S1 -2.18 (-2.30, -2.02) 0.76 (0.64, 0.95)
S2 0.09 (-0.06, 0.31) 0.45 (0.35, 0.56)
S3 1.75 (1.65, 1.86) 0.45 (0.38, 0.54)
Table 1: Posterior mean and 95% CI of the emission distribution parameters
Figure 3: Phycocyanin standardized levels classified by latent states of cyanobacteria. Panel A. Full state sequence. For each time point, the state is the mode obtained among all iterations. Panel B zooms in on the period enclosed in the rectangle above. A percent stacked bar for each time point shows the relative distribution of the states sampled in the iterations.

The posterior probabilities of transitioning are shown in Table 2. Overall, it is more likely that there is a transition between adjacent states (e.g., 1 to 2) than a jump from 1 to 3. This is also true for transitions from higher to lower cyanobacteria states (e.g., 3 to 2). The transition between adjacent states is expected given that the middle state represents a passing state between regimes of cyanobacteria concentration.

Transition State 1 State 2 State 3
State 1 - 0.96 (0.86, 1) 0.04 (0, 0.14)
State 2 0.59 (0.43, 0.74) - 0.41 (0.26, 0.57)
State 3 0.09 (0, 0.28) 0.91 (0.72, 1) -
Table 2: Posterior mean and 95% CI for the state transition probabilities.

Table 3 provides the number of segments in each state as well as summaries of the duration parameters and durations for each state. The second state has more segments, yet the average duration parameter for this state is much smaller than for the other two states. The first and third states have fewer segments, but both the mean and the variation in the duration parameter are greater than for the second state. The variation in the duration parameters both within and between states signifies the importance of modeling the durations using covariates and state-specific parameters.

State # segments Duration parameter Duration (hours)
Mean Minimum Maximum Mean Minimum Maximum
S1 26 84.2 33.7 198.2 84 17 234
S2 42 33.2 14.7 65.9 33 1 93
S3 17 102.4 69.0 139.3 102.4 42 177
Table 3: Segments and duration parameter statistics

The posterior means and credible intervals for the coefficients of the state-specific duration distribution parameters are given in Table 4, with the significant coefficients presented in bold font. The coefficients of the duration distribution provide information about the average hourly duration in the states.

Air temperature is significant in capturing the variation in the duration in the low cyanobacteria state and is inversely related to duration. That is, when the temperature before a transition to the low state is warm, the duration in that state is shorter. Wind speed is also significant in the low cyanobacteria state. When wind speeds prior to a transition to the low cyanobacteria state are high, we anticipate a shorter duration in that state. Relative humidity is a significant predictor of duration in the second state. Lastly, the photosynthetically active radiation covariate is related to the first and second states. The positive coefficient for the first state indicates that when PAR is at higher levels before a transition to the low cyanobacteria state, we expect an increase in the duration. The relationship is inverse for the duration in the intermediate state of cyanobacteria concentration.

Variable State 1 State 2 State 3
Intercept 4.37 (4.22, 4.57) 3.43 (3.22, 3.64) 4.57 (4.26, 4.81)
Temperature -0.16 (-0.28, -0.02) 0.03 (-0.12, 0.20) 0.08 (-0.18, 0.30)
Wind speed -0.19 (-0.34, -0.03) 0.07 (-0.08, 0.21) 0.04 (-0.15, 0.20)
Relative humidity 0.04 (-0.21, 0.19) -0.37 (-0.57, -0.18) 0.11 (-0.08, 0.27)
PAR 0.32 (0.16, 0.48) -0.28 (-0.51, -0.07) -0.02 (-0.29, 0.23)
Table 4: Posterior mean and 95% CI of the duration parameter coefficients.

5 Discussion

An extension of the HSMM was presented and applied to a high-frequency environmental dataset. We introduced non-homogeneity into the duration distribution of the HSMM using time-varying covariates. In our example, the model enabled the characterization of cyanobacteria concentration in a lake. The three states for phycocyanin represent low, medium, and high levels, and their mean and variance were estimated.

Using a zero-truncated Poisson distribution for the duration (hours) in each state, we investigated the variation in time spent in each state as a function of time-varying covariates. Important differences were detected in the relationship between the duration of time in the states and the covariates. The variability of the duration parameters in the different segments supports the introduction of non-homogeneity in the HSMM for this application.

We extended the results obtained in carpenter_stochastic_2020 by modeling the duration in the regimes of cyanobacteria concentration. With a different modeling approach, we identified similar states of cyanobacteria levels and determined that weather covariates are associated with the duration in each of those states. Although not demonstrated here, this model could also be applied to predict the next states in the sequence and their expected durations.

The inference on duration obtained from this model can potentially be associated with ecological resilience. arani_exit_2021 propose measuring resilience, the maximum perturbation that a system endures without transitioning to another state, through life expectancy. They show how to fit a Langevin equation to time series data to capture the different forces that affect a system, and obtain the mean exit time from a state to quantify life expectancy. The information our model provides on the duration in a particular state before transitioning to another could likewise be explored as a measure of life expectancy.

Other approaches exist for introducing non-homogeneity into HSMMs. For example, the transition probabilities could also be modeled as functions of covariates. However, the focus of the analysis presented here was on capturing the variation in the duration of time in each state. Given that a direct transition from the low to the high state, or vice versa, is unlikely, it was not necessary to model the transition probabilities in terms of covariates for this application.

The subsampling approach used in the estimation of the means and variances helped reduce the effect of dependence in the observation sequence. Our simulation study provided guidance as to the level of subsampling necessary to properly account for uncertainty in the model parameters. An indirect result of the subsampling was reduced computation time. Since high-frequency data are becoming increasingly common, additional research will focus on accounting for conditional dependence in the data.
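The subsampling step itself is simple: at each pass, a random fraction of the observation sequence is retained for the emission-parameter updates. A minimal sketch (the fixed seed and toy series are illustrative):

```python
import numpy as np

def subsample(y, rate, rng):
    """Randomly keep a fraction `rate` of the observations, preserving time
    order, to weaken serial dependence before the emission updates."""
    n = len(y)
    keep = rng.choice(n, size=int(rate * n), replace=False)
    return y[np.sort(keep)]

rng = np.random.default_rng(42)
y = np.arange(100.0)          # stand-in for the observation sequence
y_sub = subsample(y, 0.6, rng)
print(y_sub.size)             # 60: only 60% of the points enter the Gibbs updates
```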

Acknowledgments

This research was funded by the Macrosystems Biology Program in the Emerging Frontiers Division of the Biological Sciences Directorate at the U.S. National Science Foundation (EF-1638554 and EF-1638550). Additional support was provided by NSF Award DMS-1811745 and by the Office of International Affairs and External Cooperation, University of Costa Rica.


Appendix A Sampling algorithm

Initial values

1: Define initial values for the parameters π, A, β, μ, and σ².
2: Generate the state sequence s as follows:
3: a: Sample the first state s₁ using π.
4: b: Calculate λ₁ using X and β_{s₁}.
5: c: Sample the duration d₁ from a zero-truncated Poisson with parameter λ₁.
6: d: Set the first d₁ time points of the sequence to state s₁.
7: e: Sample the next state s₂ conditional on s₁ using A.
8: f: Calculate λ₂ using X and β_{s₂}.
9: g: Sample the duration d₂ from a zero-truncated Poisson with parameter λ₂.
10: h: Set the next d₂ time points of the sequence to state s₂.
11: i: Continue until the sequence has length T.
12: Calculate initial values of μ and σ² from y and the generated state sequence.
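The generation of the initial state sequence in steps a-i amounts to forward-simulating the covariate-dependent HSMM. A minimal sketch (the rejection sampler for the zero-truncated Poisson and the two-state intercept-only example are our own illustrative choices):

```python
import numpy as np

def ztpoisson(lam, rng):
    """Draw from a zero-truncated Poisson by rejection (adequate unless lam is tiny)."""
    while True:
        d = rng.poisson(lam)
        if d > 0:
            return d

def simulate_states(T, pi, A, beta, X, rng):
    """Steps a-i: sample a state, compute its duration rate from the covariates
    at the segment start, draw the duration, assign the state to that many
    time points, then transition according to A, until length T is reached."""
    states, t = [], 0
    s = rng.choice(len(pi), p=pi)               # a: first state from pi
    while t < T:
        lam = np.exp(beta[s] @ X[t])            # b/f: log-linear duration rate
        d = min(ztpoisson(lam, rng), T - t)     # c/g: truncate at the horizon
        states.extend([s] * d)                  # d/h: fill the segment
        t += d
        s = rng.choice(len(pi), p=A[s])         # e: next state from row s of A
    return np.array(states)

# Two-state example with intercept-only duration models.
rng = np.random.default_rng(7)
pi = np.array([0.5, 0.5])
A = np.array([[0.0, 1.0], [1.0, 0.0]])   # HSMM convention: no self-transitions
beta = np.array([[1.5], [2.0]])          # log duration rates per state
X = np.ones((50, 1))                     # constant covariate (intercept only)
seq = simulate_states(50, pi, A, beta, X, rng)
```

Zero diagonal entries in A reflect the HSMM convention that self-transitions are absorbed into the duration distribution.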


1: for iteration m = 1, ..., M do
2:     Update π using Gibbs sampling:
           π | s ~ Dirichlet(a₁ + 1(s₁ = 1), ..., a_K + 1(s₁ = K)),
       where 1(·) is the indicator function.
3:     Update the i-th row of A, for i = 1, ..., K, using Gibbs sampling:
           A_i | s ~ Dirichlet(a_{i1} + n_{i1}, ..., a_{iK} + n_{iK}),
       where n_{ij} is the total number of transitions from state i to state j.
4:     Update β_k, for k = 1, ..., K, using random-walk Metropolis. Sample ε ~ N(0, τ² I_p), where I_p is an identity matrix of order p, and define the proposal vector β_k* = β_k + ε. Calculate the Metropolis ratio r as the ratio of the posterior evaluated at β_k* and at β_k, and if u < min(1, r), with u ~ Uniform(0, 1), let β_k = β_k* and update the duration parameters λ accordingly.
5:     Update the state sequence s and the durations d with the sampler in johnson_bayesian_2012 for the finite HSMM.
6:     Subsample the observations: randomly sample the observations y according to the specified sampling rate, and use the new observation vector of size n_s in steps 7 and 8.
7:     Update μ_k, for k = 1, ..., K, using Gibbs sampling and the subsampling approach, drawing from the normal full conditional given the subsampled observations emitted by state k, where n_k is the number of those observations.
8:     Update σ²_k, for k = 1, ..., K, using Gibbs sampling and the subsampling approach.
9:     Save the current values of π, A, β, μ, σ², s, and d.
10: end for
Algorithm 1: MCMC sampling algorithm
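The random-walk Metropolis update for the duration coefficients (step 4) can be sketched as follows. The zero-truncated Poisson likelihood over observed segment durations, the N(0, 10²) prior, and the toy data are our own illustrative assumptions:

```python
import numpy as np

def log_post(beta, durations, X, prior_sd=10.0):
    """Log posterior (up to a constant) for one state's duration coefficients:
    zero-truncated Poisson likelihood for the observed segment durations,
    with independent normal priors on the coefficients."""
    eta = X @ beta                      # log duration rate for each segment
    lam = np.exp(eta)
    # ZTP log-pmf: d*log(lam) - lam - log(1 - exp(-lam)), dropping log(d!)
    loglik = np.sum(durations * eta - lam - np.log1p(-np.exp(-lam)))
    return loglik - 0.5 * np.sum(beta ** 2) / prior_sd ** 2

def rw_metropolis(beta, durations, X, step, rng):
    """One random-walk Metropolis move: propose beta* = beta + eps with
    eps ~ N(0, step^2 I) and accept with probability min(1, r)."""
    proposal = beta + rng.normal(0.0, step, size=beta.shape)
    log_r = log_post(proposal, durations, X) - log_post(beta, durations, X)
    if np.log(rng.uniform()) < log_r:   # log-scale accept/reject
        return proposal
    return beta

# Toy data: segment durations generated from a known coefficient vector.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
true_beta = np.array([1.0, 0.2])
durations = np.maximum(rng.poisson(np.exp(X @ true_beta)), 1)

b = np.zeros(2)
for _ in range(500):
    b = rw_metropolis(b, durations, X, step=0.1, rng=rng)
```

Working on the log scale avoids overflow in the ratio r, and the symmetric normal proposal means the proposal densities cancel.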

Appendix B Simulation study results

The simulation study was conducted for 2, 3, and 4 states. The empirical coverage is calculated as the percentage of 90% CIs that captured the true parameter values. Tables B.1 to B.3 present the empirical coverage for the different autocorrelation parameters, sample sizes, and sampling rates considered; this coverage is the average coverage across the state means. The empirical coverage is similar for the two sample sizes and across the numbers of states.

The tables show two important results. First, the more correlated the data are, the worse we do at recovering the true parameter values, as indicated by the first row of each table, where no subsampling was used. Second, the empirical coverage increases as the sampling rate decreases: as we reduce the fraction of data used, we reduce the dependence in the data and thus improve our estimates of uncertainty. When the autocorrelation is approximately 0.75 and the whole dataset is utilized (sampling rate = 100%), the coverage is low, indicating the need for a smaller sampling rate. With data having autocorrelation close to 0.25, we obtain nominal coverage of the emission distribution means when using approximately 80% of the data points in the MCMC iterations.
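The empirical coverage computation described above can be sketched as follows; the calibration toy data (posterior draws centered on noisy point estimates) are our own illustration:

```python
import numpy as np

def empirical_coverage(chains, truth, level=0.90):
    """Fraction of simulation replicates whose equal-tailed credible
    interval contains the true parameter value.

    chains: (n_reps, n_draws) array of posterior draws of one parameter.
    """
    alpha = (1.0 - level) / 2.0
    lo = np.quantile(chains, alpha, axis=1)       # per-replicate lower bound
    hi = np.quantile(chains, 1.0 - alpha, axis=1) # per-replicate upper bound
    return np.mean((lo <= truth) & (truth <= hi))

# Well-calibrated toy replicates: each posterior is centered on an estimate
# that varies around the truth with the same spread as the posterior itself.
rng = np.random.default_rng(0)
centers = rng.normal(0.0, 1.0, size=(500, 1))
draws = centers + rng.normal(0.0, 1.0, size=(500, 2000))
cov = empirical_coverage(draws, 0.0)
print(round(cov, 2))  # roughly 0.90, the nominal level
```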

Sampling rate (%)   n = 2500                n = 5000
Autocorrelation     0.25   0.50   0.75      0.25   0.50   0.75
100 78 68 44 76 68 55
90 83 73 50 81 74 60
80 87 80 54 86 76 66
70 92 84 62 90 81 72
60 96 85 67 93 84 77
50 98 88 70 94 88 81
40 98 93 76 97 94 86
30 100 96 88 100 98 91
20 100 99 92 100 100 96
10 100 100 100 100 100 100
Table B.1: Coverage percentage of the emission distribution means, 2 states case.
Sampling rate (%)   n = 2500                n = 5000
Autocorrelation     0.25   0.50   0.75      0.25   0.50   0.75
100 77 68 50 82 67 50
90 82 75 56 86 73 55
80 89 80 60 90 76 60
70 92 84 65 92 81 64
60 96 87 71 94 86 72
50 99 92 79 96 92 79
40 99 96 84 99 96 84
30 100 99 90 100 98 90
20 100 100 95 100 100 95
10 100 100 100 100 100 99
Table B.2: Coverage percentage of the emission distribution means, 3 states case.
Sampling rate (%)   n = 2500                n = 5000
Autocorrelation     0.25   0.50   0.75      0.25   0.50   0.75
100 80 66 52 80 67 53
90 86 70 57 86 72 56
80 89 75 61 89 78 58
70 94 81 66 93 83 62
60 96 85 69 95 86 69
50 98 90 75 97 90 76
40 100 96 82 98 95 83
30 100 99 87 99 98 90
20 100 100 94 100 100 96
10 100 100 99 100 100 99
Table B.3: Coverage percentage of the emission distribution means, 4 states case.