Log In Sign Up

Analyzing Population-Level Trials as N-of-1 Trials: an Application to Gait

Studying individual causal effects of health interventions is of interest whenever intervention effects are heterogeneous between study participants. Conducting N-of-1 trials, which are single-person randomized controlled trials, is the gold standard for their analysis. In this study, we propose to re-analyze existing population-level studies as N-of-1 trials as an alternative, and we use gait as a use case for illustration. Gait data were collected from 16 young and healthy participants under fatigued and non-fatigued, as well as under single-task (only walking) and dual-task (walking while performing a cognitive task) conditions. We first computed standard population-level ANOVA models to evaluate differences in gait parameters (stride length and stride time) across conditions. Then, we estimated the effect of the interventions on gait parameters on the individual level through Bayesian linear mixed models, viewing each participant as their own trial, and compared the results. The results illustrated that while few overall population-level effects were visible, individual-level analyses showed nuanced differences between participants. Baseline values of the gait parameters varied largely among all participants, and the changes induced by fatigue and cognitive task performance were also highly heterogeneous, with some individuals showing effects in opposite direction. These differences between population-level and individual-level analyses were more pronounced for the fatigue intervention compared to the cognitive task intervention. Following our empirical analysis, we discuss re-analyzing population studies through the lens of N-of-1 trials more generally and highlight important considerations and requirements. Our work encourages future studies to investigate individual effects using population-level data.


Varying Joint Patterns and Compensatory Strategies Can Lead to the Same Functional Gait Outcomes: A Case Study

This paper analyses joint-space walking mechanisms and redundancies in d...

Clarifying Selection Bias in Cluster Randomized Trials: Estimands and Estimation

In cluster randomized trials, patients are recruited after clusters are ...

Identification of causal intervention effects under contagion

Defining and identifying causal intervention effects for transmissible i...

Generalizing Randomized Trial Findings to a Target Population using Complex Survey Population Data

Randomized trials are considered the gold standard for estimating causal...

An Artificial Neural Network for Gait Analysis to Estimate Blood Alcohol Content Level

Impairments in gait occur after alcohol consumption, and, if detected in...

1 Introduction

In the majority of research studies, the focus lies on identifying average effects in a population of individuals, such as in large cohort studies or randomized controlled trials (RCTs). However, especially if there are heterogeneous individual effects, it can be of great interest to investigate associations on an individual level. Estimating and testing these individual effects pose challenges. One approach is to employ statistical or machine learning models to estimate individual effects from population-level studies, and different methods have been proposed in recent years, such as causal inference methods bica2021real; alaa2017bayesian; shalit2017estimating; Lee2018. As another approach, a new study can be designed with the specific aim of investigating individual-level effects. For this, the study design of the so-called N-of-1 trials has been established as the gold standard nikles_essential_2015. In N-of-1 trials, the effect of one or more interventions on individual persons is investigated by measuring the outcome of interest over time across alternating phases in which the interventions are applied. N-of-1 trials are, therefore, a specific form of single-person crossover RCTs lillie2011n; mirza2017history. As a third approach, which we propose in this study, population-level data can be re-analyzed through the lens of N-of-1 trials. For illustration, we focus on an application to gait.
Gait can be quantified using spatiotemporal parameters, such as stride length, stride time, speed, or cadence. These gait parameters provide crucial insights into a person’s health status. For example, gait speed has been associated with life expectancy stanaway2011fast, and has been termed ”the sixth vital sign” middleton2015. Similarly, greater variability from step to step has been identified as a major intrinsic risk factor for falls in older adults HAUSDORFF20011050. Although gait is typically regarded as an isolated and highly automatic task, evidence suggests that gait patterns differ when concurrently performing a secondary task (e.g., cognitive or motor interference task). Such dual-task situations, which closely mimic daily life walking hillel2019every; BAYOT2018, have been associated with slower gait speeds and increased stride times  Ebersbach1995; Nohelova2021; SMITH2016; Montero-Odasso2012. Consequently, studying gait under these conditions may provide clinically relevant insights into gait modulations in daily life. A more comprehensive understanding of gait in real-life walking should also consider the aspect of physical fatigue because it knowingly affects gait kinematics and kinetics, which in turn is linked to a higher risk of slip-induced falls in healthy adults parijat2008effects. Similarly, for older healthy adults, lower limb fatigue affects spatial and temporal characteristics of gait santos2019effects, and leads to impaired movement control after overcoming an obstacle while walking hatton2013effect. Existing studies investigating the effects of muscle fatigue on gait performance reported heterogeneous outcomes. For example, for young healthy adults, Granacher et al. granacher2010effects observed significant decreases in gait speed and stride length, while Parijat et al. parijat2008effects reported no significant changes in gait speed. In older healthy adults, muscle fatigue only resulted in rather moderate changes in gait parameters santos2019effects. Regarding stride length, some studies reported an increase granacher2010effects; Barbieri2014; morrison2016walking, while others reported no changes hatton2013effect; TOEBES2014; Helbostad2007.
One possible explanation for the aforementioned discrepancy in effects of fatigue could be that the group-level analysis typically performed in gait studies did not capture the heterogeneous gait changes among individuals. It is known that gait characteristics are highly individualized and persist for a long period. As reported by Horst et al., classification accuracy for identifying 128 healthy individuals using their gait patterns sustained over 99% for one year  HORST2017. Moreover, there is evidence that gait changes in response to interventions are also individualized. For example, systematic gait training to modify footstrike patterns from rear-foot strike to mid-foot strike for runners resulted in heterogeneous changes of the footstrike angles chan2020effects, and the treatment response of gait patterns for neurological diseases such as Parkinson’s disease are also individualized marxreiter2018sensor; nonnekes2018towards. The highly individualized nature of gait and gait modifications suggests that individual-level analyses could provide insights that are not evident from population-level analyses. However, to the best of our knowledge, only one series of N-of-1 trials has been conducted on gait, in which Maguire et al. maguire2020replacing compared the effect of different walking aids on gait and balance for chronic stroke patients and revealed different responses to the new walking aid across the participants.
Here, we investigate how existing data from population-level studies can be re-analyzed through the lens of N-of-1 trial to estimate individual-level effects. To this aim, we use data from a population-based study in a repeated-measures design that investigated the effects of physical fatigue and cognitive task performance on gait. We estimate personalized gait parameters from Bayesian linear mixed models, and compare the results with standard population-level ANOVA models. Finally, we discuss re-analyzing population studies through the lens of N-of-1 trials more generally and highlight important considerations and requirements.

2 Materials and methods

2.1 Overview of the gait study

Sixteen young healthy participants (eight men, eight women) were enrolled in the study. Eligibility for the study was determined using the Physical Activity Readiness Questionnaire (PAR-Q) and only participants without medical restrictions for performing physical activities (i.e. with all negative responses) were allowed to take part in the study. At the first visit (see below), personal characteristics were assessed, including the daily activity levels of the participants using the International Physical Activity Questionnaire (IPAQ).

Figure 1: Study design. The study consisted of two visits which were seven days apart. The order of visits A and B was randomized and during each visit, the participants performed two 6-minute walking assessments using IMU sensors, separated by a fatigue protocol. During visit B, the participants performed a number-counting task while walking. ST: Single-Task (only walking), DT: Dual-Task (walking while counting numbers).

Figure 1 shows an overview of the study design. The study consisted of two visits, referred to as visits A and B in the following, which were seven days apart. The order of A and B was randomized, and during each of the two visits, the participants performed two walking assessments using IMU sensors, separated by a fatigue protocol. During visit A, the participant first completed a 6-minute walking assessment in a corridor with a distance of 35 meters in one direction. Then, the participant performed a repeated sit-to-stand all-out fatigue protocol to induce muscular fatigue in the lower limbs. Immediately following the fatigue protocol, participants repeated another 6-minute walking assessment. During visit B, the experimental procedure was the same as in visit A, except that a number-counting dual-task condition was added in both walking sessions, meaning that participants had to count numbers while walking. Details of the fatigue protocol and the cognitive task are described in Supplementary Text 1. In total, data from four walking sessions were collected for each participant: single-task control (ST-Control), single-task fatigue (ST-Fatigue), dual-task control (DT-Control) and dual-task fatigue (DT-Fatigue).

2.2 Statistical analyses

Descriptive statistics and population-level analyses

In a first step, we computed descriptive statistics of age, body weight, height and daily physical activity level of all participants. Next, we performed a population-level analysis using a two-way repeated measures ANOVA to serve as a reference for the comparison to the N-of-1 trial analyses. In ANOVA, we used stride length and stride time as outcomes and tested for the effect of physical fatigue and cognitive tasks, which were included as fixed factors. No further covariates were included in the model.

N-of-1 trial analysis using Bayesian mixed models

In our main analysis, we analyzed the data through the lens of N-of-1 trials and for each participant, we estimated the individual effects of the physical fatigue intervention and cognitive intervention on the gait parameters stride length and stride time. In contrast to typical N-of-1 trials with multiple crossovers, the data from our study consists of 4 blocks of repeated measures of the outcome gait parameters over the course of an intervention period for each participant, which is an experimental design that is typical of population-level studies granacher2010effects. In other words, the hundreds of gait measurements of each participant are contained in intervention blocks of either sequence ST-Control – ST-Fatigue – DT-Control – DT-Fatigue or of sequence DT-Control – DT-Fatigue – ST-Control – ST-Fatigue.

We used Bayesian linear mixed models to fit probabilistic models of the data distribution to the gait time series data for each participant, and separately for stride length and stride time. Such Bayesian models provide a probabilistic description of the data for interpretation makowski2019indices and allow to incorporate prior knowledge, which is not possible in conventional frequentist analyses. A model with first-order autoregressive (AR1) error structure was used which acknowledges that (for the same person) the covariance between errors from the observations may not be equal, and decreases towards zero with increasing lag, which matched well with the longitudinal stride parameters in our study deVries2013bayesian.

In more detail, we fitted the following model separately for each participant. Let denote the -th observation of a participant in the study, i.e., the observation at the -th timepoint. We assume a linear model , where is the () design matrix describing the fixed data structure, its row  = (, , , ) denotes observation , and

is a vector including the intercept

which is associated with the first combination of the groups in (i.e., the gait parameter of interest under ST-Control condition), as well as the changes of the gait parameter from ST-Control condition to the other conditions (denoted by , and ).

represents the error drawn from a multivariate normal distribution,

, where

is a variance-covariance matrix determined by the AR1 process, in which the exponent of the correlation declines linearly with the time lag



The Markov Chain Monte Carlo (MCMC) method with Gibbs sampling was used.Among all parameters of the model, the parameter of primary interest is the vector

. Combinations of its elements make up the mean gait parameter distributions for the four walking conditions (i.e., ST-Control, ST-Fatigue, DT-Control and DT-Fatigue). While informative prior distribution for the parameter can be directly inferred from studies on normal gait parameters of young healthy adults bernal2016reliability

, there is not enough information available to assume priors for the other parameters. As a result, we chose to use non-informative priors in the main analyses. We employed half-Cauchy distribution for

in the AR1 model as described in gelman2006prior, with default priors recommended by Gelman et al gelman2008weakly. In sensitivity analyses, we tested different informative priors, see section 2.2. In the sampling, we used 2 chains, 5000 burn-in steps, 1 thinning step (i.e. no thinning) and 10,000 iterations. To reduce the amount of computation, data used for the AR1 model were taken only from the left foot, and downsampled by a factor of five (i.e., selecting every fifth sample sequentially).

The convergence of the MCMC chain was confirmed with potential scale reduction factor (PSRF) and trace plots. A PSRF of 1 indicates chain convergence. Also, stable and uniform values over the iterations (i.e., a horizontal band with no particular patterns in the trace plots) for all sampling chains indicate convergence. MCMC chain resolution was evaluated using the effective sample size (ESS), which measures the efficiency of Monte Carlo methods such as MCMC martino2017effective

. Higher ESS indicates more information content, or effectiveness of the sample chain. More details on the MCMC diagnostics and on their results can be found in Supplementary Text 3. To confirm that the posterior estimates accurately represent the observed data, a posterior predictive check was performed by comparing the posterior distributions with the distribution of the observed samples. More specifically, the posterior distribution of the intercept and effects were used to re-construct the modeled distributions of gait parameters under the four conditions. These modeled distributions are then compared with the observed sample distributions using boxplots. The Bayesian analysis was performed using JAGS version 4.3.0, run from R version 4.1.1 (R Project for Statistical Computing). Formal specifications of the JAGS models can be found in Supplementary Text 2 and the R scripts used for running the analysis can be found at

Sensitivity analyses

To test how well alternative Bayesian models can estimate the posterior distribution, we implemented two additional models, a simple basic model and a time covariate model. In contrast to the AR1 model introduced in Section 2.2, both these models assumed that the errors are independent and identically distributed. As a basic model, we implemented a simple Bayesian ANOVA model with two fixed factors fatigue and cognitive task performance. Similar to the AR1 model, we assumed a linear relationship and normally-distributed errors, but we assumed here that each data point, namely, each stride from the same recording session, is independent of each other. As as second alternative model based on the basic model, we included a linear time trend by appending an incremental integer array to the design matrix. Apart from the linear time trend covariate, the model structure was identical to the basic model.

In further sensitivity checks, we compared models based on non-informative and informative priors for all three above-mentioned models. The investigated priors are summarized in Table 1. The distribution of informative priors was based on the corresponding gait parameter values reported for young healthy adults bernal2016reliability

which included a mean stride length of 1.36 m with standard deviation of 0.08 m; and mean stride time (estimated as doubled step time) of 1.05 s with standard deviation of 0.06 s.

Model Non-informative Priors Informative Priors (SL) Informative Priors (ST)
Basic &
Time Covariate
  • SL: stride length, ST: stride time

Table 1: Non-informative and informative prior distributions of the Bayesian models.

3 Results

3.1 Characteristics of study participants

In total, data from 16 participants (8 males, 8 females) were collected for the four walking conditions (ST-Control, ST-Fatigue, DT-Control and DT-Fatigue). The dataset consisted of 3117 strides pooled across all participants. Stride length and stride time from each stride were used as outcome variables in the analyses. The observations were balanced across the four walking conditions with 788 strides from ST-Control, 792 strides from ST-Fatigue, 766 strides from DT-Control and 771 strides from DT-Fatigue, across all participants. Also, within each participant, the numbers of strides were balanced under each of the four walking conditions. Table 2 summarizes the participant characteristics, and Table 3 summarizes the gait parameters.

Variable Mean  SD Min Max
Age 27.16  4.03 21 35
Body Mass (kg) 71.19  12.58 54 103
Height (cm) 173.78  8.85 157 190
Activity Level* 2 1 3
  • 1, 2, 3 means low, medium and high activity levels in IPAQ, respectively. Median is reported instead of Mean  SD, since data contain ordinal values.

Table 2: Participant characteristics.

3.2 Population-level analysis

Next, we performed baseline analyses to investigate the population-level effects of physical fatigue and cognitive task performance on gait. Two-way repeated measures ANOVA indicated very small effects induced by physical fatigue or cognitive task performance. The main effects of physical fatigue for both stride length and stride time had an effect size of 0.01 or less (stride length: F(1,15) = 5.86, p = 0.03,  = 0.01; stride time: F(1,15) = 2.56, p = 0.13,  = ). Main effects of cognitive task performance were moderate, with 0.15 and 0.20 for stride length and stride time, respectively (stride length: F(1,15) = 18.46, p = ,  = 0.15; stride time: F(1,15) = 21.14, p = ,  = 0.20). No significant interaction effects were found (p = 0.77 for stride length, and p = 0.99 for stride times). See Table 3 for a summary of the ANOVA results.

max width= ST-Control n=788 ST-Fatigue n=792 DT-Control n=766 DT-Fatigue n=771 Main Effect Control-Fatigue Main Effect ST-DT Stride Length Avg (m) F(1,15) = 5.86, p = 0.03,  = 0.01 F(1,15) = 18.46, p = ,  = 0.15 Stride Time Avg (s) F(1,15) = 2.56, p = 0.13,  =  F(1,15) = 21.14, p = ,  = 0.20

  • ST = Single Task, DT = Dual Task, Ctrl = Control, Fat = Fatigue, Avg = average. F = F-value of ANOVA, p = p-value of ANOVA,  = generalized eta squared (effect size). Summary of gait parameters are expressed as mean  standard deviation.

Table 3: Descriptive statistics and ANOVA results of the gait parameters from all participants.

3.3 N-of-1 trials using Bayesian linear mixed models

The posterior distributions for stride length and stride time are illustrated in Figure 2. See Supplementary File for a complete summary of the posterior distributions of parameters. The results showed that the baseline values of the gait parameters (under ST-Control condition) varied largely among all participants, and the gait changes under the four walking conditions were also highly heterogeneous among all participants. Figure 2 also shows the aggregated posterior distributions from all participants for reference, which further indicated that the aggregated population-level summary was not a good representation of the highly heterogeneous individual gait effects.

For stride length, there was a consistent trend among participants that the values from DT conditions were smaller than those from ST conditions, as seen in the population-level ANOVA main effect of cognitive task described above. Nevertheless, inter-person variation could be observed, for example, study participant 10 exhibited almost no response under the DT condition compared to the ST condition (mean values changed from 1.34 under ST to 1.33 under DT), whereas participant 2 largely reduced the stride length (mean values changed from 1.62 under ST to 1.35 under DT). In contrast, effects of physical fatigue on stride lengths were smaller on average but more complex on the individual level compared to those induced by the cognitive task, and opposite effects could be observed for different individuals. Especially under DT condition, stride length seemed to have increased from non-fatigue to fatigue condition for participant 6 (from 1.44 to 1.48), participant 14 (from 1.41 to 1.45) and participant 15 (from 1.45 to 1.48) but remained unchanged or decreased for the other participants. Overall, the variance of the posterior distribution was larger under DT condition compared to under ST condition. Moreover, the effects of fatigue were larger under DT condition compared to under ST condition for all participants. Similar trends could be observed for stride time, where the DT condition generally induced an increase for all participants, but the individual posterior distributions were heterogeneous and cognitive task performance seemed to have increased the variance as well as effects of fatigue for many participants. It is worth noting that the posterior estimates for participants 7 and 13 had unusually large variations compared to those for all other participants. Quality control analyses revealed that the MCMC chains did not converge for these two participants, more details are presented in section MCMC Chain Convergence in Supplementary Text 3.

Figure 2: Posterior estimations from the AR1 model for stride length and stride time. Top: Posteriors for stride length; bottom: posteriors for stride times.

To provide a qualitative overview of the heterogeneous gait changes under the four walking conditions for each participant, we computed the difference between each pair of conditions using mean values under the four walking conditions obtained from the posterior distributions. As illustrated in Figure 3, for great majority of the condition pairs, the gait changes vary in both magnitude and direction among all participants.

Figure 3: Heatmap of gait parameter differences between all combinations of condition pairs based on posterior estimations from the AR1 model. Each row represents difference of the gait parameters between one pair of conditions. Gait changes between condition pairs are heterogeneous among participants.

In the sensitivity checks, we investigated how the results might change when different regression models or different priors were used. Posterior distributions of the three models (AR1, basic, time covariate) were compared using the mean and standard deviation of the estimated model parameters. Overall, the AR1 and basic model had similar posterior distributions of parameters, and the time covariate model had slightly shifted distribution. The difference between ST-Control and DT-Control for stride length, as represented by , was similar in the AR1 and basic model and in observed data, and larger compared to the posterior estimate from the time covariate model (-0.09 from observed data, from AR1 model and from basic model, -0.06 from time covariate model). The difference between ST-Control and ST-Fatigue, as represented by , was negative in observed data and in the AR1 and basic model, but positive when estimated by the time covariate model (-0.03 from observed data, from AR1 model and from basic model, 0.03 from posterior estimate). Posterior estimates for stride times exhibited similar trends, in that the and estimates from the time covariate model were slightly different from those from the AR1 and the basic models. Comparison of models with different priors indicated that they had no meaningful influence on the sampling results. More detailed illustration of the effects of models and priors can be found in the Supplementary Figure S7. Hence, since no significant differences in the posterior estimates using non-informative and informative priors were found, we focused on reporting and discussing results from models using non-informative priors.

4 Discussion

In this study, we re-analyzed data from a gait study, which was originally designed as a population-based trial, as single N-of-1 trials. Despite the fact that the effects of cognitive task performance were generally larger than those of fatigue, some participants changed their gait patterns to a larger extent than the others. This is consistent with our prior knowledge that gait and gait responses to interventions are highly individual, and the posteriors for each individual allow further investigation into causes underlying these individual effects. In a real-life rehabilitation setting, by understanding under what conditions a person’s gait is especially susceptible to physical fatigue or cognitive challenges, targeted training or gait monitoring could be developed, thus reducing the risk of motor function-related injuries. Another interesting finding is that under the DT condition, the effects of fatigue seemed to have the same trend, but more pronounced compared to the ST condition for some individuals. This observation suggests that the cognitive task helped magnify the subtle effects of fatigue. In fact, dual-tasking has been used as an established paradigm to aid in the study of otherwise subtle motor function deficits in neurological diseases baetens2013gait. The great majority of studies involving the dual-task paradigm are on the population level; based on our observation that the dual-task effect did not manifest on all individuals, it would be interesting to explore what the constraints are, so that effective, personalized dual-task-based motor diagnosis can be developed. The above initial observations of the posterior estimates demonstrate that heterogeneous responses do exist in our cohort, and analyses based on N-of-1 trials are necessary to enable further in-depth investigation of the individual effects.

In typical N-of-1 trials, the effects of an intervention are studied by comparing data from treatment and control conditions from the same person. The study design can leverage many of the statistical and methodological concepts from randomized controlled trials to model multiple crossovers and time-dependent effects, including randomization, wash-in and washout periods to avoid carry-over effects, and placebo controls. The dataset used in our study was unconventional in the sense that it was obtained from a study originally designed for population-level analyses: instead of the multiple crossovers for typical N-of-1 trials, each participant was measured only in one session for the baseline, underwent the intervention once, and the intervention effects were measured in a subsequent session. Hundreds of data points (gait cycles) were measured under both conditions, which are sufficient for statistical analysis on a single person. Nevertheless, several assumptions were made to enable the analysis of the data through the lens of N-of-1 trials. For example, we assumed that the one-week break between the ST and DT visits did not induce any effect on the gait characteristics of an individual. Only in this case, the effect observed from DT condition could be attributed to cognitive task performance and not with time as a confounder. We based our assumption on evidence that an individual’s gait characteristics are persistent over a long period of time HORST2017. The one-week break can be considered a washout period, where the effects of fatigue exercise from the previous visit are sufficiently removed. For other types of outcomes that fluctuate over time or are more sensitive to uncontrolled factors, the effect of time between visits should not be neglected.

Another assumption was made to circumvent the lack of within-person randomization in typical N-of-1 trials. The order of ST and DT visits was randomized among all participants, such that exactly half of the participants performed the cognitive task during the first visit. However, with only one crossover, the order was fixed for the same person. In addition, the order of the control and fatigue conditions during both visits was fixed as well. In our analysis, we assumed that there were no interaction effects between the person and the order of the walking conditions. However, it is possible that carryover effects exist. In that case, the design does not isolate the effect from the intervention for the individual. As one approach, the carryover effects could be modeled in the analysis to still allow efficient and unbiased estimation of the effects 

gartner2022comparison. Such challenges would still be present when aggregated N-of-1 trial analyses or the traditional population-level ANOVA analysis are performed, but they might profit when there is some level of randomization (e.g., in our case, the randomization for ST and DT visits among the participants). As future work, additional crossovers between the fatigue and dual-task conditions can be added into the study design, in order to introduce randomization within one person.

In our analyses, informative and non-informative priors did not result in different posteriors for the same model, as illustrated in Supplementary Figure S7. One possible reason could be that the values of the non-informative priors were similar to those of the informative priors for both stride length and stride time. Only prior knowledge of the mean and standard deviation for the baseline (ST-Control) was introduced, and these values (centered around 1 with very small standard deviations) were close to the non-informative priors (centered around 0 with standard deviation of ). It is therefore worth emphasizing that for use cases where the informative priors differ largely from non-informative default priors, more detailed prior predictive check should be performed and an informed choice of priors might effect the posterior estimates to a larger degree. Visualizations similar to our Supplementary Figure S7 are helpful for qualitative comparison between different priors.

Posterior estimates of the model parameters from the AR1 and basic models matched the distributions of the observed values, whereas slight deviations could be observed for posterior estimates of the time covariate model. Moreover, for four participants (#2, 9, 12, 18), the MCMC chains did not converge for the model parameter which was associated with the linear representation of time in the design matrix. In our opinion, these observations indicate that the assumption of a linear effect of time with the time covariate model does not accurately represent the data. As discussed by Heckenstenden et al. hecksteden2015individual, repeated measurements during the course of a single uninterrupted intervention period could be used as a surrogate for repeated interventions, however, it is reasonable to assume autocorrelation between measurements, and non-linear adaptation may occur during the measurement period. Our study indicates that in such settings, the effects of time is more appropriately modeled with the temporal autocorrelation of the samples described by the AR1 model. It is worth noticing that the MCMC chains failed to converge for the AR1 models for two participants (#7, 13) for stride length. During data collection, we observed that the general gait patterns of these two participants were particularly affected by the interventions, and initial data exploration revealed a large variability in gait parameters. We assume that the true data distributions of their gait parameters are different from those of the other participants, and the true relationship between the variables are non-linear. In this case, a more flexible model with a non-linear structure could be better suited for analysis.

Based on the posteriors obtained from the Bayesian analyses and the assumptions discussed above, the effect sizes of interventions can be estimated and further investigated kelter2020analysis. Aggregated N-of-1 trials analyses can be performed to investigate the underlying causes of the personalized responses to intervention  gartner2022comparison. In our study, the different responses could potentially be associated with the participants’ sex, anthropometric features or stable lifestyle habits (e.g., as measured by the IPAQ questionnaire), or a combination of all these factors. In future studies, more heterogeneous cohorts with a larger variety of age or pre-existing health conditions, genetic endowment or epigenetic modifications will provide additional features for analysis. Based on these findings, personalized advice or treatment could reduce the risk of falls or injury for vulnerable individuals.

5 Conclusion

Our study provides an example of how to initiate an in-depth investigation of treatment effects on an individual level using data from population-level studies. We demonstrate the use of Bayesian models to study individual-level effects of interventions, and point out aspects to consider for future studies.

Data Availability Statement
The data and scripts used for running the analysis can be found at

Conflict of Interest
The authors have declared no conflict of interest.

The authors would like to thank all participants in this study. We would also like to thank Urs Granacher and Clemens Markus Brahms for their support in developing the study. This study received funding from the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – project number 491466077, and has been partly funded by the Federal Ministry of Education and Research of Germany in the framework of KI-LAB-ITSE (project number 01IS19066).

Supplementary Materials

Supplementary Text 1: Details of research methods

Fatigue protocol and cognitive task

During the fatigue protocol, the participants wore a weighted vest matched to 30 % of their body weight, repeatedly stood up from a chair at a self-selected, rapid pace until failure. Fatigue was assessed using the Borg Rating of Perceived Exertion (referred to as the Borg scale in following text) borg2006comparison, as well as blood lactate concentration measured using blood sampled from the earlobe finsterer2012biomarkers.

The cognitive task required participants to serially subtract the number seven from a 4-digit number between 3000 and 9000 provided by the experimenter, which was randomly generated anew for each participant. Participants were asked to verbalize the numbers (e.g., 3745, 3738, 3731, […]), so that the answers could be documented with the audio recorder and analyzed later. In order to minimize learning effects, participants practiced the dual-task 6-minute walking session one time before the actual recording trial with the exact same setup.

The study was approved by the ethics committee of the University of Potsdam (63/2020), and all experiments were conducted according to the latest revision of the declaration of Helsinki. All participants provided written consent prior to data collection.

IMU gait analysis

Two IMUs (Physilog®5, Gait Up, Switzerland) were attached to left and right insteps of the participants. Triaxial acceleration and angular velocity data were recorded at 128 Hz, the acceleration- and gyro-ranges were  g and  dps, respectively. An audio recorder was attached close to the left collar bone in order to document the responses of the cognitive tasks recorded under the dual task condition.

Stride length and stride time of all strides from the walking sessions were extracted from the raw IMU data using an error-state Kalman filter based algorithm, which utilizes zero-velocity update (ZUPT) to correct for accumulating drift errors caused by inertial integration. The algorithm has been described in detail, and validated against gold standard reference systems in previous studies 


. Briefly, triaxial acceleration and angular velocity signals were used as inputs, the stance phases were identified with gyro magnitude threshold, and an error-state Kalman filter was employed to periodically correct the drift during stance phases, thus enabling estimation of the three-dimensional trajectories of the feet movement. The gait events (toe-off, initial contact) were identified using signal features in in the gyroscope data. Temporal gait parameters, such as swing time and stance time, were derived directly from the gait events. Spatial gait parameters, such as stride length and clearance, were obtained by segmenting the trajectories into individual strides. Turning strides at the ends of the walkway were excluded based on a threshold on the change in foot orientation. Acceleration and deceleration phases were excluded by removing two strides before and after the actual turning strides. Additional outlier strides were defined as those whose gait parameter values were larger or smaller than three standard deviations around the mean (effectively 0.3% of the data), and were excluded from further analyses.

Supplementary Text 2: Details on Bayesian models

We used Bayesian linear mixed models to fit probabilistic models of the data distribution for the four walking conditions. They provide a probabilistic description of the data for interpretation makowski2019indices. This is in contrast to the conventional frequentist modeling, which produces a p-value which is widely misinterpreted when determining whether the hypothesis studied is true wasserstein2020asa

. The Markov Chain Monte Carlo (MCMC) method with Gibbs sampling was used to estimate parameters describing the data distribution. MCMC performs Monte Carlo integration by drawing samples from a probability distribution with the construction of a Markov chain, where the value of the current variable is dependent on the value of the previous variable in the chain. With sufficient numbers of steps, the sampled values will converge to the value which is being inferred 

geyer1992practical. Gibbs sampling is one of the most commonly used MCMC algorithms. It draws samples for each parameter from the full conditional distributions of that parameter smith1993bayesian.

The basic model with non-informative priors is specified as follows. (More details can be found, for example, in a tutorial on factorial ANOVA implementation using JAGS222, last retrieved on 2022-08-12.)

modelString = "
    model {
      for (i in 1:n) {
      mean[i] <- inprod(beta[],X[i,])
      beta[1] ~ dnorm(0,1.0E-3)
      for(i in 2:ngroups) {
      beta[i] ~ dnorm(0,1.0E-3)
      sigma ~ dunif(0, 100)
      tau <- 1 / (sigma * sigma)

The primary outcome of this study was the distribution of gait parameters under the four walking conditions, which can be derived from the model parameters. The JAGS program used in this work uses a dialect of the BUGS modeling language. In BUGS language, the normal distribution is parameterized in terms of precision (tau), which is the inverse of variance (sigma squared) plummer2017jags. In the model string, the parameter n in the likelihood model represents the total number of data points used for the simulation, whereas the ngroups in priors represents total number of elements in the beta vector, namely, the number of columns in the design matrix X.

The time covariate model was the same as the basic model, except that the design matrix X had an additional column with incremental integers for each walking condition, which represents the time component. The AR1 covariance model with non-informative priors was defined as follows. The definition of the half-Cauchy distribution was adopted from Gelman gelman2006prior:

modelString = "
    model {
        for (i in 1:n) {
        mean[i] <- inprod(beta[],X[i,])
        y[1:n] ~ dmnorm(mean[1:n],Omega)
        for (i in 1:n) {
        for (j in 1:n) {
        Sigma[i,j] <- sigma2*(1- phi*phi)*(equals(i,j) + (1-equals(i,j))*pow(phi,abs(i-j)))
        Omega <- inverse(Sigma)

        phi ~ dunif(-1,1)
        beta[1] ~ dnorm(0,1.0E-3)
        for(i in 2:ngroups) {
        beta[i] ~ dnorm(0,1.0E-3)
        sigma <- z/sqrt(chSq)    # prior for sigma; cauchy = normal/sqrt(chi^2)
        z ~ dnorm(0, 0.16)I(0,)  # positive part of normal distribution, Cauchy scale = 2.5
        chSq ~ dgamma(0.5, 0.5)  # chi^2 with 1 d.f.
        sigma2 = pow(sigma,2)

Supplementary Text 3: Details on quality control

MCMC chain convergence

Convergence of the MCMC chain was confirmed with visualization using trace plots and the convergence statistic potential scale reduction factor (PSRF). The trace plot displays sampled values over number of iterations for each chain and each model parameter. Stable and uniform patterns (i.e., a horizontal band with no particular patterns) for both chains indicate convergence. The PRSF is an estimated factor by which the current distribution of the parameter might be reduced if the simulations were to continue for an infinite number of iterations gelman1992inference. The PSRF plot shows the median and upper confidence limits (confidence = 0.95) against the number of iterations. An upper limit close to 1 indicates approximate convergence, as the current distribution is no longer over-dispersed in respect to the target distribution. In our study, both the trace plots and PRSF confirmed chain convergence for all simulations for stride length. Supplementary Figures S1 and S2 show example trace plots and PSRF plots of converged chains, respectively. For stride time, the chains from the AR1 models did not converge for participants sub_07 and sub_13 for all model parameters. Chains from the time covariate models did not converge for participants sub_02, sub_09, sub_12 and sub_18 for the parameter , which was associated with the time component in the design matrix. Examples of trace plots and PRSF for non-convergence can be found in Supplementary Figures S3 and S4.

Supplementary Figure S1: Example trace plots and density plots for MCMC chain convergence diagnosis for the AR1 model with data from sub_01 stride length.
Supplementary Figure S2: Example Potential scaling reduction factor (PSRF) plots for MCMC chain convergence diagnosis for the AR1 model with data from sub_01 stride length.
Supplementary Figure S3: Example trace plots and density plots for MCMC chain convergence diagnosis for the AR1 model with data from sub_01 stride length. The plots of chains that are not converged are similar for other models and their parameters described in this study.
Supplementary Figure S4: Example Potential scaling reduction factor (PSRF) plots for MCMC chain convergence diagnosis for the AR1 model with data from sub_01 stride length. The plots of chains that are not converged are similar for other models and their parameters described in this study.

MCMC chain resolution

The resolution of the MCMC chain was measured with effective sample size (ESS). A higher ESS indicates more information content, or higher effectiveness of the sample chain. In cases where observed data samples are highly autocorrelated, the ESS might be relatively small compared to the total sample size. Supplementary File shows the ESS for each model and parameter (n.eff).

Posterior predictive check

Figure S5 and Figure S6 show that for both stride length and stride time, posterior estimation of the data distributions for each participant and each experimental condition closely resembles the observed data distributions.

Supplementary Figure S5: Comparison of (A) posterior estimates and (B) observed values for stride length.
Supplementary Figure S6: Comparison of (A) posterior estimates and (B) observed values for stride time.

Compare models

Posteriors of the main model (AR1) and two alternative models (basic and time covariate) were plotted in combination with their corresponding non-informative and informative priors for comparison. Figure S7 and Figure S8 show means and standard deviations for posteriors of stride length and stride time, respectively. Informative priors did not have a visible influence on posteriors, whereas the three different models produced slightly different posteriors.

Supplementary Figure S7: Compare (A) mean and (B) standard deviation of posterior distributions from different models for stride length. Each data point in the boxplot represents aggregated value from one participant. Blue horizontal lines indicate mean values of the parameter calculated from the observed data, not all Bayesian model parameters could be directly calculated from the observed data. No effect of informative priors could be observed, whereas different models produced slightly different posteriors.
Supplementary Figure S8: Compare (A) mean and (B) standard deviation of posterior distributions from different models for stride time. Each data point in the boxplot represents aggregated value from one participant. Blue horizontal lines indicate mean values of the parameter calculated from the observed data, not all Bayesian model parameters could be directly calculated from the observed data. No effect of informative priors could be observed, whereas different models produced slightly different posteriors. Only posteriors whose chains have converged are plotted here.