Revealing posturographic features associated with the risk of falling in patients with Parkinsonian syndromes via machine learning

07/15/2019 ∙ by Ioannis Bargiotas, et al.

Falling in Parkinsonian syndromes (PS) is associated with postural instability and constitutes a common cause of disability among PS patients. Current posturographic practices record the body's center-of-pressure displacement (statokinesigram) while the patient stands on a force platform. Statokinesigrams, after appropriate signal processing, can offer numerous posturographic features, which, however, makes valid statistical analysis via standard univariate approaches challenging. In this work, we present the ts-AUC, a non-parametric multivariate two-sample test, which we employ to analyze statokinesigram differences among PS patients that are fallers (PSF) and non-fallers (PSNF). We included 123 PS patients who were classified into PSF or PSNF based on clinical assessment and underwent a simple Romberg test (eyes open/eyes closed). We analyzed posturographic features using both multiple testing with p-value adjustment and the ts-AUC. While the ts-AUC showed a significant difference between groups (p-value = 0.01), multiple testing did not show any such difference. Interestingly, a significant difference between the two groups was found only using the open-eyes protocol. PSF showed significantly increased antero-posterior movements as well as increased posturographic area, compared to PSNF. Our study demonstrates the superiority of the ts-AUC test compared to standard statistical tools in distinguishing PSF and PSNF in the multidimensional feature space. This result highlights more generally the fact that machine learning-based statistical tests can be seen as a natural extension of classical statistical approaches and should be considered, especially when dealing with multifactorial assessments.




1 Introduction

Postural control is the capacity of an individual to maintain a controlled upright position. Falls have been reported as one of the major causes of injury among the elderly and, more importantly, among patients with balance-related disorders, such as Parkinsonian syndromes (PS). It has been estimated that one third of the population over 65 years old experiences at least one fall per year [1]. Falls lead to decreased mobility, loss of autonomy in daily activities (bathing, cooking, etc.), or even death [2, 1]. Taking also into consideration the aging of many modern societies, accurate risk assessment has become a major challenge with huge socio-economic impact [3].

Force platforms are among the acquisition tools available to clinical researchers for the evaluation of postural control. Such platforms record the displacement over time of the center of pressure (CoP) applied by the whole body while the individual stands on the platform and follows the clinician’s instructions. These CoP trajectories, usually called statokinesigrams, have been widely used in assessing balance disorders in healthy or PS populations. It has been shown that CoP displacement characteristics can reflect individuals’ postural impairment when special acquisition protocols are followed [2, 4, 5].

Clinical research often aims to find the significant differences between fall-prone individuals and others who have not yet manifested important balance impairment. Researchers usually compute several features using signal processing techniques and evaluate their usefulness relying on a variety of available univariate tests, such as the Student’s t-test, Kolmogorov–Smirnov or Mann-Whitney Wilcoxon tests. However, in experimental works where pre-planned hypotheses are not well-fixed, multiple univariate tests are usually applied consecutively in order to find the features that significantly separate the two groups. This multiple testing scheme has been part of a well-known scientific debate, mainly criticized for the increased probability of reporting a false-positive finding. More specifically, it has been reported that for alpha level α = 0.05, it is possible that 1 in 20 relationships may be statistically significant but not clinically meaningful [6]. Thus, several biostatisticians recommend disclosing all the analyses that have been done, and not only the significant ones. The violation of this recommendation and the regular misuse of those tests [7], combined with the relatively small available cohorts, may lead to false conclusions and, as a consequence, to a significant lack of clinical consensus or at least a delay in reaching it. Well-known adjustments have been proposed in order to limit the aforementioned probability of a false-positive finding (such as the Bonferroni correction), but they have been reported as conservative compromises (due to the significant increase of the probability of a false-negative output) [6] that do not constitute a satisfactory solution [8].
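To make the inflation of the false-positive probability concrete, here is a short worked calculation. It relies on a simplifying assumption that the text does not make, namely that the m tests are independent and that every null hypothesis is true; the symbols m and α = 0.05 are used only for this illustration.

```latex
\[
  P(\text{at least one false positive}) = 1 - (1 - \alpha)^{m},
  \qquad
  \alpha = 0.05,\; m = 10
  \;\Longrightarrow\;
  1 - 0.95^{10} \approx 0.40 .
\]
```

Under these assumptions, ten uncorrected tests already carry roughly a 40% chance of at least one spurious "significant" feature, which is the effect that the correction procedures try to control.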

Classic statistical tests are very sensitive to the size of the available dataset. The generalization of any result is not safe when only relatively small populations are available (see [9] for the high risk of drawing false conclusions). In order to reduce this sensitivity, machine learning algorithms assess their results using cross-validation schemes. Briefly, an algorithm trains a model that ‘learns’ to solve the problem on a randomly selected part of the dataset (called the training-set), and then tests whether it can be effective on the rest of the ‘unseen’ data (the test-set). The learning and validation process is repeated multiple times and performance metrics are averaged. In the context of multidimensional datasets with binary labels, the idea of assessing the separability of two groups is based on the aforementioned learning and validation scheme. The learning process sets the criteria used to rank the population in the test-set by means of a scoring function. Those who are ranked at the top of the list are considered to belong to the positive class [10]. The machine learning community has recently made significant progress on this topic [11, 12], especially related to the design of appropriate criteria for the characterization of the ranking performance and/or meaningful extensions of the Empirical Risk Minimization (ERM) approach to this framework [13, 14]. In a large part of these efforts, the well-known criterion of the area under the ROC curve (AUC) is considered the gold standard for measuring the capacity of a scoring function to discriminate groups of populations [10]. Briefly, in the setting of two-sample statistical testing, an algorithm ‘learns’ the rule that maximizes the AUC between the two groups in the training-set, and then tests the applicability of this rule to the test-set during the validation process.

Unfortunately, to the best of our knowledge, these novel advancements remain largely unexploited by the parkinsonism-related community. The lack of common language and proper methodological simplifications to make the approaches easy to understand by clinical researchers are possibly the major reasons for such an observed distance.

In postural research, simple acquisition protocols (such as the basic Romberg test) have been reported to provide inconclusive information for sufficiently evaluating the postural control of an individual [15]. However, only recently have works proposed that a combination of multiple global features, derived from CoP trajectories using data mining techniques, might be advantageous for classifying fallers and non-fallers. Earlier works [16, 17] showed that, although none of the features alone could effectively classify elderly fallers/non-fallers (i.e. they are weak classifiers), combining all features through non-linear multi-dimensional classification gave significant results. This suggests that the decision surface indeed lies in a multidimensional space and should be learned using multiple features at once. As a consequence, the above findings raise reasonable questions about the ability of traditional statistical tools and testing protocols to fully reveal and exploit the existing associations.

The objective of the present study is to propose an easy-to-use-and-interpret two-sample hypothesis testing approach, in an attempt to address some of the aforementioned difficulties of clinical research. Our contribution is to first propose a new variation of a multivariate two-sample test through AUC maximization, which was originally theoretically established in [10], and to test it on a PS population which includes two groups: fallers (PSF) and non-fallers (PSNF). We intend to highlight the benefits of using this kind of two-sample analysis in the presence of multiple features, and to demonstrate the contradicting conclusions that a traditional statistical analysis (as in a hypothetical future clinical study) might have reached compared to the proposed method.

The remainder of the article is organized as follows: Population’s characteristics, acquisition protocol and analytical methodologies are presented in Sec. 2. Performance results are presented in Sec. 3. Discussion, limitations, conclusions and future perspectives are provided in Sec. 4.

2 Materials and methods

2.1 Balance measurements and fall assessment

Our dataset comes from the Neurology department of the HIA, Percy hospital (Clamart, France), and includes 123 patients (age and clinical characteristics in Tab. 1) who suffered from Parkinsonian syndromes. PS patients who suffered from other comorbidities (such as vestibular and proprioceptive impairments) were not included in the study. Following the acquisition protocol, patients were asked to remove their shoes and to maintain an upright position on a force platform, keeping their eyes open and their arms at their sides. The CoP trajectory was recorded for 25 seconds in that stance. After that, patients were asked to close their eyes while maintaining their upright position. After a ten-second pause, clinical experts recorded 25 additional seconds with eyes closed (Fig. 1).

| Characteristics | Non-Fallers | Fallers |
|---|---|---|
| Population | 99 | 24 |
| Age | 78.8 ± 5.3 | 78.5 ± 5.9 |
| Gender | M:71/W:28 | M:16/W:8 |
| UPDRS III total score | 23.6 ± 11.9 | 26.3 ± 11.1 |
| Disease duration | 4.7 ± 3.5 | 5.7 ± 4.2 |

Table 1: Characteristics of the 123 patients included in the dataset of our experiments.

Figure 1: Examples of statokinesigrams from fallers and non-fallers. The x-axis is the medio-lateral (ML) movement and the y-axis is the antero-posterior (AP) movement of the body in centimeters (cm) during the acquisition. As can be observed, fallers and non-fallers are not easily distinguishable by visual examination of their statokinesigrams.

Statokinesigrams were acquired using a Wii Balance Board (WBB) (Nintendo, Kyoto, Japan), which has been found to be a suitable and convenient tool for the clinical setting [18, 19], and the newly proposed portable package developed in our laboratory. Statokinesigrams from the WBB are sent to the clinician’s professional Android tablet via a Bluetooth connection. Acquired signals are sent (after anonymization and encryption) to a central database for high-level processing (computation of features associated with postural control and application of appropriate algorithms [16, 17, 20]), and the requested results are communicated to the clinician online. Since the WBB records the CoP trajectories at a non-stable time resolution, the acquired statokinesigrams are resampled at 25 Hz using the SWARII algorithm [21].
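The resampling step can be illustrated with a minimal sketch. It does not implement SWARII [21]; it substitutes plain linear interpolation onto a uniform 25 Hz grid, and the function name `resample_uniform` and the toy timestamps are hypothetical.

```python
from bisect import bisect_right


def resample_uniform(times, values, rate_hz=25.0):
    """Linearly interpolate an irregularly sampled signal onto a uniform grid.

    NOTE: plain linear interpolation, used as a stand-in; this is NOT SWARII.
    `times` must be strictly increasing (seconds); `values` holds the CoP
    coordinate along one axis (cm).
    """
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least two samples with matching lengths")
    step = 1.0 / rate_hz
    n_out = int((times[-1] - times[0]) / step + 1e-9) + 1
    grid = [times[0] + k * step for k in range(n_out)]
    out = []
    for t in grid:
        # Index of the last original sample at or before t (clamped so that
        # a right-hand neighbour always exists).
        i = min(bisect_right(times, t) - 1, len(times) - 2)
        t0, t1 = times[i], times[i + 1]
        w = (t - t0) / (t1 - t0)
        out.append(values[i] + w * (values[i + 1] - values[i]))
    return grid, out


# Toy example: irregular timestamps of a signal that is linear in time.
t_raw = [0.0, 0.03, 0.09, 0.10, 0.17, 0.20]
x_raw = [2.0 * t for t in t_raw]
t_uniform, x_uniform = resample_uniform(t_raw, x_raw, rate_hz=25.0)
```

Each axis of the statokinesigram would be resampled separately with the same timestamps, so that every downstream feature is computed on a common time base.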

In order to label the participants, a questionnaire (implemented on the Android tablet) was filled in for every subject, registering information about falls during the six months prior to the examination. As in previous works [22], participants were labeled as fallers (PSF) if they had come to a lower level near the ground unintentionally at least once during that period. Twenty-four (24) patients were labeled as fallers. Any useful information about the conditions of the falls was registered. The clinical trial registered at ANSM (ID RCB 2014-A00222-45) was approved by the following ethics committee/institutional review board(s): 1) Ethical Research Committees (CPP), Ile de France, Paris VI; 2) French National Agency for the Safety of Medicines and Health Products (ANSM); 3) National Commission on Informatics and Liberty (study complies with the MR-001). After information was provided and adequate time was allowed for consideration, written informed consent was obtained before participants were included in the study.

2.2 Choice of posturographic features

Our analysis included only features that were computed on the two-dimensional CoP displacement and have been previously proposed as indicators of postural impairment [23, 2, 24]. Tab. 2 provides the names, measuring units, and descriptions (where needed) for the features that were included in the test.

| Feature | Unit | Description |
|---|---|---|
| RangeX | cm | |
| MaxX | cm | Maximum medio-lateral displacement (right) |
| MinX | cm | Minimum medio-lateral displacement (left) |
| VarianceX | cm² | |
| VelocityX | cm/s | Average instant x-axis velocity of CoP changes |
| AccelerationX | cm/s² | Average instant x-axis acceleration of CoP changes |
| F95X | Hz | Frequency below which 95% of the x-axis CoP trajectory’s energy lies |
| RangeY | cm | |
| MaxY | cm | Maximum antero-posterior displacement (front) |
| MinY | cm | Minimum antero-posterior displacement (back) |
| VarianceY | cm² | |
| VelocityY | cm/s | Average instant y-axis velocity of CoP changes |
| AccelerationY | cm/s² | Average instant y-axis acceleration of CoP changes |
| F95Y | Hz | Frequency below which 95% of the y-axis CoP trajectory’s energy lies |
| DistC | cm | Instant distance from the center of the trajectory |
| EllArea | cm² | Area of the confidence ellipse that covers 95% of the trajectory’s points |
| AngularDeviation | degrees | Average of the angle of deviation |

Table 2: Computed features derived from the CoP displacement during the acquisitions.
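As an illustration of how some of these features might be computed from a uniformly sampled trajectory, here is a minimal sketch. The exact definitions used in the study are not given in the text, so several choices are assumptions: the trajectory is centred on its mean position, velocity is the mean absolute sample-to-sample change divided by the sampling period, and the ellipse area uses the usual bivariate-Gaussian formula with the 95% chi-square quantile. The function name `cop_features` is hypothetical, and F95, acceleration, and angular deviation are left out.

```python
import math
from statistics import mean, pvariance

CHI2_95_2DOF = 5.991  # 95% quantile of the chi-square distribution, 2 d.o.f.


def cop_features(x, y, rate_hz=25.0):
    """Compute a few Tab. 2-style features from a uniformly sampled CoP path (cm)."""
    dt = 1.0 / rate_hz
    mx, my = mean(x), mean(y)
    xc = [v - mx for v in x]  # centre the trajectory on its mean position
    yc = [v - my for v in y]
    var_x, var_y = pvariance(xc), pvariance(yc)
    cov_xy = mean(a * b for a, b in zip(xc, yc))
    # Mean absolute instantaneous velocity along each axis (cm/s); assumption.
    vel_x = mean(abs(b - a) / dt for a, b in zip(x, x[1:]))
    vel_y = mean(abs(b - a) / dt for a, b in zip(y, y[1:]))
    # Area of the 95% confidence ellipse of a bivariate Gaussian fit (cm^2).
    det = max(var_x * var_y - cov_xy ** 2, 0.0)
    ell_area = math.pi * CHI2_95_2DOF * math.sqrt(det)
    return {
        "RangeX": max(xc) - min(xc), "MaxX": max(xc), "MinX": min(xc),
        "VarianceX": var_x, "VelocityX": vel_x,
        "RangeY": max(yc) - min(yc), "MaxY": max(yc), "MinY": min(yc),
        "VarianceY": var_y, "VelocityY": vel_y,
        "DistC": mean(math.hypot(a, b) for a, b in zip(xc, yc)),
        "EllArea": ell_area,
    }


# Toy trajectory visiting the four corners of a 2 cm x 2 cm square.
features = cop_features([1.0, -1.0, 1.0, -1.0], [1.0, 1.0, -1.0, -1.0])
```

Each statokinesigram would yield one such feature vector, and the vectors of all patients form the multidimensional dataset on which the tests operate.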

2.3 Two-sample test through AUC optimization (ts-AUC)

Although the proposed algorithm originates from [10], herein we present some algorithmic and cross-validation modifications. In the current work we use a bootstrap aggregation classifier, in particular a random forest (RF) that comprises several decision trees (DTs). In the development of each DT, only a part of the whole dataset participates (in-bag) while the other part is left out (out-of-bag, or OOB). Consequently, the OOB subset can be used as a test-set for that particular DT. In our approach, instead of the originally proposed testing method based on data splitting, we used the predictions for the OOB population [26]. The number of DTs was large compared to the actual population. Individuals can be selected in different OOB sets more than once. Every time an individual is part of an OOB set, the corresponding DT outputs the probability of him/her being a PSF or a PSNF. This is computed as the fraction of individuals of the positive class (fallers) in the tree leaf that he/she reaches. Thus, his/her final score is given by the average of the posterior probabilities over the trees for which he/she was part of the OOB set (see Fig. 2). The averaged posterior probabilities of the positive class (fallers) are used to compute the Mann-Whitney $U$-test statistic, denoted by $U$. The empirical AUC for the chosen hyper-parameters is given by

$$\widehat{\mathrm{AUC}} = \frac{U}{n_{F}\, n_{NF}},$$

where $n_{F}$ and $n_{NF}$ are the numbers of fallers and non-fallers. Briefly, the null hypothesis, $H_0$, and the alternative one, $H_1$, are expressed as follows:

$$H_0:\ \mathrm{AUC}^{*} = \tfrac{1}{2} \quad \text{versus} \quad H_1:\ \mathrm{AUC}^{*} > \tfrac{1}{2}.$$

The OOB percentage was fixed to 36.8% of the included population. In the search for the empirical AUC* (maximal AUC), the hyper-parameters that are optimized are the leaf size and the number of features to be used per tree. We avoided a greedy approach by using a Bayesian optimization process in which only relatively shallow and simple DTs were allowed to be tested. The averaged posterior probabilities of the Star Model, i.e. the model that attains AUC*, are used to compute the scoring function (and the p-value) through a univariate Mann-Whitney Wilcoxon (MWW from now on) test on the whole available dataset (see Alg. 1 and Fig. 2).

Figure 2: Scheme of the ts-AUC algorithm. In order to find AUC* (the maximal AUC), a number of Random Forests (RFs) are trained. For the RF with the best AUC, the univariate Mann-Whitney Wilcoxon non-parametric two-sample test is applied on the average posterior probability values of the whole population.

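Input: the x and y coordinates of the points of each trajectory (statokinesigram); vectors with the required hyper-parameters.

Output: AUC*, the hyper-parameters of the best model, p-value.

  Step 1: Exploration of the space of hyper-parameters
    compute the posturographic features (Tab. 2) of every statokinesigram
    for each candidate leaf size do
        for each candidate number of features per tree do
            train an RF and average, per individual, the OOB posterior probabilities
            compute U and the empirical AUC of these scores
        end for
    end for
  Step 2: Choose the best model and apply MWW
    select the RF with the maximal AUC (AUC*)
    apply the MWW test to its averaged OOB posterior probabilities to obtain the p-value
Algorithm 1 The proposed ts-AUC statistical test.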
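The OOB scoring and the final MWW step can be sketched in a few lines of dependency-free Python. This is an illustration under stated assumptions, not the authors' implementation: each "tree" is a one-split stump on a randomly chosen feature (a deliberately simplified stand-in for the decision trees of a random forest), the hyper-parameter exploration of Step 1 is omitted, the p-value uses the normal approximation without tie correction, and the data are synthetic. The names `oob_scores` and `mww_auc_test` are hypothetical.

```python
import math
import random


def oob_scores(X, labels, n_trees=300, seed=0):
    """Average out-of-bag (OOB) probability of being a faller, per individual."""
    rng = random.Random(seed)
    n, n_feat = len(X), len(X[0])
    sums, counts = [0.0] * n, [0] * n
    for _ in range(n_trees):
        bag = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        in_bag = set(bag)
        j = rng.randrange(n_feat)                   # random feature
        thr = sorted(X[i][j] for i in bag)[n // 2]  # in-bag median as split
        # Fraction of fallers in each leaf, estimated on the in-bag sample.
        leaves = {True: [], False: []}
        for i in bag:
            leaves[X[i][j] > thr].append(labels[i])
        prior = sum(labels[i] for i in bag) / n
        prob = {side: (sum(v) / len(v) if v else prior)
                for side, v in leaves.items()}
        for i in range(n):
            if i not in in_bag:  # individual i is OOB for this stump
                sums[i] += prob[X[i][j] > thr]
                counts[i] += 1
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]


def mww_auc_test(scores, labels):
    """Mann-Whitney U of the positive class, empirical AUC, one-sided p-value."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    u = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    n1, n0 = len(pos), len(neg)
    auc = u / (n1 * n0)
    # Normal approximation of the null distribution of U (no tie correction).
    z = (u - n1 * n0 / 2) / math.sqrt(n1 * n0 * (n1 + n0 + 1) / 12)
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return u, auc, p_value


# Synthetic data (not patient data): 20 "fallers" shifted on the first feature.
rng = random.Random(1)
labels = [1] * 20 + [0] * 60
X = [[rng.gauss(1.5 * l, 1.0), rng.gauss(0.0, 1.0)] for l in labels]
scores = oob_scores(X, labels)
u, auc, p_value = mww_auc_test(scores, labels)
```

Because every score is produced only by stumps that never saw the individual during training, the single MWW test on the whole dataset plays the role of the validation step, which is the design choice described above.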

2.4 Out-of-bag feature importance

Additionally, the proposed algorithmic modifications also allow the assessment of the importance of each feature to the final ts-AUC decision. We estimated the out-of-bag feature importance by permutation. Briefly, the more important a feature is, the larger the increase in the model’s error after the random permutation of that feature in the OOB subset. The permutation of a non-influential feature will have minimal, or no, effect on the model’s error. Given the features of the dataset and the trees of the RF model, the influence of feature $j$ is computed as:

$$\mathrm{Imp}_j = \frac{\bar{d}_j}{\sigma_j},$$

where $\bar{d}_j$ is the average change of model error after the permutation of feature $j$, and $\sigma_j$ is the standard deviation of that change. It is important to note that every feature $j$ participates only in the training of a subset of the trees of the RF. Therefore, $\bar{d}_j$ and $\sigma_j$ are derived from those trees in which the feature was selected to participate in their training.
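A minimal sketch of this importance score follows: the average change in OOB error caused by permuting a feature, divided by the standard deviation of that change over the trees that used the feature. The function name `permutation_importance`, the use of the sample standard deviation, and the per-tree numbers are assumptions made for illustration only.

```python
from statistics import mean, stdev


def permutation_importance(delta_errors):
    """Importance of one feature from its per-tree OOB error changes.

    `delta_errors[t]` is (error after permuting the feature) minus (error
    before), measured on the OOB subset of tree t, for trees that used it.
    """
    spread = stdev(delta_errors)
    if spread == 0.0:
        return 0.0 if mean(delta_errors) == 0.0 else float("inf")
    return mean(delta_errors) / spread


# Hypothetical per-tree error changes for two features (illustrative numbers).
influential = [0.10, 0.12, 0.08, 0.11, 0.09]
irrelevant = [0.01, -0.01, 0.00, 0.02, -0.02]
imp_influential = permutation_importance(influential)
imp_irrelevant = permutation_importance(irrelevant)
```

Dividing by the spread rewards features whose permutation hurts the model consistently across trees, rather than strongly in a few trees only.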

Since our objective is to enhance the interpretability of results, our feature importance analysis aims to identify all the important features, even those which are redundant or collinear, rather than finding a parsimonious set of important features. Hence, we followed the additional procedure proposed in [27] especially for interpretation purposes. Briefly, we computed the OOB AUC of RFs starting from the most important feature, and progressively adding all the others in descending order of importance. The best model is the smallest model (fewest features) with an OOB AUC higher than the maximum OOB AUC reduced by its empirical standard deviation (based on 20 runs).
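The selection rule can be written as a short helper. This is a sketch under assumptions: the mean OOB AUC of each nested model and the standard deviation of the best one are taken as given, and the function name `smallest_adequate_model` and the AUC values are hypothetical.

```python
def smallest_adequate_model(auc_by_size, auc_std):
    """Number of features of the smallest nested model whose OOB AUC exceeds
    (best OOB AUC - empirical standard deviation of the best OOB AUC).

    `auc_by_size[k]` is the mean OOB AUC of the model built on the k + 1 most
    important features; `auc_std` is the std of the best model over the runs.
    """
    best = max(auc_by_size)
    threshold = best - auc_std
    for k, auc in enumerate(auc_by_size):
        if auc > threshold:
            return k + 1
    # Only reached when auc_std <= 0: fall back to the best model itself.
    return auc_by_size.index(best) + 1


# Hypothetical mean OOB AUCs for nested models with 1, 2, ..., 6 features.
aucs = [0.58, 0.63, 0.66, 0.67, 0.68, 0.675]
n_selected = smallest_adequate_model(aucs, auc_std=0.025)
```

With these illustrative numbers the threshold is 0.655, so the three-feature model is retained even though the five-feature model has the highest AUC.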

2.5 Experimental settings

We compare the results obtained by the proposed ts-AUC with the Maximum Mean Discrepancy test (MMD-test) [12], which is a well-established multivariate test and state-of-the-art in terms of performance. The MMD measures the maximum difference between the means of two data samples, in the space of probability measures of a Reproducing Kernel Hilbert Space (RKHS). In practice, the test statistic is the unbiased estimate of the squared MMD. It has been proven to be highly efficient and easy to use (an available package with kernel optimization is provided in [28]).
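For reference, the standard form of the unbiased squared MMD estimate is given below. The notation is introduced here for illustration and does not come from the text: $x_1, \dots, x_m$ and $y_1, \dots, y_n$ are the feature vectors of the two groups and $k$ is the kernel of the RKHS.

```latex
\[
  \widehat{\mathrm{MMD}}_u^{2}
  = \frac{1}{m(m-1)} \sum_{i \neq j} k(x_i, x_j)
  + \frac{1}{n(n-1)} \sum_{i \neq j} k(y_i, y_j)
  - \frac{2}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} k(x_i, y_j)
\]
```

The first two terms measure the average within-group similarity and the last term the average between-group similarity, so the statistic is large when the groups are internally coherent but dissimilar to each other.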

In addition, we compare the results of ts-AUC with standard statistical testing approaches which are usually used in clinical studies. We checked the p-values of all 17 features against the labels {‘faller’/‘non-faller’} using the non-parametric Mann-Whitney Wilcoxon test. Typically, clinicians would report those features which were found statistically significant (e.g. with p < 0.05) and any interesting non-significant finding.

In order to prevent the increase of the false-positive probability due to the large number of tested hypotheses, p-value adjustment procedures are applied. We use the Bonferroni correction, which is the most widely used p-value adjustment in biomedical research. Moreover, taking into account the criticism that Bonferroni has received [8], we also apply alternative approaches such as the Holm-Bonferroni [29] and Sidak corrections [30].
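The three corrections can be sketched with their textbook definitions; the function names are hypothetical. In the usage example, the ten p-values are those reported in Tab. 4, and the seven features not listed there are represented by a placeholder p-value of 1.0, which is an assumption that does not affect the decisions because those features rank last in any case.

```python
def bonferroni_level(alpha, m):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / m


def sidak_level(alpha, m):
    """Per-test significance level under the Sidak correction."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)


def holm_rejections(p_values, alpha=0.05):
    """Holm-Bonferroni step-down procedure; returns one boolean per p-value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        if p_values[i] > alpha / (m - rank):
            break  # stop at the first non-rejection
        reject[i] = True
    return reject


m = 17  # number of posturographic features tested
bonf = bonferroni_level(0.05, m)
sidak = sidak_level(0.05, m)

# p-values listed in Tab. 4, padded with placeholders for the unlisted features.
listed = [0.0045, 0.006, 0.006, 0.007, 0.008, 0.009, 0.03, 0.04, 0.04, 0.04]
holm = holm_rejections(listed + [1.0] * (m - len(listed)))

print(round(bonf, 4))   # 0.0029
print(round(sidak, 4))  # 0.003
print(any(holm))        # False
```

The smallest p-value (0.0045) already exceeds the first Holm threshold (0.05/17), so the step-down procedure stops immediately and no feature is declared significant.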

Finally, we assess the effect of population size to the final result by performing the following two additional experiments:

  • We progressively decrease, uniformly at random, the population size by a step of 10% (95% to 35%).

  • We progressively reduce, uniformly at random, the number of PSNF by a step of 10% (95% to 35%).

At every step, all analyses were run 12 times and the percentages of significant results were compared (see Fig. 5 and Fig. 6).
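The two subsampling experiments can be sketched as follows. The function name `significant_fraction`, the interpretation of the percentages as the retained fraction `keep`, and the dummy test are assumptions; any two-sample test that returns a p-value could be passed as `test_fn`.

```python
import random


def significant_fraction(test_fn, fallers, non_fallers, keep,
                         balance_only=False, n_runs=12, alpha=0.05, seed=0):
    """Fraction of runs in which `test_fn(fallers, non_fallers)` gives p < alpha.

    keep: fraction of individuals retained (0.95 down to 0.35 in the text).
    balance_only: if True, keep every faller and subsample non-fallers only.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_runs):
        if balance_only:
            f = list(fallers)
            nf = rng.sample(non_fallers, round(keep * len(non_fallers)))
        else:
            # Subsample the pooled population uniformly at random.
            pool = [(v, 1) for v in fallers] + [(v, 0) for v in non_fallers]
            sub = rng.sample(pool, round(keep * len(pool)))
            f = [v for v, is_faller in sub if is_faller]
            nf = [v for v, is_faller in sub if not is_faller]
        hits += test_fn(f, nf) < alpha
    return hits / n_runs


# Dummy test that records the group sizes and always reports p = 0.01.
sizes = []


def dummy_test(f, nf):
    sizes.append((len(f), len(nf)))
    return 0.01


rate = significant_fraction(dummy_test, list(range(24)), list(range(99)),
                            keep=0.75, balance_only=True)
```

Repeating the call for each retained fraction and each test yields the curves of significant-result percentages that the two experiments compare.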

3 Results

The presented ts-AUC test was applied using the features derived from statokinesigrams from Eyes-Open and Eyes-Closed acquisitions. Tab. 3 contains the p-values obtained for the two groups by applying the ts-AUC and MMD tests. Both tests agreed that the features derived from Eyes-Open statokinesigrams significantly separated PSF from PSNF, contrary to those from Eyes-Closed, which did not show a significant result (Tab. 3). Therefore, we henceforth present detailed analysis only for Eyes-Open features.

| Data type | MMD result | ts-AUC result |
|---|---|---|
| Eyes-Open | H0 rejected * | p = 0.01 * |
| Eyes-Closed | H0 not rejected | not significant |

Table 3: The p-values obtained by the application of the ts-AUC and MMD tests on the features extracted from Eyes-Open and Eyes-Closed statokinesigrams. Features derived from Eyes-Closed statokinesigrams did not show a statistically significant result using either the ts-AUC or the MMD test. Therefore, the study did not proceed to further analysis of these statokinesigrams. The statistically significant results are indicated by ‘ * ’.

The most influential features were found to be VelocityY, VarianceY, AccelerationY, EllArea (confidence ellipse area), and MaxX (see their relative importance in Fig. 3 and their mean ± standard deviation per group in Fig. 4). Tab. 4 lists the features that showed p < 0.05 and the decisions regarding statistical significance obtained after applying each of the three employed corrections. Interestingly, although AccelerationY did not show statistical significance in the MWW test, it was found to be one of the influential features by the ts-AUC test. According to Tab. 4, using the results from the three corrections with level α = 0.05, none of the features would reject the H0 of the two-sample MWW test.

Figure 3: The importance of features as estimated by applying the approach of [27] using the hyper-parameters that produced the best RF (the Star Model).

Figure 4: Radar chart comparing fallers and non-fallers based on the mean (o) and standard deviation (-) of the most important features of our analysis. All six features are positively correlated with low postural control, which justifies the meaningfulness of inspecting the area of the curves in this chart. The profile of the two groups is significantly different.

| Feature | p-value of MWW | Bonferroni level | Holm-Bonferroni level | Sidak level |
|---|---|---|---|---|
| EllArea | 0.0045 | 0.0029 | 0.0029 | 0.003 |
| VarianceY | 0.006 | 0.0029 | 0.0033 | 0.003 |
| MaxY | 0.006 | 0.0029 | 0.0036 | 0.003 |
| DistC | 0.007 | 0.0029 | 0.0031 | 0.003 |
| RangeY | 0.008 | 0.0029 | 0.0038 | 0.003 |
| VelocityY | 0.009 | 0.0029 | 0.0071 | 0.003 |
| MaxX | 0.03 | 0.0029 | 0.0045 | 0.003 |
| RangeX | 0.04 | 0.0029 | 0.005 | 0.003 |
| VarianceX | 0.04 | 0.0029 | 0.0042 | 0.003 |
| MinY | 0.04 | 0.0029 | 0.0063 | 0.003 |

Table 4: Significant results of a univariate two-sample Mann-Whitney Wilcoxon (MWW) test, and the levels of significance after Bonferroni, Holm-Bonferroni, and Sidak corrections. Every p-value presented in the MWW column is compared with the corresponding level of significance. After the corrections, the p-values derived by MWW were always greater than the corresponding level of significance. Therefore, none of the features can reject the null hypothesis of equal medians at the default 5% significance level.

3.1 Population size

As expected, the decrease of population size had an important effect on the performance of all tests. Both the ts-AUC and the MMD test showed similar behavior with the progressive decrease of population size. Specifically, the number of times that the fallers and non-fallers were found statistically different gradually decreased. Once the population size had been decreased by 55%, the two groups were found significantly different in less than 50% of the cases (Fig. 5). Univariate testing with MWW followed a similar decrease. Multiple testing showed that the groups can almost never be considered statistically different.

Figure 5: The average performance of two-sample testing approaches with smaller population. The dataset size was progressively decreased by a step of 10%. The included subset of each step was selected uniformly at random 12 times and the tests were run in every iteration. We observe that ts-AUC and MMD have almost the same performance. Decreasing the population leads to a lower chance of distinguishing the two groups. On the other hand, all the p-value corrections present significantly lower performance.

Regarding Fig. 6, which shows the important role of the size proportion among the groups, the performance of ts-AUC, MMD, and multiple testing was comparable to that in Fig. 5 (uniform decrease of the population size). However, ts-AUC and MMD exhibit a less abrupt decrease of performance. On the other hand, the gradual balancing of the sizes of the two groups, through the exclusion of non-fallers, seems to have a minor effect on the univariate MWW testing.

Figure 6: The average performance of two-sample testing approaches with smaller non-faller population. The non-fallers were progressively excluded, by a step of 10%, in order to balance the size of the two groups without excluding fallers. The included subset of each step was selected uniformly at random 12 times, all fallers were included, and the tests were run in every iteration. We observe that ts-AUC and MMD have almost equal performance. Decreasing the non-faller population leads to a lower chance of distinguishing the two groups. On the other hand, all the p-value corrections present significantly lower performance.

4 Discussion

The objective of this study was to introduce an easy, interpretable, and intuitive multivariate two-sample testing strategy. The particular interest of this study was to highlight the beneficial effect that this approach can have in clinical research, and particularly in the research of postural control in PS patients. Using the proposed statistical testing approach, it was shown that: a) different profiles between fallers and non-fallers were observed only for the Eyes-Open protocol; b) the fall-prone PS patients have a significantly different statokinesigram profile during quiet standing from those who are non-fallers, contrary to the classic multiple testing approach, which did not agree with such a result; c) the novel multivariate two-sample testing approach (ts-AUC) showed equal performance with the state-of-the-art Maximum Mean Discrepancy (MMD) test, with the additional element of providing feature importance assessment; d) VelocityY, VarianceY, AccelerationY, EllArea (confidence ellipse area), and MaxX appeared to be the most important features for distinguishing fallers and non-fallers.

4.1 Comparison between multivariate and multiple testing

One of the main results of this article is that the proposed multivariate two-sample test, the ts-AUC, and the standard statistics usually used in clinical studies lead to contradictory conclusions when both are applied to the dataset of PS patients. The multivariate approach found fallers’ and non-fallers’ statokinesigram characteristics significantly different, while traditional statistics did not confirm this result. The disagreement of the traditional approach seems to be linked to the relative conservatism of the traditional p-value correction strategies (increase of the probability of false-negative findings) [6, 8].

Researchers can always perform multiple univariate tests without applying correction strategies (see univariate MWW results in Tab. 4, Fig. 5, and Fig. 6), and take the risk of having a false-positive finding. However, when modest evidence is found in relatively small populations after multiple testing, the aforementioned false-positive probability is significantly high. The level of that risk may be controlled when some criteria are met (see [6]) considering the quality of the study, the quality of the dataset and the clinical strength of pre-set hypotheses. In exploratory studies, though, some of the p-values around 0.05, whichever side they may lie on, would definitely be considered as “interesting hints”, whereas drawing conclusions from such findings without thoughtful consideration should generally be avoided [9]. The multivariate and cross-validated approaches can decrease the aforementioned uncertainty. The proposed ts-AUC test has interesting and convenient properties: it is easy to implement and interpret, and it can also be applied to other similar multidimensional datasets.

4.2 Posturographic profiles - PSF versus PSNF

The features included in our analysis have been used by clinical researchers in the past. Most of them were proposed as indicators of balance impairment at least once in the clinical literature (indicative references [23, 31, 2, 24]). We deliberately avoided any feature engineering or transformation process, not only because that goes beyond the scope of this study, but also because we intended to focus particularly on the merits of the newly proposed approach.

Interestingly, only the Eyes-Open acquisition allowed fallers to be significantly distinguished from non-fallers in a population of PS patients. This result seems slightly contradictory, since PS patients exhibit increased dependency on visual sensing [32]. By exploiting the advantage of the ts-AUC test, which automatically provides the importance assessment of features, we found that medio-lateral movement also played a role in the faller/non-faller separation of PS patients (see Fig. 3 and Fig. 4). The medio-lateral movement has been reported as the most discriminative element between PS patients and age-matched controls [5] and seems to play a role in distinguishing faller and non-faller PS patients. However, the key difference between fallers and non-fallers was spotted in antero-posterior movement. VelocityY, VarianceY, and AccelerationY, which may carry overlapping information, were found among the most influential features for the faller/non-faller separation. This result is in line with previous works that reported increased antero-posterior movement of PS patients in quiet-standing conditions with eyes open [33, 34, 35]. Although many PS patients with low postural control did not manifest large posturographic areas, the confidence ellipse area (EllArea) was found to be significantly larger in fallers compared to non-fallers (Fig. 4). However, the EllArea value of non-fallers was highly dispersed. Therefore, larger faller cohorts are needed in order to draw safer conclusions. It is recommended that the confidence ellipse area always be considered together with antero-posterior features such as variance and velocity, in order to perform more accurate postural control evaluations.

4.3 Algorithmic aspects

The choice of using the OOB observations as the cross-validation method has two basic advantages: 1) it provides faster results in the AUC maximization process, and 2) it allows the final MWW test to be applied once to the whole dataset, which is more intuitive for clinicians. In cases where the population size is sufficiently large and the hypothesis of similar distributions between train and test-sets is not violated, it is expected that more classic methods such as a train-test split (as originally proposed in [10]) would have given the same result (or even a better one; OOB prediction error results have been reported as slightly overestimated [36]). However, clinical datasets are usually limited in size and the aforementioned assumption about the same distribution is not always fully guaranteed. In these cases, multiple train-test splits seem more appropriate, although they would significantly lengthen the testing process. OOB observations can be seen as an internal multiple train-test split (one per tree-learner) of the RF (each observation’s prediction is produced by only a subset of the trees), but with the nice intuition that the final two-sample MWW test is applied once to the whole dataset after the validation process.

Another important modification is the addition of unbiased feature importance through random permutation of OOB observations. We believe that this property is a cornerstone of the proposed approach and in line with clinicians' current needs. While they need to know whether two groups are significantly separated, they are also interested in knowing the most influential features that lead to the reported result. Although the algorithm offers this convenience, we need to note that feature importance should be treated with extra care. The proposed approach tries to minimize false conclusions concerning the importance of features when redundant or highly collinear features are present, but this topic is still under research. A general piece of advice to clinicians is to check for features exhibiting mutual information before beginning the testing process.
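A generic permutation importance, measured as the mean drop in AUC when one feature column is shuffled, can be sketched as below. This is a simplified illustration: it shuffles the column over the full dataset with a fixed, hypothetical scoring function, whereas the approach described above permutes OOB observations within the forest. The toy data deliberately contain a duplicated feature to hint at why redundant features call for care.

```python
import random


def auc(scores, labels):
    """AUC as the normalised Mann-Whitney U statistic (ties count 1/2)."""
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    u = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return u / (len(pos) * len(neg))


def permutation_importance(score_fn, X, y, n_repeats=20, seed=0):
    """Return (baseline AUC, list of mean AUC drops, one per feature)."""
    rng = random.Random(seed)
    base = auc([score_fn(row) for row in X], y)
    drops = []
    for j in range(len(X[0])):
        total = 0.0
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)  # break the link between feature j and the labels
            Xp = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            total += base - auc([score_fn(row) for row in Xp], y)
        drops.append(total / n_repeats)
    return base, drops


def score_fn(row):
    """Hypothetical scorer: uses features 0 and 1, ignores feature 2."""
    return row[0] + row[1]


# Feature 1 is an exact copy of feature 0; feature 2 is unrelated to the labels.
X = [[float(i), float(i), float(i % 3)] for i in range(20)]
y = [0] * 10 + [1] * 10

base_auc, importances = permutation_importance(score_fn, X, y)
print(base_auc, importances)
```

When one copy of a duplicated feature is shuffled, the other copy still carries the signal, so the measured drop understates how informative that feature is on its own.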

4.4 Population effect

The features computed by the basic Romberg test have been reported as relatively inconclusive in distinguishing fallers and non-fallers, mainly due to the lack of realistic fall conditions [15]. The available patient dataset, with its relatively "marginal" separation between fallers and non-fallers (see Tab. 4), can be considered an ideal dataset for checking the performance of the newly proposed approach. We consider the MMD algorithm (see Sec. 2.5) as the gold-standard method in terms of separability of the two groups. The fact that ts-AUC shows similar performance to that of MMD is very important, especially considering that the proposed ts-AUC can also provide additional information about the most influential features without the need for any supplementary (meta-)analysis. Therefore, it would be fair to say that ts-AUC is competitive in terms of performance, while also boosting the interpretability of the result for the convenience of clinicians.

Interestingly, decreasing the overall population and gradually balancing the groups of fallers and non-fallers showed that the proposed test is less conservative than the multiple testing process (with corrections). Exploratory studies, where a hypothesis about the structure of the dataset is not strictly defined in advance, could benefit from such multivariate approaches.
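To illustrate why corrected multiple testing can be conservative when many features are tested, here is a minimal sketch of a step-down correction of the Holm type [29]. Whether this is the exact correction used in the study is an assumption, and the p-values below are invented for illustration only.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure; returns one boolean per hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value is compared with alpha / (m - k + 1).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first hypothesis that is not rejected
    return reject


# Four hypothetical per-feature p-values.
print(holm_reject([0.01, 0.04, 0.03, 0.20]))
# Twenty features that would each be significant if tested alone.
print(holm_reject([0.01] * 20))
```

In the second call every uncorrected p-value is below 0.05, yet none survives the correction because the smallest one is compared with 0.05 / 20. A single multivariate test avoids this per-feature penalty.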

Comparing the results of the two population reduction schemes, i.e. the uniform reduction of the population versus the reduction of non-fallers (the larger group), we observe that all the statistical tests performed slightly worse in the former case. This was an expected result since fallers were only 24 out of the 123 available PS patients, and thus decreasing the size of that group made the fallers heavily underrepresented in the produced subsample.
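The two reduction schemes can be sketched as follows, using the group sizes stated above (24 fallers and 99 non-fallers). The function names, the retained fraction and the number of retained non-fallers are hypothetical choices for illustration, not the settings used in the study.

```python
import random


def uniform_reduction(labels, fraction, rng):
    """Keep a random fraction of all patients, regardless of group."""
    k = int(len(labels) * fraction)
    return sorted(rng.sample(range(len(labels)), k))


def majority_reduction(labels, n_keep_majority, rng, majority=0):
    """Keep every minority patient and a random subset of the majority group."""
    minority_idx = [i for i, label in enumerate(labels) if label != majority]
    majority_idx = [i for i, label in enumerate(labels) if label == majority]
    return sorted(minority_idx + rng.sample(majority_idx, n_keep_majority))


labels = [1] * 24 + [0] * 99  # 1 = faller, 0 = non-faller
rng = random.Random(0)

uniform_idx = uniform_reduction(labels, 0.5, rng)
reduced_idx = majority_reduction(labels, 37, rng)

# Both subsamples have the same size, but not the same number of fallers.
print(len(uniform_idx), sum(labels[i] for i in uniform_idx))
print(len(reduced_idx), sum(labels[i] for i in reduced_idx))
```

Uniform reduction removes fallers in proportion to their share of the population, whereas reducing only the non-fallers keeps all 24 fallers and moves the subsample toward balance.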

4.5 Limitations

The first limitation of this study is the lack of sufficient evidence about the reasons behind falls. The basic Romberg test has been reported to be an insufficient protocol to provide such physiological information [15, 37]. Previous studies proposed richer protocols (including multi-tasking or the use of foam surfaces [2, 4, 37]) for postural control assessment of fragile individuals such as PS patients. Undoubtedly, such protocols can have a beneficial effect on the faller/non-faller classification, as well as on the impairment assessment of patients (visual, vestibular, somatosensory, nervous system). Yet, among the objectives of this work was to show that the basic Romberg test does contain fall risk-related information, whose extraction and full exploitation largely depend on the adequacy of the employed statistical analytics.

It is worth noting that there is always some uncertainty in what patients report as their recent fall experience. Participants who were asked about previous falls might confabulate without a conscious intention to deceive (recall bias). Therefore, some of the fallers might be mistakenly labeled as non-fallers. Machine learning algorithms are usually robust to the presence of such label noise, which in our opinion is always minor.

In extreme cases of imbalanced datasets with many negative values and few positive ones, metrics other than the AUC, such as the precision-recall (PR) curve, the F-score or the area under the PR curve, could be more appropriate in order to control possible overfitting [38]. We decided to keep the criterion (AUC) initially proposed by [10] for balanced datasets, in order to fulfill one of our main objectives: to keep the proposed algorithm as understandable, interpretable and easy to implement as possible. In return, as already mentioned, we controlled the optimization of the leaf size and of the number of features, and we applied cross-validation in each resulting case.
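The difference between the two families of metrics on a heavily imbalanced dataset can be seen on a small hand-made ranking. The sketch below computes the ROC AUC and the average precision (a common estimate of the area under the PR curve); the data are invented, and the average-precision function assumes there are no tied scores.

```python
def roc_auc(scores, labels):
    """AUC as the normalised Mann-Whitney U statistic (ties count 1/2)."""
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    u = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return u / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean of the precision values at the rank of each positive (no ties)."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    true_pos, total = 0, 0.0
    for k, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            total += true_pos / k  # precision among the top-k observations
    return total / true_pos


# 2 positives among 100 observations, ranked 3rd and 6th by the score.
labels = [0, 0, 1, 0, 0, 1] + [0] * 94
scores = [float(100 - rank) for rank in range(100)]

auc_value = roc_auc(scores, labels)
ap_value = average_precision(scores, labels)
print(round(auc_value, 3), round(ap_value, 3))  # 0.969 0.333
```

The AUC looks excellent because most negatives are ranked below both positives, while the average precision shows that only a third of the top-ranked observations are true positives.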

The use of the Wii Balance Board (WBB) as a force platform during the acquisition protocol is another notable limitation. The reliability of the WBB as a medical examination tool has been previously questioned [39]. The basic reported drawbacks were: a) the modest agreement with laboratory-grade force platforms, b) the lower signal-to-noise ratio of its recordings, and c) the irregular sampling rate [40]. We are fully aware of the aforementioned limitations. However, the WBB is increasingly popular in posturography studies as a valid tool for assessing standing balance [18, 19]. It is an inexpensive piece of equipment and hence seems ideal for applications that intend to provide a quick and low-cost first screening of individuals with a certain possibility of postural control loss. In addition, recent works [19, 21] showed that careful preprocessing can mitigate some of these drawbacks.

5 Conclusions and perspectives

In this paper we showed that, using the proposed ts-AUC test, which is a two-sample test based on AUC maximization, faller and non-faller patients who suffer from Parkinsonian syndromes (PS) can actually be distinguished by examining posturographic features derived from the basic Romberg protocol. This novel approach was also able to indicate the posturographic features that are significantly different between the two groups. We confirmed that a fall-prone PS patient may manifest wider and more abrupt antero-posterior oscillations and larger posturographic areas compared to a non-faller. This separation appeared statistically less detectable when using more traditional approaches such as multiple testing. Interestingly, the above results were observed only in statokinesigrams derived from the Eyes-Open protocol. The results of our study highlight that new multivariate methods based on machine learning, such as ts-AUC, can play an important role in assessing the usefulness of simple and inexpensive acquisition protocols as well as of the extracted posturographic features.


Acknowledgments

The authors would like to thank Julien Audiffren for the initial database construction and the implementation of the statokinesigrams’ preprocessing (SWARII algorithm [21]) that we have used. We also thank Albane Moreau for providing the additional database information concerning the PS patients. Part of this work was funded by the IdAML Chair hosted at ENS-Paris-Saclay.


  • [1] M.E. Tinetti, “Preventing falls in elderly persons,” New England journal of medicine, vol. 348, no. 1, pp. 42–49, 2003.
  • [2] I. Melzer, N. Benjuya, and J. Kaplanski, “Postural stability in the elderly: a comparison between fallers and non-fallers,” Age and ageing, vol. 33, no. 6, pp. 602–607, 2004.
  • [3] J.A. Stevens, P.S. Corso, E.A. Finkelstein, and T.R. Miller, “The costs of fatal and non-fatal falls among older adults,” Injury prevention : journal of the International Society for Child and Adolescent Injury Prevention, vol. 12, no. 5, pp. 290–295, 2006.
  • [4] J.R. Chagdes, S. Rietdyk, J.M. Haddad, H.N. Zelaznik, A. Raman, C.K. Rhea, and T.A. Silver, “Multiple timescales in postural dynamics associated with vision and a secondary task are revealed by wavelet analysis,” Experimental Brain Research, vol. 197, no. 3, pp. 297–310, 2009.
  • [5] M. Mancini, P. Carlson-Kuhta, C. Zampieri, J.G. Nutt, L. Chiari, and F.B. Horak, “Postural sway as a marker of progression in Parkinson’s disease: a pilot longitudinal study,” Gait & posture, vol. 36, no. 3, pp. 471–476, 2012.
  • [6] R.J. Feise, “Do multiple outcome measures require p-value adjustment?,” BMC Medical Research Methodology, vol. 2, no. 1, pp. 8, 2002.
  • [7] M.S. Thiese, Z.C. Arnold, and S.D. Walker, “The misuse and abuse of statistics in biomedical research,” Biochemia Medica, vol. 25, no. 1, pp. 5–11, Feb 2015.
  • [8] T.V. Perneger, “What’s wrong with Bonferroni adjustments,” British Medical Journal, vol. 316, no. 7139, pp. 1236–1238, Apr 1998.
  • [9] J. Wood, N. Freemantle, M. King, and I. Nazareth, “Trap of trends to statistical significance: likelihood of near significant p value becoming more significant with extra data,” BMJ, vol. 348, pp. g2215, 2014.
  • [10] N. Vayatis, M. Depecker, and S.J. Clémençon, “AUC optimization and the two-sample problem,” in Advances in Neural Information Processing Systems, 2009, pp. 360–368.
  • [11] S. Clémençon, G. Lugosi, and N. Vayatis, “Ranking and scoring using empirical risk minimization,” in Proceedings of the International Conference on Computational Learning Theory, 2005, pp. 1–15.
  • [12] A. Gretton, K.M. Borgwardt, M.J. Rasch, B. Schölkopf, and A. Smola, “A kernel two-sample test,” Journal of Machine Learning Research, vol. 13, no. Mar, pp. 723–773, 2012.
  • [13] S. Agarwal, T. Graepel, R. Herbrich, S. Har-Peled, and D. Roth, “Generalization bounds for the area under the ROC curve,” Journal of Machine Learning Research, vol. 6, no. Apr, pp. 393–425, 2005.
  • [14] C. Cortes and M. Mohri, “AUC optimization vs. error rate minimization,” in Advances in Neural Information Processing Systems, 2004, pp. 313–320.
  • [15] R.M. Palmieri, C.D. Ingersoll, M.B. Stone, and B.A. Krause, “Center-of-pressure parameters used in the assessment of postural control,” Journal of Sport Rehabilitation, vol. 11, no. 1, pp. 51–66, 2002.
  • [16] J. Audiffren, I. Bargiotas, N. Vayatis, P.-P. Vidal, and D. Ricard, “A non linear scoring approach for evaluating balance: classification of elderly as fallers and non-fallers,” PLoS ONE, vol. 11, no. 12, 2016.
  • [17] I. Bargiotas, J. Audiffren, N. Vayatis, P.-P. Vidal, S. Buffat, A.P. Yelnik, and D. Ricard, “On the importance of local dynamics in statokinesigram: A multivariate approach for postural control evaluation in elderly,” PLoS ONE, vol. 13, no. 2, pp. e0192868, 2018.
  • [18] R.A. Clark, A.L. Bryant, Y. Pua, P. McCrory, K. Bennell, and M. Hunt, “Validity and reliability of the Nintendo Wii balance board for assessment of standing balance,” Gait & posture, vol. 31, no. 3, pp. 307–310, 2010.
  • [19] J.M. Leach, M. Mancini, R.J. Peterka, T.L. Hayes, and F.B. Horak, “Validating and calibrating the Nintendo Wii balance board to derive reliable center of pressure measures,” Sensors, vol. 14, no. 10, pp. 18244–18267, 2014.
  • [20] I. Bargiotas, A. Moreau, A. Vienne, F. Bompaire, M. Baruteau, M. de Laage, M. Campos, D. Psimaras, N. Vayatis, C. Labourdette, P.-P. Vidal, D. Ricard, and S. Buffat, “Balance impairment in radiation induced leukoencephalopathy patients is coupled with altered visual attention in natural tasks,” Frontiers in Neurology, vol. 9, pp. 1185, 2019.
  • [21] J. Audiffren and E. Contal, “Preprocessing the Nintendo Wii board signal to derive more accurate descriptors of statokinesigrams,” Sensors, vol. 16, no. 8, pp. 1208, 2016.
  • [22] A.A. Zecevic, A.W. Salmoni, M. Speechley, and A.A. Vandervoort, “Defining a fall and reasons for falling: comparisons among the views of seniors, health care providers, and the research literature,” The Gerontologist, vol. 46, no. 3, pp. 367–376, 2006.
  • [23] J.W. Błaszczyk, R. Orawiec, D. Duda-Kłodowska, and G. Opala, “Assessment of postural instability in patients with Parkinson’s disease,” Experimental Brain Research, vol. 183, no. 1, pp. 107–114, 2007.
  • [24] J.W. Muir, D.P. Kiel, M. Hannan, J. Magaziner, and C.T. Rubin, “Dynamic parameters of balance which correlate to elderly persons with a history of falls,” PLoS ONE, vol. 8, no. 8, pp. e70566, 2013.
  • [25] L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–32, 2001.
  • [26] L. Breiman, “Out-of-bag estimation,” Tech. Rep., Dept. of Statistics, Univ. of California, Berkeley, 1996.
  • [27] R. Genuer, J.-M. Poggi, and C. Tuleau-Malot, “Variable selection using random forests,” Pattern Recognition Letters, vol. 31, no. 14, pp. 2225–2236, 2010.
  • [28] D.J. Sutherland, H-Y Tung, H. Strathmann, S. De, A. Ramdas, A. Smola, and A. Gretton, “Generative models and model criticism via optimized maximum mean discrepancy,” in Proceedings of the International Conference on Learning Representations, 2017.
  • [29] S. Holm, “A simple sequentially rejective multiple test procedure,” Scandinavian Journal of Statistics, pp. 65–70, 1979.
  • [30] Z. Šidák, “Rectangular confidence regions for the means of multivariate normal distributions,” Journal of the American Statistical Association, vol. 62, no. 318, pp. 626–633, 1967.
  • [31] M. Mancini, A. Salarian, P. Carlson-Kuhta, C. Zampieri, L. King, L. Chiari, and F.B. Horak, “ISway: a sensitive, valid and reliable measure of postural control,” Journal of Neuroengineering and Rehabilitation, vol. 9, no. 1, pp. 1, 2012.
  • [32] S. Rinalduzzi, C. Trompetto, L. Marinelli, A. Alibardi, P. Missori, F. Fattapposta, F. Pierelli, and A. Currà, “Balance dysfunction in Parkinson’s disease,” BioMed Research International, vol. 2015, 2015.
  • [33] G.K. Kerr, C.J. Worringham, M.H. Cole, P.F. Lacherez, J.M. Wood, and P.A. Silburn, “Predictors of future falls in Parkinson disease,” Neurology, vol. 75, no. 2, pp. 116–124, 2010.
  • [34] M. Matinolli, J.T. Korpelainen, R. Korpelainen, K.A. Sotaniemi, M. Virranniemi, and V.V. Myllylä, “Postural sway and falls in Parkinson’s disease: a regression approach,” Movement Disorders, vol. 22, no. 13, pp. 1927–1935, 2007.
  • [35] M.D. Latt, S.R. Lord, J.G. Morris, and V.S. Fung, “Clinical and physiological assessments for elucidating falls risk in Parkinson’s disease,” Movement Disorders, vol. 24, no. 9, pp. 1280–1289, 2009.
  • [36] S. Janitza and R. Hornung, “On the overestimation of random forest’s out-of-bag error,” PLoS ONE, vol. 13, no. 8, pp. e0201904, 2018.
  • [37] J. Swanenburg, E.D. de Bruin, D. Uebelhart, and T. Mulder, “Falls prediction in elderly people: a 1-year prospective study,” Gait & Posture, vol. 31, no. 3, pp. 317–321, 2010.
  • [38] J. Davis and M. Goadrich, “The relationship between precision-recall and ROC curves,” in Proceedings of the International Conference on Machine Learning. ACM, 2006, pp. 233–240.
  • [39] G. Pagnacco, E. Oggero, and C. Wright, “Biomedical instruments versus toys: a preliminary comparison of force platforms and the Nintendo Wii balance board-biomed 2011,” Biomedical Sciences Instrumentation, vol. 47, pp. 12–17, 2011.
  • [40] L. Castelli, L. Stocchi, M. Patrignani, G. Sellitto, M. Giuliani, and L. Prosperini, “We-measure: Toward a low-cost portable posturography for patients with multiple sclerosis using the commercial wii balance board,” Journal of the Neurological Sciences, vol. 359, no. 1-2, pp. 440–444, 2015.