Body movement to sound interface with vector autoregressive hierarchical hidden Markov models

10/26/2016 ∙ by Dimitrije Marković, et al. ∙ University of Belgrade, TU Dresden

Interfacing a kinetic action of a person to an action of a machine system is an important research topic in many application areas. One of the key factors for intimate human-machine interaction is the ability of the control algorithm to detect and classify different user commands with the shortest possible latency, thus creating a highly correlated link between cause and effect. In our research, we focused on the task of mapping user kinematic actions into sound samples. The presented methodology relies on wireless sensor nodes equipped with inertial measurement units and on a real-time algorithm dedicated to early detection and classification of a variety of movements/gestures performed by a user. The core algorithm is based on approximate Bayesian inference of Vector Autoregressive Hierarchical Hidden Markov Models (VAR-HHMM), where the model database is derived from the set of motion gestures. The performance of the algorithm was compared with an online version of the K-nearest neighbours (KNN) algorithm, where we used offline expert based classification as the benchmark. In almost all of the evaluation metrics (e.g. confusion matrix, recall and precision scores) the VAR-HHMM algorithm outperformed KNN. Furthermore, the VAR-HHMM algorithm, in some cases, achieved faster movement onset detection compared with the offline standard. The proposed concept, although envisioned for movement-to-sound application, could be implemented in other human-machine interfaces.

I Introduction

Traditional musical instruments generate sounds as a result of interaction between a targeted musician’s movement and the inherent physical properties of the instruments. However, when playing music an experienced musician does not move purely in a way that allows optimal performance on a particular instrument, but in a way that communicates personal experiences to an audience [jensenius2007action, gritten2006music, cadoz2000gesture]. With the rapid progress of digital technology, it became possible to separate human movement from the direct sound formation [bongers2000physical, paradiso1997electronic]. Although it expands the possibilities for artists to express themselves through custom sound waveforms, this leap has diminished the necessity for the performer’s body movements; thus it has reduced the capacity for musical and social interaction between a performer and an audience [ferreira2008sound, stuart2003object, schloss2003using].

One of the novel tendencies among digital instrument designers and artists, in the digital music act, is to increase somatic and corporeal presence [tanaka2000musical] and induce a rousing connection between movement and sound production. These modern music performances consist of exaggerated gestures that are suitable for interfacing a digital controller [winkler1995making]. One of the most important requirements for creating an emphatic connection between a performer and an audience is a clear and coherent process of movement-based sound production [mitchell2011soundgrasp, wang2008chuck, bevilacqua2007wireless, iazzetta2000meaning, choi1998motion, goldstein1998gestural]. [bahn2001physicality] have emphasised that this principle should be met regardless of the instrument’s construction, incorporated technology, and preferred playing style. Although the number of contemporary (body movement orientated) digital instruments and scientific publications related to them is rapidly increasing, there is a lack of systematic approaches for defining the methods used to detect and classify human movements designated as digital instrument input.

In this work, we have designed a digital instrument that is driven by body movements. The instrument comprises several sensors (accelerometers and gyroscopes) integrated into small wireless units positioned on various body parts. The system executes real-time movement classification, using individual movement trajectories, where each gesture corresponds to a sound sample from the audio database. We named this digital music instrument “Movezik”. We coupled the creation of this performance-oriented digital instrument with the development of novel human-machine interaction algorithms capable of establishing a causal link between human action and machine generated effect. We have formulated the classification algorithm as an online Bayesian classifier, in which we use a hierarchical hidden Markov vector autoregressive process (VAR-HHMM) [ephraim2002hidden, yang2000some] to model the dynamics of the real-time recordings of movement trajectories. We estimated the free parameters of the dynamical models, which approximate the movement dynamics, using a combination of two often used parameter estimation algorithms, the Expectation Maximization (EM) [north1998learning] and the Viterbi algorithm [logothetis1999expectation, forney1973viterbi]. Importantly, we compared the performance of VAR-HHMM with the k-nearest-neighbour (KNN) classifier, which showed the highest classification accuracy on the recorded movement data among several classification algorithms, such as a neural network classifier [bishop1995neural], quadratic and linear discriminant classifiers [cacoullos2014discriminant, lachenbruch1979discriminant], and support vector machines [meyer2015support, du2014support].

We quantified the classification performance of the two algorithms (KNN and VAR-HHMM) using well-established classification metrics [davis2006relationship, fawcett2006introduction], such as confusion matrix, precision score (PS), and recall score (RS). Also, to qualitatively evaluate the algorithms’ performance, we have designed and conducted a behavioural experiment. In the experiment, the human participants were asked to rate synchronisation levels of a pre-recorded movement and a short sound that was delayed (for a randomised amount of time) relative to the movement onset. We used the collected ratings to derive a confidence interval of acceptable latencies, which we have set as the upper bounds on the algorithms’ performance.

Besides the direct implementation of this setup to a movement-sound mapping, the classification method presented here can also be applied to general human-machine interfaces that require a variety of highly synchronised movement triggered commands [haptic]. Specifically, the core algorithm can be implemented as the link between continuous actuators (human commands) and an execution of discrete robotic actions, e.g. in haptic robots, exoskeletons, artificial limbs, and wheelchairs.

II Methods

II-A Hardware system and movement recordings

The Movezik hardware system comprises up to five sensorized wireless nodes connected to a multi-platform master application. Each node acquires data from an inertial measurement unit (IMU) with a three-axis accelerometer and a three-axis gyroscope. The IMU’s data was sampled at fixed frequency of and amplitude resolution of bits, whereas the sensitivity of the accelerometers and the gyroscopes was fixed to a predefined range. We have set the accelerometer’s range to (where denotes the value of the nominal gravitational acceleration) and the gyroscope’s range to . By attaching this wireless inertial sensor device (WISD) to an arbitrary body part (e.g. hands, legs, head, etc.) we can map the movements of that body part into the 6D state space defined by the IMU sensors.

We have recorded four different data sets in total using a WISD attached to the right hand (see Tab. I for details). Three of the four data sets consist of two complementary hand movements separated by resting periods (a stable hand position). The fourth data set consists of intermittent circular hand movements. An example of such a circular hand movement and the attached WISD is shown in Fig. 1.

Figure 1: Subjective movement-to-sound criterion (SMtSC) measurement setup. In each trial, sound sample is played in a pseudo-random manner in the range frames with respect to the onset of the clockwise circular hand movement. The wireless IMU node of the Movezik system is positioned on the wrist of the right hand.

II-B Subjective movement-to-sound criterion (SMtSC) table

The amount of acceptable latency between human kinetic action and associated sound is the main evaluation criterion for the performance of the proposed VAR-HHMM algorithm. Understanding and quantifying human perception of the movement-sound correlation is an interesting topic, which could also be useful for multimedia artists and engineers involved in similar projects. To quantify human perception we have designed a behavioral experiment in which we recorded participants’ estimates of a causal link between a movement and a lagged sound [vatakis2006audiovisual, zampini2005audio].

To exclude participants relating sound to arbitrary segments of the movement (classifying the end or an intermediate state of the movement as synchronized with sound onset) we used as a visual stimulus a recording of a continuous and smooth circular arm movement that was paired with an abbreviated sound sample. The duration of the clockwise circular arm movement was around and was preceded and succeeded by shorter periods of resting hand position (see Fig. 1). The video of the circular movement was recorded with a high-speed camera at , and later down-sampled to for displaying on regular monitors. The total duration of the recording was set to be ( of steady periods before and after the long circular movement). The sound sample was synthesized from a single digital piano tone () and additionally shortened by multiplying the recorded sound with the function, resulting in the short sound that has more than of its power within . The exact moment of the movement onset was identified by visual inspection of the high-speed video. The sound latency marker (delay between the movement onset and the sound onset) was defined using the visually determined movement onset marker.

To automate the experimental procedure we have developed a Python script based on the PsychoPy application [peirce2007psychopy] that generated sample videos (trials) with pseudorandom sound latencies ranging from to frames ( to ), looping each trial until the subject’s decision and logging the response and decision time for each trial. The order of trials was the same for each subject, predefined as a random combination of trials ( of them) repeated three times (in total trials per subject). The experimental design with three repetitions of the complete trial sets allowed us to: (i) exclude participants with high variability of responses to the same trial (inconsistent participants); (ii) estimate the latency that defines a boundary between trials classified as synchronous and asynchronous.

Movement-to-sound synchronization was estimated based on responses from participants of different ages ( years). During the measurement protocol we used a laptop PC with an LCD monitor (size inch and resolution x) positioned at eye level, approximately 1 m away from the subject. Subjects received the sound sample through in-ear headphones at a self-preferred volume and were instructed to press the left arrow key to rate a trial as synchronous and the right arrow key to rate a trial as asynchronous.

II-C Offline classification algorithm

The expert classification (EC) method is an offline, threshold based algorithm combined with a check by a human expert. The threshold based algorithm follows the conventional two threshold method (Fig. 2). It consists of an upper threshold for reliable movement detection and a lower threshold for detection of the movement onset. The algorithm proceeds as follows: when a signal exceeds the upper threshold the algorithm searches the previous samples until the lower threshold is reached; the sample at which the lower threshold is reached is stored as the axis movement onset marker. Similarly, the axis movement end is marked at the first point after the upper threshold crossing where the signal falls back to the lower threshold. The algorithm for movement detection uses three parallel loops for the three axes of the rectified gyroscope signal. The upper threshold is set to of the maximal gyroscope value while the lower threshold is set to of the upper threshold. For the EC movement onset, the first of the three axis movement onset markers from the 3 gyro axes is considered, and for the EC movement end the last of the three axis movement ends is considered. Following the automated detection of movement, the human expert manually inspects detection and classification to exclude false positives and false negatives.
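The automated part of the two threshold procedure can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the thresholds are passed in as absolute values chosen by the caller, and the input is assumed to be one already rectified gyroscope axis.

```python
def detect_movements(signal, upper, lower):
    """Return (onset, end) index pairs for one rectified gyroscope axis.

    `upper` and `lower` are absolute thresholds supplied by the caller
    (hypothetical parameters; no specific fractions are assumed here).
    """
    segments = []
    n = len(signal)
    i = 0
    while i < n:
        if signal[i] > upper:
            # Search backwards until the signal is at or below the lower threshold.
            onset = i
            while onset > 0 and signal[onset] > lower:
                onset -= 1
            # Search forwards until the signal falls back to the lower threshold.
            end = i
            while end < n - 1 and signal[end] > lower:
                end += 1
            segments.append((onset, end))
            i = end + 1  # continue after the detected movement
        else:
            i += 1
    return segments


def combine_axes(per_axis):
    """EC rule: earliest onset and latest end across the gyroscope axes."""
    onsets = [onset for onset, _ in per_axis]
    ends = [end for _, end in per_axis]
    return min(onsets), max(ends)
```

A raw axis would first be rectified, for example with `[abs(v) for v in raw]`, and the per-axis results for one movement passed to `combine_axes`. The manual expert check that follows the automated step is not modelled.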

Figure 2: Illustration of the offline classification algorithm with the two threshold method. Algorithm moves sample by sample until the upper threshold is exceeded (red dot); then, it cycles back until the value of the signal goes below of the lower threshold. The time point related to this value (green dot) is set as the movement onset.

II-D Segmentation of the movement recordings using the expert classifier

The recordings that we used to test and compare the movement classifiers consisted of four separate data sets. Three of the four data sets combined recordings of two different movements separated by periods in a rest position at the end of the movement. We have recorded the following movements: (i) left-right hand movements (), (ii) up-down hand movements (), (iii) circular hand movements (; see Fig. 1), and (iv) supination-pronation of the hand wrist (; the elbow was kept in a stable position). The recordings of the hand resting in various positions at the end of the movements were labeled as a single movement .

The summary of several properties of the four data sets is provided in Table I. The number of movements in the last column of the table denotes the total number of recorded repetitions of movements within each data set. Hence, for the data sets consisting of two types of movements (e.g. left-right), each movement was repeated for half of the total number of repeated movements. Note that although the data set consisting of circular movement is significantly shorter than the rest, the total number of the recorded data points within the movement is roughly the same as in the other data sets, as the duration of circular movement is longer than that of other movements.

Prior to training the KNN and VAR-HHMM classifiers (see below) all the data sets were labeled using the labels provided by the offline classification procedure described in the previous subsection. Half of the recorded data from each data set was used for training the online classifiers and the other half for testing the performance of the two online algorithms.

Data set     Number of data points   Number of movements
left-right   12800                   20
up-down      15000                   20
circular     3188                    6
rotational   14201                   20
Table I: The summary of the recorded data sets.

II-E K-nearest neighbors classification algorithm

Due to its simplicity, easy implementation and widespread use for classification purposes, the weighted k-nearest neighbor (KNN) [hechenbichler2004weighted] algorithm was selected as the gold standard, out of the several widespread classifiers, for the time domain signals. To standardize optimization and test procedures we used the scikit-learn Python library [scikit-learn]. We fixed the number of neighbours for the KNN classifier to , based on the optimisation procedure that maximised the classification accuracy over the four data sets.
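To make the idea of a weighted neighbour vote concrete, here is a dependency-free sketch of distance-weighted KNN for a single query sample. It is an illustration only: the inverse-distance weighting and the function name are assumptions, and the classifier in the text was built with scikit-learn rather than with code like this.

```python
import math
from collections import defaultdict


def weighted_knn_predict(train_x, train_y, query, k):
    """Distance-weighted k-nearest-neighbour vote for a single sample."""
    # Keep the k training samples closest to the query (Euclidean distance).
    neighbours = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )[:k]
    votes = defaultdict(float)
    for distance, label in neighbours:
        # Inverse-distance weight (an assumed weighting scheme);
        # the small constant avoids division by zero for exact matches.
        votes[label] += 1.0 / (distance + 1e-12)
    return max(votes, key=votes.get)
```

For the 6D IMU samples described earlier, each element of `train_x` would be a six-component tuple and each label the movement assigned by the offline expert classification.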

II-F Vector Autoregressive Hierarchical Hidden Markov classifier

The vector autoregressive Hierarchical Hidden Markov classifier (VAR-HHMM) is based on a probabilistic generative model. The main assumption of the model is that the movement trajectories recorded using the Movezik system can be modeled with a piecewise linear dynamical system. In other words, we assumed that each movement can be split into segments, where the dynamics of the recorded signal within each segment can be captured by a vector autoregressive (VAR) model. Furthermore, the transition between the segments is stochastic and captured by a Hidden Markov Model (HMM). Similarly, the transition between different movements is captured by an additional Hidden Markov layer that sits at the top of the hierarchy. Half of the recorded and pre-labeled data sets was used for training (fitting the free parameters of the generative model) and the other half for testing the online classification capabilities of the inverted model. The online classification of the recorded signals was performed using an approximate Bayesian inference procedure described below, where the label of the current data point corresponds to the movement with the highest posterior probability.

II-F1 Generative model

The general form of a vector autoregressive model is defined as

(1)

where denotes the recording obtained at time step , and denotes an i.i.d. random normal variable with mean zero. The parameters of the AR() model that define the transition from previous to current measurements are denoted with , and the captures the constant signal value. Importantly, the superscripts , and denote the th segment of the th movement. Note that we have assumed that each movement can be split into at most five linear segments.
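For orientation, a VAR model of order $p$ for segment $k$ of movement $m$ is commonly written in the following form. The symbols ($x_t$, $A_i$, $b$, $\varepsilon_t$, $\Sigma$, and the superscript layout) are assumed notation chosen for this illustration and may differ from the notation of the original equation.

```latex
x_t = b^{(m,k)} + \sum_{i=1}^{p} A_i^{(m,k)} \, x_{t-i} + \varepsilon_t^{(m,k)},
\qquad
\varepsilon_t^{(m,k)} \sim \mathcal{N}\!\left(0,\, \Sigma^{(m,k)}\right)
```

Here $x_t$ would be the 6D IMU recording at time step $t$, the matrices $A_i^{(m,k)}$ map previous recordings to the current one, and $b^{(m,k)}$ is the constant offset.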

Thus, we can write the observation likelihood as

The transition matrix of movement segments for the th movement is given as

(2)

where denotes the prior probability of the th movement segment of the th movement. Similarly, the transition matrix between different movements is given as

(3)

where . Hence, we can write the full generative model as

(4)

II-F2 Online Bayesian inference

Using the above described generative model of the recorded data (Eq. 4) we can estimate the posterior movement probability at time step as

(5)

where denotes the prior distribution over movements and movement segments , and denotes the normalization constant, which ensures that . The prior distribution is estimated using the posterior distribution at the previous time step and the transition matrices of movements and movement segments (see Eq. 3 and Eq. 2), hence

(6)

Using the posterior movement probability we define the label of the recorded signal at time step as

(7)
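The predict, update, and label steps described in this subsection can be sketched as a recursive filter. The sketch makes several simplifying assumptions that are not taken from the source: the hierarchy is flattened into joint (movement, segment) states with a single joint transition table, each state uses a scalar AR(1) model instead of a 6D VAR model, and all names and parameter values are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def filter_step(posterior, transition, params, x_prev, x_now):
    """One predict-update step over joint (movement, segment) states.

    posterior  : dict state -> probability at the previous time step
    transition : dict (from_state, to_state) -> probability
    params     : dict state -> (a, b, var) of a scalar AR(1) model
    """
    states = list(posterior)
    # Predict: push the previous posterior through the transition model.
    prior = {
        s: sum(posterior[r] * transition.get((r, s), 0.0) for r in states)
        for s in states
    }
    # Update: weight each state by the likelihood of the new observation.
    unnorm = {}
    for s in states:
        a, b, var = params[s]
        unnorm[s] = prior[s] * gaussian_pdf(x_now, a * x_prev + b, var)
    z = sum(unnorm.values())  # normalisation constant
    return {s: v / z for s, v in unnorm.items()}


def movement_label(posterior):
    """Marginalise over segments and return the most probable movement."""
    marginal = {}
    for (movement, _segment), p in posterior.items():
        marginal[movement] = marginal.get(movement, 0.0) + p
    return max(marginal, key=marginal.get)
```

In use, `filter_step` would be called once per incoming sample, feeding its output back in as `posterior`, and `movement_label` would give the label for that sample. A practical implementation would also guard against `z` underflowing to zero, for instance by working with log-probabilities.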

II-G Estimation of model parameters

The accuracy of the above defined classification procedure is fully dependent on the values of the free parameters of the generative model. The parameters that define the VAR models and the transitions between the movement segments were estimated using a Viterbi algorithm [logothetis1999expectation, forney1973viterbi] combined with the Expectation Maximisation (EM) [north1998learning]. Importantly, using the Bayesian Information Criterion (BIC) [watanabe2013widely] we have determined that the time lag of the VAR model that provides on average the highest model evidence (highest BIC estimate) for the recorded data corresponds to one. Hence, all the movements were fitted with the set of five VAR(1) models.

The fitting procedure of the Viterbi EM method goes as follows: (i) sample initial parameter values from a standard multivariate normal distribution and set to the identity matrix; (ii) use the Viterbi algorithm to estimate the most likely sequence of segments over the training data for the th movement; (iii) using the most likely sequence of segments estimate the elements of transition matrix of movement segments , and prior segment distribution ; (iv) re-estimate the parameters of the VAR models (, , and ) using a maximum likelihood estimate, and the most likely segmentation of the training data provided by the Viterbi algorithm; (v) repeat the procedure from step (ii) until the convergence of the log-likelihood . We have used the implementation of the Viterbi EM algorithm provided by the pyhsmm-autoregressive Python library [johnson2012].
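Step (ii) of the procedure, finding the most likely segment sequence, is a standard Viterbi decode. The following is a generic log-domain sketch with hypothetical argument names; it is not the pyhsmm-autoregressive implementation, and it takes the per-sample log-likelihoods under each segment's VAR model as a precomputed input.

```python
def viterbi(log_lik, log_trans, log_prior):
    """Most likely segment sequence for one movement.

    log_lik[t][k]   : log-likelihood of observation t under segment k
    log_trans[j][k] : log-probability of moving from segment j to segment k
    log_prior[k]    : log-probability of starting in segment k
    """
    n_states = len(log_prior)
    score = [log_prior[k] + log_lik[0][k] for k in range(n_states)]
    back = []  # back[t-1][k] = best predecessor of state k at time t
    for t in range(1, len(log_lik)):
        pointers, new_score = [], []
        for k in range(n_states):
            best = max(range(n_states), key=lambda j: score[j] + log_trans[j][k])
            pointers.append(best)
            new_score.append(score[best] + log_trans[best][k] + log_lik[t][k])
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

The returned path is the hard segmentation that steps (iii) and (iv) would then use to re-estimate the transition matrix and the VAR parameters.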

As the above fitting procedure does not ensure convergence to the global optimum, we have repeated the estimation procedure for different initial conditions (step (i) of the algorithm), and estimated the log-likelihood over the training data set at the end of each repetition. In the end we kept only the parameter values that correspond to the run with the highest log-likelihood estimate.

The prior movement probability was set to , and as the testing data always started from the resting position. Similarly, we have used the knowledge that each movement in the testing data set will be either preceded or succeeded by a resting period, to define the movement transition matrix as follows

where , and (total number of different movements across all data sets).

II-H Evaluation of the performance of online classifiers

We have evaluated the effectiveness of the KNN and VAR-HHMM algorithms using a quantitative and a qualitative analysis:

  • For the quantitative analysis we estimated the movement onset detection error and movement classification error of the proposed algorithm in comparison to the KNN algorithm. The EC algorithm served as a benchmark for movement onset detection and movement classification evaluation. The quantitative analysis of classification comprised confusion matrix, precision score and recall score for the VAR-HHMM and the KNN algorithm using the data that was pre-labeled with the offline EC algorithm.

  • For the qualitative analysis we estimated the detection of the movement onset and compared it with the time window derived from the SMtSC table. For the movement onset detection we estimated the time difference between the markers labeled by the VAR-HHMM and the KNN algorithms with respect to the EC offline markers. Furthermore, we imposed additional delay constraints to account for possible hardware latencies (Bluetooth, sound card and buffers).

To derive the time window from SMtSC data, we have first excluded subjects that had inconsistent responses for more than of trials. Using this criterion we excluded 4 subjects in total. We then encoded the subject responses (synchronous or asynchronous ratings) as if they rated the trial as synchronous and if they rated the trial as asynchronous. This encoding enabled us to sum the individual responses over all trials to estimate the rating distribution. Based on this distribution we were able to calculate the range of acceptable sound desynchronization with respect to movement onset.

III Results

Figure 3: Snapshots of the recorded signal from all data sets. (Top row) Movement classified using the offline EC algorithm. (Middle row) Movement classified using the online KNN algorithm. (Bottom row) Movement classified using the online VAR-HHMM algorithm. The red colored regions denote the data points which the corresponding algorithm misclassified with respect to the benchmark classification of the offline EC algorithm.

III-A Quantitative analysis

We start the analysis by illustrating the typical difference between the EC, KNN and VAR-HHMM algorithms. We show in Fig. 3 the long snapshots from four different data sets of recorded movements. The coloured regions in the top row of the graph show the benchmark labels that we determined with the expert classifier. The red shaded areas in the other two rows of the graph show the data points misclassified by either the KNN (middle row) or the VAR-HHMM algorithm (bottom row). One can notice that the biggest difference between the two algorithms is in the case of the circular movements (the rightmost column); the KNN classifier makes a significant number of misclassifications over the course of the movement, while VAR-HHMM keeps a proper movement model throughout the snapshot, with only a few misclassified samples during transitions between movement segments. Hence, as a next step, we compare the two algorithms in more detail using various classification metrics.

To evaluate the classification performance of the two algorithms across all recorded data sets we have estimated the confusion matrices, precision scores and recall scores. The overall classification performance of the KNN classifier is shown in Fig. 4(a) and of the VAR-HHMM classifier in Fig. 4(b). In ideal conditions, all the time points would be perfectly matched to the labelling obtained with the expert classifier and one would observe nonzero values only on the diagonal of the confusion matrix. However, due to the intrinsic variability of human movements, in practice, we find that a certain number of data samples are misclassified (non-zero off-diagonal elements in the two confusion matrices).

(a) KNN
(b) VAR-HHMM
Figure 4: The confusion matrix and the related classification metrics for the two online classifiers.

To estimate the average, per movement, classification accuracy we have estimated the precision scores and the recall scores, which are shown on the top and right graphs, respectively, of Fig. 4(a) and Fig. 4(b). The precision score measures the fraction of data points that were correctly classified with respect to all the samples that are assigned to the particular movement. Similarly, the recall score measures the fraction of data points that were correctly classified with respect to all the data points that truly belong to the specific movement label (as defined with the expert classifier). From Fig. 4(a) we can see that most of the KNN classification errors come from classifying a movement () as a resting state . Also, the circular movement is often misclassified as a left-right or an up-down movement as it incorporates fragments of those movements during its execution. In contrast, the VAR-HHMM classifier significantly improves recall scores (it correctly assigns a data point to its true movement label; see Fig. 4(b)) with a slight decrease in the precision scores due to a more frequent labelling of the rest samples as movement samples.
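The two definitions translate directly into code. The sketch below assumes a confusion matrix stored as a list of rows, with rows indexed by the expert (reference) label and columns by the classifier output; this orientation and the function name are assumptions made for illustration.

```python
def precision_recall(confusion):
    """Per-class precision and recall from a square confusion matrix.

    confusion[i][j] counts samples whose reference label is i and whose
    predicted label is j (rows: expert labels, columns: classifier output).
    """
    n = len(confusion)
    precision, recall = [], []
    for c in range(n):
        true_pos = confusion[c][c]
        predicted = sum(confusion[r][c] for r in range(n))  # all samples assigned to c
        actual = sum(confusion[c])                          # all samples truly in c
        precision.append(true_pos / predicted if predicted else 0.0)
        recall.append(true_pos / actual if actual else 0.0)
    return precision, recall
```

With this orientation, a classifier that labels rest samples as movement lowers the precision of the movement class, while one that labels movement samples as rest lowers its recall.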

Figure 5: Histograms of classification lags. (Top row) KNN classifier, (bottom row) VAR-HHMM classifier. (Left column) Movement onset, (right column) movement end. Detections are labeled by different colors. It can be noted that both algorithms classify up-down movement onsets early and reliably. Left-right movement onsets present a much easier task for VAR-HHMM than for KNN, resulting in fewer outliers, smaller variability and earlier detections. The most significant difference between the algorithms is in the case of circular movement onset, which is robustly detected by the VAR-HHMM but very poorly by the KNN.

To better understand why movement and resting related data samples are confused, we have looked at the properties of movement onset and movement endpoint detection. In Fig. 5 we show the histograms of the classification lags (the time separation between the detected and true start and end points of the movements). From the movement onset histogram, we can infer that some of the misclassified resting samples can be attributed to overly early detection of the movement onset, and some of the misclassified movement samples to early detection of the movement endpoints. Notice that the VAR-HHMM classifier registers movement onset on average earlier than the KNN classifier. Similarly, the KNN classifier registers the movement endpoint on average earlier than the VAR-HHMM classifier. This early detection of movement onset and endpoint explains why the VAR-HHMM classifier shows more resting samples classified as movement and why the KNN classifier shows more movement samples labelled as rest samples.
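A classification lag of this kind can be computed from two label sequences, one from the reference (EC) and one from an online classifier. The sketch below is one possible way to do it: matching each reference onset to the nearest predicted onset is an assumption, as are the function names and the `"rest"` label string.

```python
def onsets(labels, rest="rest"):
    """Indices where the label changes from rest to any movement."""
    return [
        t for t in range(1, len(labels))
        if labels[t - 1] == rest and labels[t] != rest
    ]


def onset_lags(reference, predicted, rest="rest"):
    """Lag (in samples) of the nearest predicted onset for each reference onset.

    Negative values mean the classifier fired before the reference marker.
    """
    predicted_onsets = onsets(predicted, rest)
    lags = []
    for ref in onsets(reference, rest):
        if predicted_onsets:
            nearest = min(predicted_onsets, key=lambda p: abs(p - ref))
            lags.append(nearest - ref)
    return lags
```

Endpoint lags follow the same pattern with the transition test reversed (movement followed by rest). Dividing a lag in samples by the sampling frequency gives the lag in seconds.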

Finally, to compare the stability (the ability to maintain the active movement model in the presence of perturbations) and the flexibility (the transition to the actually active movement model without inherent time lag) of the two algorithms we have estimated the duration of misclassified blocks of samples. The histogram of the durations of misclassified samples is shown in Fig. 6. For sound orientated detections, the worst case scenario is switching to a false model and keeping it for a long time. Importantly, in both algorithms, the number of falsely labelled blocks of samples declines exponentially with their duration. Although it appears that the VAR-HHMM algorithm has on average shorter duration of misclassified blocks, the difference is subtle (approx. ) and non-significant as we cannot reject the null hypothesis ( , the two tailed F-test).
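The durations of misclassified blocks are run lengths of consecutive disagreements between the reference labels and the classifier output. A minimal sketch, with a hypothetical function name, is:

```python
from itertools import groupby


def misclassified_block_durations(reference, predicted):
    """Lengths of consecutive runs where the prediction differs from the reference."""
    errors = [r != p for r, p in zip(reference, predicted)]
    # groupby yields maximal runs of equal values; keep only the error runs.
    return [sum(1 for _ in run) for is_error, run in groupby(errors) if is_error]
```

The length of the returned list is the number of misclassified blocks and its sum is the number of misclassified samples, which are the two quantities contrasted for KNN and VAR-HHMM.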

Figure 6: Histograms of durations of misclassified segments for the two types of classifiers. The main 4 (first 4) components of the KNN histogram contain , , , percentages of total errors in classification. These percentages for the VAR-HHMM are: , , , . Although the numbers of misclassified segments are similar for both algorithms (174 for KNN and 166 for VAR-HHMM), the total number of misclassified samples is significantly higher () for KNN due to the greater dispersion of segment durations.

III-B Qualitative analysis

Although the classification metrics allow us to compare the performance of the two classifiers directly, they tell us little about their performance in real world applications. In other words, is the classification good enough to be useful in the Movezik system for movement to sound mapping? To test this, we used the experimental data to define the bounds on acceptable real-time classification accuracy. In Fig. 7 we show the histogram of participants’ positive responses (rating movement-sound relationship as synchronised) for various desynchronisation levels. As the Lilliefors’ test cannot reject the null hypothesis that the data follow a normal distribution with the significance level , we have defined the acceptable desynchronisation between movement and sound onset as a range that spans one standard deviation away from the mean ( probability mass interval). Interestingly, this range also matches the range of all lags which more than half of the participants rated positively. Hence, the acceptable movement onset delay spans the interval from -3 frames () up to 13 frames (). Importantly, these values are very close to what others have reported in similar audio-visual synchronisation experiments [vatakis2006audiovisual, zampini2005audio].
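The one standard deviation range can be obtained from the histogram of positive responses by treating each lag as weighted by its count. The sketch below uses simple moment estimates of the mean and standard deviation, which is an assumption about how the normal density was fitted; the function name and the example numbers in the usage note are hypothetical.

```python
import math


def acceptable_lag_range(lags, positive_counts):
    """Mean +/- one standard deviation of lag, weighted by positive ratings."""
    total = sum(positive_counts)
    mean = sum(l * c for l, c in zip(lags, positive_counts)) / total
    var = sum(c * (l - mean) ** 2 for l, c in zip(lags, positive_counts)) / total
    std = math.sqrt(var)
    return mean - std, mean + std
```

For example, `acceptable_lag_range([-2, -1, 0, 1, 2], [1, 2, 4, 2, 1])` returns a range symmetric around zero, since the hypothetical counts are symmetric. Lags are in frames; converting to time requires the video frame rate.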

Figure 7: Histogram of the total number of positive responses with respect to the sound desynchronisation. Blue bars dots denote the total number of times that the participants rated positively the relation between the movement and the lagged sound for a given level of desynchronization. The red line shows the fitted normal density ( frames, frames) multiplied by the total number of positive responses . The black dashed line shows the level corresponding to half of the total number of participants.

In addition to the experimentally derived latency threshold, we have also considered hardware and software limitations of the state-of-the-art devices and platforms. While Bluetooth latencies across different devices are unified, ranging from to , latencies originating from sound card responses fall in a range from less than in the case of PC, up to in some Android devices [superpower] (Android latency issue). With this in mind, we reduced the acceptable detection lag range for different hardware and software platforms up to in the case of Android devices with high sound latencies (Fig. 8). We find that both algorithms had only one (out of 126 movements) perceivably delayed movement onset detection for the PC implementation, which corresponds to zero hardware latency. For the iOS based implementation the acceptable delay is reduced by which leads to two (out of 126) perceivably delayed detections for both algorithms. The performance of the two algorithms starts to diverge only after decreasing the range of acceptable latencies to include Android based implementations. The VAR-HHMM classifier outperformed the KNN classifier for both fast Android devices ( delayed detections by KNN vs delayed detections by VAR-HHMM) and slow Android devices ( delayed detections by KNN vs delayed detections by VAR-HHMM). These results imply that further improvements to the VAR-HHMM classifier will be required for applying the proposed Movezik system on slow Android devices.

Figure 8: The total number of movements that were classified correctly only after the given time threshold. The vertical dashed lines denote the lag threshold above which the movements and sounds would be perceivably desynchronised on a specific hardware-software platform.
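The counting behind this comparison can be illustrated with a minimal sketch. It assumes that the platform latency is simply subtracted from the perceptual threshold, and that a detection is perceivably delayed when its lag exceeds the remaining budget. The function name, the platform names, and all numbers below are hypothetical and chosen only for illustration; they are not the measured values.

```python
from typing import Dict, Sequence


def count_delayed(onset_lags_ms: Sequence[float],
                  perceptual_threshold_ms: float,
                  platform_latency_ms: float) -> int:
    """Count detections whose lag exceeds the latency budget left by the platform."""
    # The platform latency (wireless link plus sound card) eats into the
    # perceptual threshold, leaving a smaller budget for the classifier.
    budget_ms = perceptual_threshold_ms - platform_latency_ms
    return sum(1 for lag in onset_lags_ms if lag > budget_ms)


# Hypothetical onset detection lags (ms) and platform latencies (ms).
lags = [20.0, 45.0, 80.0, 130.0, 210.0]
platforms: Dict[str, float] = {"pc": 0.0, "mobile_fast": 80.0, "mobile_slow": 150.0}

delayed = {name: count_delayed(lags, 200.0, latency)
           for name, latency in platforms.items()}
print(delayed)
```

With these made-up values the number of perceivably delayed detections grows as the platform latency increases, which is the qualitative pattern shown in Fig. 8.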

IV Discussion

Here we presented a new sensorised hardware platform coupled with a novel methodology for learning dynamical models of movement trajectories. This methodology is suitable for fast movement onset detection and movement classification. The custom-made hardware (Movezik) is based on a wireless IMU sensor suitable for human motion tracking, while the key control algorithms are platform independent. The proposed movement classification algorithm is based on vector autoregressive (VAR) processes coupled with hierarchical hidden Markov models (HHMM).
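To give an intuition for how a bank of linear dynamical models can separate movements, the following sketch classifies a short window of samples by the one-step prediction error of several first-order VAR models. This is a deliberately simplified stand-in and not the approximate Bayesian inference used by the VAR-HHMM: there is no hierarchy, no transition model, and no noise model. The coefficient matrices, the two-dimensional signal, and the model names are assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def var1_predict(a: Matrix, x_prev: Vector) -> List[float]:
    """One-step VAR(1) prediction x[t] = A x[t-1] (no intercept, for brevity)."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x_prev)) for row in a]


def prediction_error(a: Matrix, window: Sequence[Vector]) -> float:
    """Mean squared one-step prediction error of model A over a window."""
    total = 0.0
    for x_prev, x_next in zip(window, window[1:]):
        pred = var1_predict(a, x_prev)
        total += sum((p - x) ** 2 for p, x in zip(pred, x_next))
    return total / (len(window) - 1)


def classify(window: Sequence[Vector], models: Dict[str, Matrix]) -> str:
    """Return the name of the model that predicts the window best."""
    return min(models, key=lambda name: prediction_error(models[name], window))


def rotation(theta: float) -> List[List[float]]:
    """2-D rotation matrix, used here as a toy model of a circular movement."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


# Hypothetical model bank: two circular movements and a resting state.
models = {
    "circle_ccw": rotation(0.2),
    "circle_cw": rotation(-0.2),
    "rest": [[1.0, 0.0], [0.0, 1.0]],
}

# A noiseless counter-clockwise arc, 20 samples long.
window = [[math.cos(0.2 * t), math.sin(0.2 * t)] for t in range(20)]
label = classify(window, models)
print(label)
```

In this toy setting each model is scored only by how well it predicts the next sample, so a window is assigned to the model whose dynamics match it; the full method additionally infers when the active model switches.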

As the main goal of the presented system is to interconnect human movements with sounds, the movement detection algorithm should provide fast and reliable detection of movements that can be translated into sound in real time, in order to create a clearly perceivable correlation between a movement and a sound. To evaluate the qualitative and quantitative performance of the algorithm we used two benchmarks, the k-nearest neighbours (KNN) classifier and an offline expert classifier (EC).

The basic quantitative evaluation criterion was the movement classification error, which was calculated with respect to the offline EC and benchmarked against the common KNN method. The test was performed on the recorded sequence comprising arm movements that are grouped into distinct movements, plus segments of resting arm positions between the movements. Overall, the VAR-HHMM and KNN algorithms produce similar classification outcomes, with KNN exhibiting on average higher precision scores and VAR-HHMM outperforming KNN in class recall for all but one movement (compare Fig. (a) and Fig. (b)). These differences between the two online classifiers can be explained by the fact that VAR-HHMM detects movement onsets slightly before the EC standard, whereas the KNN classifier detects the end of the movements before the EC standard (see Fig. 5). In addition, both algorithms produce similar distributions of the duration of misclassified blocks of data (see Fig. 6). As the majority of misclassified blocks of data last only a couple of samples, it is possible to impose an additional switching criterion that trades off movement onset detection latency against the misclassification count.
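One possible form of such a switching criterion is sketched below: the reported label changes only after the raw classifier output has stayed the same for a minimum number of consecutive samples. This is an illustrative assumption and not a criterion evaluated in this work; the function name and the `min_run` parameter are hypothetical.

```python
from typing import Hashable, Iterable, List, Optional


def debounce_labels(labels: Iterable[Hashable], min_run: int) -> List[Hashable]:
    """Report a new label only after `min_run` identical consecutive raw labels."""
    out: List[Hashable] = []
    current: Optional[Hashable] = None    # label currently reported
    candidate: Optional[Hashable] = None  # label of the ongoing run
    run = 0                               # length of the ongoing run
    for label in labels:
        if label == candidate:
            run += 1
        else:
            candidate, run = label, 1
        # The very first sample is accepted immediately; later switches
        # require the candidate to persist for `min_run` samples.
        if current is None or run >= min_run:
            current = candidate
        out.append(current)
    return out


raw = ["rest", "rest", "rest", "up", "rest", "rest", "up", "up", "up", "up"]
smoothed = debounce_labels(raw, min_run=3)
print(smoothed)
```

In this example the isolated one-sample "up" block is suppressed, while the genuine onset is reported `min_run - 1` samples later than in the raw output, which makes the trade-off between misclassification count and onset latency explicit.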

Importantly, we find several advantages of VAR-HHMM algorithm compared to the KNN algorithm:

  • VAR-HHMM more accurately classifies complex movement trajectories (circular movement of the arm; see Fig. 4) that can be decomposed into simpler movement modules. While KNN assigns parts of the complex movement to similar simpler movements, VAR-HHMM detects and holds the correct model state throughout the circular movement (with only occasional model switching; see Fig. 3). This feature makes VAR-HHMM advantageous in applications with a large number of different movements, including complex limb trajectories.

  • VAR-HHMM exhibits fewer outliers during the detection of movement onset and movement termination (see Fig. 5). We consider these to be hard errors, as they appear completely uncorrelated with the movement transitions.

  • VAR-HHMM exhibits a smaller dispersion of the detected movement onset around the EC standard compared to the KNN algorithm. This difference becomes pronounced when the range of acceptable latencies is narrowed, which results in a steeper increase of detection errors for KNN.

  • VAR-HHMM shows potentially better performance than the offline EC algorithm, as a fraction of movement onsets is detected before (see range in Fig. 5) the time stamp provided by the EC algorithm.

These advantages of the VAR-HHMM algorithm are also reflected in the qualitative analysis of the performance of the two algorithms. The qualitative analysis was based on the subjective movement-to-sound criterion (SMtSC), which was obtained from a behavioural experiment in which we asked participants to rate synchronisation levels between movements and sounds. The SMtSC measure suggests that the experimental trials with onset lag smaller than are perceived as synchronous. Using this threshold of subjectively acceptable lags for the detection of movement onset, we find that both algorithms detect movement onset before this threshold for more than of movements (each had one detection outside this range). Furthermore, when accounting for an additive lag originating from the common wireless protocols and from latencies related to the execution of instructions between the application and the sound card driver (processing lag due to background processes of a non-real-time operating system was omitted), we find that the percentage of detections above the perceptual threshold remained lower for the VAR-HHMM algorithm than for the KNN algorithm (see Fig. 8).

Besides the advantages in classification performance discussed above, another practical advantage of the VAR-HHMM based algorithm is the ability to train additional movement models without a full recalculation of the model database. This feature makes it more suitable for dynamic interfaces intended for adaptable human-computer interaction with an increasing number of human inputs. In addition, a system based on the VAR-HHMM algorithm could potentially adjust the models' parameters in real time. This is a clear advantage over KNN and similar implementations, which require substantially more processing time to recalculate models. Furthermore, executing the KNN algorithm requires relatively large memory (storing the whole training set for a time-based signal can easily exceed 1 MB) and many processor cycles per obtained sample (KNN needs to calculate a metric for all the samples of the training set), whereas the VAR-HHMM algorithm needs only a fraction of the memory for storing linear models and considerably less processing power. This advantage permits execution of the VAR-HHMM algorithm in real time on the vast majority of microprocessor architectures, even on low-power devices that can be worn on the body during prolonged use.
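A rough back-of-the-envelope comparison illustrates the scale of this difference. The sketch below assumes brute-force KNN over the full training set and a bank of VAR models with an intercept; it ignores the HHMM transition parameters and noise covariances, and all sizes (number of channels, training samples, models, and model order) are hypothetical values rather than those of the presented system.

```python
from typing import Tuple


def knn_cost(n_train: int, n_channels: int, bytes_per_value: int = 4) -> Tuple[int, int]:
    """Memory (bytes) and multiply-adds per sample for brute-force KNN."""
    memory = n_train * n_channels * bytes_per_value
    ops = n_train * n_channels  # one distance term per stored value
    return memory, ops


def var_cost(n_models: int, n_channels: int, order: int,
             bytes_per_value: int = 4) -> Tuple[int, int]:
    """Memory (bytes) and multiply-adds per sample for a bank of VAR models."""
    # Each output channel depends on `order` past vectors plus an intercept.
    coeffs_per_model = n_channels * (n_channels * order + 1)
    memory = n_models * coeffs_per_model * bytes_per_value
    ops = n_models * coeffs_per_model  # one prediction per model
    return memory, ops


# Hypothetical sizes: 6 IMU channels, 50,000 training samples,
# 10 movement models of order 2, 32-bit values.
knn_memory, knn_ops = knn_cost(n_train=50_000, n_channels=6)
var_memory, var_ops = var_cost(n_models=10, n_channels=6, order=2)
print(knn_memory, knn_ops)  # bytes, multiply-adds per sample
print(var_memory, var_ops)
```

Under these assumptions the stored training set exceeds 1 MB, while the VAR coefficients occupy a few kilobytes, and the per-sample arithmetic differs by more than two orders of magnitude.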

Finally, in spite of several advantages of the VAR-HHMM algorithm over established standards such as the KNN classifier, the improvements are still insufficient for immediate real-world applications. The number of false positives and negatives is still too large for applications that require a robust control paradigm, and the lag for detection of movement onset is still too large for the majority of the mid- to low-performance mobile devices that are currently available. In future studies we hope to improve the classification algorithm by testing extensions of the linear VAR models to the nonlinear domain, using nonlinear autoregressive neural networks [billings2013] or convolutional neural networks [Taylor2010]. We expect that improving the dynamical models of movement trajectories will bring further improvements in classification performance.

V Conclusion

Besides the application of the presented system as a movement-to-sound interface, the presented methodology and technology can be extended and applied to other body-to-machine systems. For this to be possible, the classification scheme should eventually achieve fast and reliable detection of movements that can be translated into an arbitrary sequence of commands in real time, creating a perceivable relation between the cause (a movement) and the effect (a command).

In conclusion, the proposed system opens exciting possibilities for novel forms of creative expression and a multitude of still unexplored applications.

Acknowledgment

This work was partly supported by the Ministry of Education, Science and Technological Development of the Republic of Serbia (Project No. OI175016).

References