Recent work has used Inertial Measurement Unit (IMU) sensors in commodity mobile devices to track the movement of human body parts. ArmTrak tracks arm movement, assuming that the body and torso are stationary (Shen et al., 2016). The arm movements are estimated by incorporating the movement constraints of the human joints. Our work also uses IMU sensors in a sports analytics application, but it performs whole-body movement recognition using a single smart watch, which is a harder problem than tracking a single limb.
The sport under consideration is amateur ballroom dancing, which engages tens of thousands of competitors in the U.S. and other countries. Competitors dance at different skill levels, and each level is associated with an internationally recognized syllabus set by the World DanceSport Federation. The syllabus breaks each dance into smaller segments with well-defined body movements. Those segments are called figures. In the waltz, for example, each figure has a length of one measure of the waltz song being danced to; the entire dance is a sequence of 40 to 60 figures. The sequence is random, but the figures themselves are well-defined. The sequence is illustrated in Fig. 1.
The International Standard ballroom dances are a subset of ballroom dances danced around the world, and they include the waltz, tango, foxtrot, quickstep and Viennese waltz. A unique characteristic of all these dances is that the couple is always in a closed-hold, meaning they never separate. Also, both dancers in the couple maintain a rigid frame, meaning the arms and torso move together as one unit. The head and the lower body, however, move independently of that arms-torso unit. Our hypothesis in this paper is that the figures in each of these dances can be recognized with high accuracy using a single smart watch worn by the lead (usually man) in the couple. That is possible because the rigid frame makes it unnecessary to separately instrument the arms and torso, and because most figures are characterized by distinct movements (translations and rotations in space) of the arms and torso. We refer the interested reader to the website www.ballroomguide.com for free videos and details on the various syllabus figures in all the International Standard ballroom dance styles.
In this paper, we validate our hypothesis on the quintessential ballroom dance: the waltz. We chose the 16 waltz figures that are most commonly danced by amateurs. The full names of the figures are included in Appendix A. Our goal is to accurately classify those figures in real time using data from the IMU sensors in a smart watch. The classification results can be pushed to mobile devices in the hands of spectators at ballroom competitions, providing them with real-time commentary on the moves that they have just watched. That is an augmented-reality platform serving laymen in the audience who want to become more engaged with the nuances of the dance that they are watching.
The main beneficiary of the analysis of dance movements would be the dancers themselves. The analysis will help them identify whether or not they are dancing the figures correctly. If a figure is confused for a different figure, it may be because the dancers have not sufficiently emphasized the difference in their dancing and need to improve their technique on that figure. That confusion data could also be used by competition judges to mark competitors on how well dancers are performing figures; that task is currently done by eye-balling multiple competitors on the floor, and is challenging when there are over ten couples to keep track of.
We make three main contributions in detecting ballroom dance movements using a smart watch.
First, we minimize the number of sensors required, by using a single smart watch per dancing couple. We show that one IMU sensor is sufficient to distinguish between complex dancing movements around a dance floor.
Second, we identify and evaluate six learning representations that can be used for classifying the figures with varying accuracies. The representations are 1) Gaussian Hidden Markov Model, 2) Random Forest, 3) Feed-Forward Neural Network, 4) Recurrent Neural Network (LSTM), 5) Convolutional Neural Network, and 6) a Convolutional Neural Network that feeds into a Recurrent Neural Network.
Finally, we model the sequence of figures as a Markov chain, using the fact that the transitions between figures are memoryless. We use the rules of the waltz to determine which transitions are possible and which are not. With that transition knowledge, we correct the immediately previous figure’s estimate. This leads to an average estimation accuracy improvement of 5.33 percentage points.
2 Dataset Description
2.1 Data Collection
The data was collected using an Android app on a Samsung Gear Live smart watch. The app was developed for this work on top of the ArmTrak data collection app. We were able to reliably collect two derived sensor measurements from the Android API:
Linear Acceleration. This contains accelerometer data in the X, Y and Z directions of the smart watch, with the effect of gravity removed. The effect was presumably removed by calculating gravity at the instant when the couple stood perfectly still before the start of the dance.
Rotation Vector. This provides the Euler angles (roll, pitch and yaw) by fusing accelerometer, gyroscope and magnetometer readings in the global coordinate space. We use only the yaw in this paper because we have prior knowledge that there is no roll or pitch in the waltz.
In total, we collected readings from 4 sensor axes (three from the Linear Acceleration sensor and the yaw from the Rotation Vector sensor). The readings were reported at irregular intervals (whenever a change in environment was sensed). In order to facilitate signal processing, we downsampled the data such that each figure contained exactly 100 sensor samples per axis, which was possible because the effective sampling rate was greater than that. The downsampling was done by taking the median (instead of the mean, which is sensitive to outliers) of the values in each of 100 evenly spaced time windows. From this point on in the paper, when we refer to “samples”, we refer to an observation for a figure, of dimension $4 \times 100$, as one sample.
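The median downsampling step can be sketched as follows. This is a minimal illustration for one sensor axis, assuming NumPy and sorted timestamps; the function name `downsample_median` and its arguments are hypothetical and are not taken from the data collection app.

```python
import numpy as np

def downsample_median(timestamps, values, n_windows=100):
    """Reduce an irregularly sampled 1-D signal to n_windows window medians.

    Assumes timestamps are sorted in increasing order.
    """
    timestamps = np.asarray(timestamps, dtype=float)
    values = np.asarray(values, dtype=float)
    # Evenly spaced window edges spanning the time range of the figure.
    edges = np.linspace(timestamps[0], timestamps[-1], n_windows + 1)
    # Window index of each reading; the final reading is clipped into the last window.
    idx = np.clip(np.searchsorted(edges, timestamps, side="right") - 1,
                  0, n_windows - 1)
    out = np.full(n_windows, np.nan)  # a window with no readings stays NaN
    for w in range(n_windows):
        in_window = values[idx == w]
        if in_window.size:
            out[w] = np.median(in_window)
    return out
```

Applying the function to each of the 4 axes and stacking the results would give one $4 \times 100$ sample per figure.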
The app was developed in such a way that the button that started recording the above sensor measurements also simultaneously started playing the music via Bluetooth. That ensured that the music and the recording of the movements were time-synchronized.
For all the data that was collected, we used one song. We performed manual segmentation of the song using its beats offline, and that segmentation was used to segment the time series data for the dance into segments corresponding to figures in the dance. One figure is performed on one measure, so the length of each figure for our chosen song was sec. We noted the song intro length (where no dancing was performed) and ignored all data in that period. For each figure, we extended the window of measurements equally at the beginning and at the end by 0.35 seconds to account for slight errors in dancer timing. That ensured that the window captures the figures even if the dancer was slightly early or late to begin/finish dancing the figure.
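The padded windows can be sketched as below. The sketch assumes a constant measure length, whereas the segmentation in this work was done manually from the beats; `figure_windows`, `intro_s` and `measure_s` are hypothetical names, and any values passed to them are placeholders rather than the values for our song.

```python
def figure_windows(intro_s, measure_s, n_figures, pad_s=0.35):
    """Return (start, end) times in seconds for each figure, padded on both sides."""
    windows = []
    for k in range(n_figures):
        start = intro_s + k * measure_s   # skip the song intro
        end = start + measure_s           # one figure per measure
        # Extend the window equally at both ends to tolerate timing errors.
        windows.append((max(0.0, start - pad_s), end + pad_s))
    return windows
```

Consecutive windows overlap by twice the padding, so a reading near a boundary can belong to two figures.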
The data was collected prior to the start of this course, and we would like the course instructors to not count that effort towards this project.
2.2 Cross-Validation Groupings
In total, we collected 818 figure samples across 16 different waltz figures, over 14 dances (figure sequences). The original data (not downsampled) for 4 figure samples from two figures are illustrated in Fig. 2. The input data thus had a dimension of $818 \times 4 \times 100$, for 818 figure samples, 4 sensors, and 100 measurements per sensor per figure sample.
Since the 818 samples came from 14 separate dances (figure sequences), we performed 7-fold cross-validation with two dances per cross-validation group (assigned randomly). That ensured sequences of figures (dances) were not split across different cross-validation groups. It also allowed us to test our representations’ accuracy for each sequence as a whole. In summary, 6 out of the 7 cross-validation groups were used for training, and one was used for testing.
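A grouping of this kind can be sketched as follows, assuming NumPy and an array `dance_ids` that gives the dance each sample came from. The helper names and the seed are hypothetical; the random assignment used for the reported results is not reproduced here.

```python
import numpy as np

def make_dance_folds(dance_ids, n_folds=7, seed=0):
    """Assign whole dances to folds at random; return a fold index per sample."""
    dance_ids = np.asarray(dance_ids)
    dances = np.unique(dance_ids)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(dances)
    # Deal the shuffled dances round-robin: 14 dances over 7 folds gives 2 per fold.
    fold_of_dance = {d: i % n_folds for i, d in enumerate(shuffled)}
    return np.array([fold_of_dance[d] for d in dance_ids])

def iter_splits(fold_of_sample, n_folds=7):
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    for k in range(n_folds):
        test = np.flatnonzero(fold_of_sample == k)
        train = np.flatnonzero(fold_of_sample != k)
        yield train, test
```

Because folds are assigned per dance rather than per sample, no figure sequence is ever split between the training and test sets.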
2.3 Labeling Ground Truth
Each dance was recorded on video so that labels (ground truth) could be given to the data segments corresponding to the figures. The labels are listed in Appendix A.
3 Markov Transitions
The sequence of figures in each dance can be modeled as a Markov chain. The probability of observing the next figure is dependent on the current figure, but independent of past figures given the current figure. The reason is as follows.
Certain figures end on the right foot, while others end on the left foot. Similarly, certain figures begin on the left foot, while others begin on the right foot. The probability of going from a figure ending on the right foot to another figure beginning on the right foot is zero (and the same applies to the left foot). That is because of the physics of the dance and the way weight is distributed between the feet. Similarly, some figures must be followed by figures that move forward while others must be followed by figures that go backward. Hence there is a memoryless dependency between two consecutive figures.
Using the above rules, we constructed a transition matrix for all figures, which is given in Table 3 in Appendix B. We essentially gave zero probability to impossible transitions, and equal probability to all possible transitions. Therefore, our transition matrix is completely unbiased, and not based on real training data.
The advantage of the unbiased transition matrix is that the same matrix can be used across different couples since it is very general. The disadvantage is that it may not assist in settling ambiguities between two possible figures because it gives them equal probability. To address that limitation, we construct a transition matrix from the training data (for each of the 7 cross-validation groups) as follows. We start with a matrix that has ones in all possible transitions and zeros in all impossible transitions. Whenever a transition is seen in the training set, we add 1 to the value of that transition in the matrix. Therefore, every possible transition has a value of one plus its frequency in the training set, and thus a value of at least 1. Finally, we normalize the matrix so that the rows all sum to 1.
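The construction of the trained transition matrix can be sketched as follows, assuming NumPy, a 0/1 matrix `allowed` that encodes the possible transitions, and training sequences given as lists of figure indices. The function name and argument layout are hypothetical.

```python
import numpy as np

def trained_transition_matrix(allowed, sequences):
    """Build a row-normalized transition matrix from rules plus observed counts.

    allowed:   (n, n) array with 1 for possible and 0 for impossible transitions.
    sequences: iterable of figure-index sequences from the training dances.
    """
    # Start from ones in possible transitions and zeros elsewhere.
    counts = np.asarray(allowed, dtype=float).copy()
    for seq in sequences:
        for prev, nxt in zip(seq[:-1], seq[1:]):
            counts[prev, nxt] += 1.0  # add 1 for every observed transition
    row_sums = counts.sum(axis=1, keepdims=True)
    # Normalize rows to sum to 1; a row with no allowed transition stays all zero.
    return np.divide(counts, row_sums, out=np.zeros_like(counts),
                     where=row_sums > 0)
```

Starting from ones rather than zeros acts as additive smoothing, so a legal transition that never appears in the training dances still keeps a nonzero probability.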
4 Hidden Markov Model Representation
The dance can be represented as a Hidden Markov Model (HMM) where the states represent figures that emit sensor readings, as illustrated in Fig. 3. Although the state space is discrete, the emission space is continuous because the sensor readings are real-valued. As a result, the HMM cannot be solved using a discrete emission probability matrix. Instead, we assumed Gaussian emission probabilities, resulting in a Gaussian HMM. To make that Gaussian assumption valid, we modified the input data as follows. As described in Section 2.2, the samples lie in $\mathbb{R}^{4 \times 100}$. We downsample that input by taking the mean of the 100 readings for each sensor, so that the resulting input lies in $\mathbb{R}^{4}$.
We used the hmmlearn Python library [7] to estimate the transition and emission probabilities while fitting the input data. We initialized the transition probabilities with the trained transition matrix described in Section 3, and initialized the state vector with the actual initial state obtained from the ground truth.
The problem with the HMM approach for this task is that the HMM is a generative model, and not a discriminative model. At no stage does the model use the actual known labels to perform classification. It simply estimates states using the probability information; we assigned labels to the states by fitting the training set and matching the states estimated by the HMM with the known labels. The approach achieved an accuracy of 35.93% on the test sets, averaged across the 7 cross-validation groups.
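One simple way to match estimated states with known labels is a majority vote per state, sketched below with NumPy. The matching rule is an assumption made for illustration, and `map_states_to_labels` is a hypothetical helper, not part of hmmlearn.

```python
import numpy as np

def map_states_to_labels(states, labels, n_states, n_labels):
    """Map each hidden state to the ground-truth label it co-occurs with most.

    states: HMM state index estimated for each training sample.
    labels: ground-truth figure index for each training sample.
    """
    cooc = np.zeros((n_states, n_labels), dtype=int)
    # Count how often each (state, label) pair occurs in the training set.
    np.add.at(cooc, (np.asarray(states), np.asarray(labels)), 1)
    # A state that never occurs has an all-zero row and defaults to label 0.
    return cooc.argmax(axis=1)
```

Under this rule several states may map to the same label, which is one way a generative model that never sees the labels during fitting can lose accuracy.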
5 Random Forest Representation
We used the Extra Trees Classifier provided in the Scikit-Learn Python library (Pedregosa et al., 2011) to classify figures directly from the downsampled data. That classifier is an ensemble method that, in the manner of a Random Forest, aggregates the results of several (250 in our case) randomized decision trees. Each input sample was in $\mathbb{R}^{4 \times 100}$ (4 sensors, 100 time series points per sensor), so there were 400 features. The approach achieved an average accuracy of 72.2%.
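The setup can be sketched with Scikit-Learn as follows. The data here is synthetic noise standing in for the real figure samples, and every hyperparameter other than the 250 estimators is left at its library default, which is an assumption rather than a record of the configuration used.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in: 160 samples of 4 sensors x 100 time points, 16 classes.
X = rng.normal(size=(160, 4, 100))
y = np.tile(np.arange(16), 10)

# Flatten each 4 x 100 sample into a vector of 400 features.
X_flat = X.reshape(len(X), -1)

clf = ExtraTreesClassifier(n_estimators=250, random_state=0)
clf.fit(X_flat[:128], y[:128])      # first 128 samples for training
pred = clf.predict(X_flat[128:])    # remaining 32 samples for testing
```

Flattening discards the explicit time ordering, so each tree treats the 400 readings as unordered features.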
6 Deep Neural Networks Representations
We test three different deep neural network architectures, illustrated in Fig. 4, and a standard feed-forward neural network (not illustrated). In all layers, we used ReLU activations, except for the LSTM layers, for which we used sigmoid activations. The inputs to all the networks are the same, and are based on the cross-validation groupings described in Section 2.2. The outputs are also the same, because we want to obtain the probabilities associated with the different classes. Therefore, we use a softmax output layer with a categorical cross-entropy loss function.
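For reference, the softmax output and the categorical cross-entropy loss take their standard forms for the 16 classes. The symbols are notation introduced here for illustration: $z_i$ is the pre-activation of output unit $i$, $p_i$ is the predicted probability of figure $i$, and $y_i$ is the one-hot ground-truth indicator.

```latex
p_i = \frac{e^{z_i}}{\sum_{j=1}^{16} e^{z_j}},
\qquad
\mathcal{L} = -\sum_{i=1}^{16} y_i \log p_i
```

Since $y$ is one-hot, the loss reduces to the negative log-probability that the network assigns to the correct figure.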
We used the Keras [5] package for Python, which provides an abstraction over a TensorFlow (Abadi et al., 2015) backend. We trained the networks using the Adam solver (Kingma and Ba, 2014).
Feed-forward: All layers are densely connected. There are D dense layers, each of width W. We varied D and W, as given in Appendix C.
Convolutional (ConvNet): Since we are looking at four 1-dimensional streams, we used 1-dimensional convolution layers. The first two layers are convolutional, with 64 filters and a kernel size of 3. The next two layers are preceded by a max pooling operation, and contain 128 filters with a kernel size of 3. Another max pooling operation is added before D dense layers of width W. This particular architecture was inspired by the example in [5].
Recurrent (LSTM): This network has D Long Short-Term Memory (LSTM) layers (Hochreiter and Schmidhuber, 1997). The layers have a time history of 100 cells, and 4 features each (as per the input dimension). With a width of W, each LSTM layer has 400W cells. This makes the network very large, as shown by the parameter counts in Appendix C. This network took the longest time to train.
Hybrid Convolutional and Recurrent (ConvNet+LSTM): This network is a hybrid of the aforementioned convolutional and recurrent architectures. One LSTM layer replaces the dense layers in the convolutional architecture. The complexity of this layer is less than that of the pure LSTM network because the convolutional layers reduce the input dimensionality. As a result, this network trains faster than the pure LSTM architecture. This hybrid architecture was inspired by related work (Morales and Roggen, 2016).
7 Markov Correction
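In this section, we propose a simple approach to combine the results of the deep learning representations (referred to as classifiers) with the Markov structure of the dance. Let $s_i$, $i \in \{1, \dots, 16\}$, be a possible state from the 16 different figures, and let $f_t$ be the figure at time index $t$, with sensor observation $x_t$. Then at each time index $t$, the classifier estimate is

$$\hat{f}_t = \arg\max_{s_i} P(f_t = s_i \mid x_t).$$

We assume that the classifier is correct for the current figure and suppose

$$\hat{f}_t = s_j.$$

We correct the immediately preceding figure as follows.

$$\hat{f}_{t-1} = \arg\max_{s_i} P(f_t = s_j \mid f_{t-1} = s_i)\, P(f_{t-1} = s_i \mid x_{t-1}).$$

We get $P(f_t = s_j \mid f_{t-1} = s_i)$ from the trained transition matrix described in Section 3 and $P(f_{t-1} = s_i \mid x_{t-1})$ from the classifier.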
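The correction can be sketched with NumPy as follows. The sketch assumes that the previous figure is re-estimated by maximizing the product of its classifier probability and the transition probability into the current classifier estimate; `markov_correct` and its arguments are hypothetical names.

```python
import numpy as np

def markov_correct(probs, transmat):
    """Correct each preceding figure using the current classifier estimate.

    probs:    (T, n) classifier probabilities for T consecutive figures.
    transmat: (n, n) row-stochastic transition matrix between figures.
    Returns the corrected label sequence; the last figure is left unchanged.
    """
    probs = np.asarray(probs, dtype=float)
    transmat = np.asarray(transmat, dtype=float)
    labels = probs.argmax(axis=1)
    for t in range(1, len(probs)):
        current = labels[t]  # raw classifier estimate, assumed to be correct
        # Score every candidate previous figure: classifier probability times
        # the probability of transitioning from it into the current figure.
        scores = probs[t - 1] * transmat[:, current]
        if scores.sum() > 0:  # keep the old estimate if no candidate is possible
            labels[t - 1] = scores.argmax()
    return labels
```

Each correction needs only the probabilities of two consecutive figures, so it can run online with a delay of one figure.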
8 Evaluation Results
The evaluation was performed with 7-fold cross-validation, with the groupings described in Section 2.2. The results are summarized below and include the best configurations for the neural networks. The results for all the different configurations are given in Appendix C. There is no directly related work that can be used for comparative evaluation. However, the accuracies presented can be compared with the accuracy of a random guess among the 16 figures, which is $1/16 = 6.25\%$.
|Representation||Accuracy (%)||Accuracy with Markov Correction (%)|
|Extra Trees Classifier||72.20||73.48|
From the results, it is clear that the neural network approaches outperform the Extra Trees Classifier (an ensemble of randomized decision trees). The hybrid approach with the convolutional and LSTM layers performs the best.
On average, the Markov correction approach proposed in Section 7 is found to benefit all the classifiers. We illustrate this using confusion matrices, which capture the results for individual figures. Ideally, the normalized confusion matrix should be the identity matrix, because that would mean that the predicted figure was always the actual figure. It can be seen in Fig. 5(a), however, that the left-foot closed change (LCC) is most often confused for a whisk (W). A whisk is almost always followed by a progressive chasse (PC). The Markov correction approach recognizes this from the transition probabilities and corrects the estimate from a whisk to a left-foot closed change as soon as it sees that the figure was not followed by a progressive chasse. The improved classification results are illustrated in Fig. 5(b).
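A row-normalized confusion matrix of the kind discussed above can be computed as in the following NumPy sketch. The function name is hypothetical, and the row normalization is an assumption about how the matrices in Fig. 5 are scaled.

```python
import numpy as np

def confusion_matrix_normalized(y_true, y_pred, n_classes):
    """Row-normalized confusion matrix: entry (i, j) is the fraction of
    samples of actual figure i that were predicted as figure j."""
    cm = np.zeros((n_classes, n_classes), dtype=float)
    # Count every (actual, predicted) pair.
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1.0)
    row_sums = cm.sum(axis=1, keepdims=True)
    # Figures absent from y_true keep an all-zero row.
    return np.divide(cm, row_sums, out=np.zeros_like(cm), where=row_sums > 0)
```

With perfect predictions the result is the identity matrix, and a large off-diagonal entry, such as the one for LCC predicted as W, marks a pair of figures that the classifier confuses.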
Markov correction sometimes hurts the classification results because the assumption that the current figure was correct may not be valid. If the current figure has been incorrectly classified, then that error in classification could be propagated to the previous figure. Fig. 6 shows the distribution of improvements. On average, the improvement was 5.33 percentage points.
9 Related Work
To the best of our knowledge, we are the first to use IMU sensing for dance recognition and dance analytics. Multiple accelerometers are used as input for dancing video games in Crampton et al. (2007). VICON systems were proposed in Dyaberi et al. (2004) and video recognition was used in Matthew Faircloth (2008). Those approaches do not work in our scenario where there are multiple ballroom dancers simultaneously on the floor, leading to occlusion. Also, they are expensive and not suited for amateurs.
Music segmentation studies for dance detection purposes are presented in Shiratori et al. (2004). Turning motions in Japanese folk dances are modeled from observation in Rennhak et al. (2010). Signal processing techniques used in dance detection are reviewed in Pohl (2010). Ballroom dance styles are differentiated in Schuller et al. (2008) from the music that is being played.
For human activity recognition, ensembles of deep LSTM networks were proposed in Guan and Plötz (2017), but this approach is not suitable for real-time prediction because it is too slow. In our experiments, a single deep LSTM network took nearly a whole day to train, and loading the weights for prediction was also very slow. Convolutional neural networks were proposed in Zeng et al. (2014). Our approach is similar, but we use more convolutional layers, as suggested in the Keras time series classification example [5]. Our best results were obtained using the hybrid architecture between convolutional and recurrent neural networks, which was proposed for human activity recognition in Morales and Roggen (2016).
In this paper, we presented a study of whole body movement detection using a single smart watch in the context of competitive ballroom dancing. Our approach was able to successfully classify movement segments from the International Standard Waltz, using deep learning representations. The representations alone achieved a maximum accuracy of 85.95%, averaged over 7 cross-validation groups. Using the fact that the segments can be represented as a Markov chain, the accuracy was improved to 92.31% by correcting the prediction for each preceding segment. The deep learning representations outperformed ensembles of random forests, and a Gaussian HMM representation performed poorly because it was not discriminative.
We would like to thank Sheng Shen for sharing his ArmTrak smart watch app code that was modified for this project. We would like to thank Aarti Shah, Akshat Puri and Anna Kalinowski for allowing us to record their dance movements.
- TensorFlow: large-scale machine learning on heterogeneous systems. Note: Software available from tensorflow.org.
- Dance, dance evolution: accelerometer sensor networks as input to video games. In 2007 IEEE International Workshop on Haptic, Audio and Visual Environments and Games, pp. 107–112.
- Multi-scale convolutional neural networks for time series classification. CoRR abs/1603.06995.
- Phrase structure detection in dance. In Proceedings of the 12th Annual ACM International Conference on Multimedia, MULTIMEDIA ’04, New York, NY, USA, pp. 332–335.
- Getting started with the Keras sequential model. Keras. Note: Available https://keras.io/getting-started/sequential-model-guide/. Accessed December 2017.
- Ensembles of deep LSTM learners for activity recognition using wearables. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 1 (2), pp. 11:1–11:28.
- Hmmlearn. hmmlearn. Note: Available http://hmmlearn.readthedocs.io/en/latest/tutorial.html#available-models. Accessed December 2017.
- Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780.
- Adam: A method for stochastic optimization. CoRR abs/1412.6980.
- Combining audio and video tempo analysis for dance detection. M.S. Thesis, University of Central Florida.
- Deep convolutional feature transfer across mobile activity recognition domains, sensor modalities and locations. In Proceedings of the 2016 ACM International Symposium on Wearable Computers, ISWC ’16, New York, NY, USA, pp. 92–99.
- Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
- Motion pattern recognition for interactive dance. M.S. Thesis, Technische Universität Darmstadt.
- Detecting dance motion structure using body components and turning motions. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2264–2269.
- Tango or waltz?: putting ballroom dance style into tempo detection. EURASIP J. Audio Speech Music Process. 2008, pp. 6:1–6:12.
- I am a smartwatch and I can track my user’s arm. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, MobiSys ’16, New York, NY, USA, pp. 85–96.
- Detecting dance motion structure through music analysis. In Proceedings of the Sixth IEEE International Conference on Automatic Face and Gesture Recognition, FGR ’04, Washington, DC, USA, pp. 857–862.
- Convolutional neural networks for human activity recognition using mobile sensors. In 6th International Conference on Mobile Computing, Applications and Services, pp. 197–205.
- Convolutional neural networks for time series classification. Journal of Systems Engineering and Electronics 28 (1), pp. 162–169.
Appendix A Waltz Figure Information
Ballroom dancing competitions in the U.S. are conducted at multiple skill levels. The syllabus at the skill levels restricts which figures can be used. In this paper, we focus on Waltz figures at the Newcomer and Bronze skill levels, because the majority of amateur dancers compete at these skill levels. The figures can also be used by dancers at more advanced skill levels, such as Silver and Gold. Table 2 gives the names of the figures that we consider in this paper, along with the short names used throughout the paper.
|Left Foot Figures||Right Foot Figures|
|Left-foot Closed Change (LCC)||Right-foot Closed Change (RCC)|
|Natural Turn 4-6 (N2)||Natural Turn 1-3 (N1)|
|Natural Spin Turn (NST)||Reverse Corte (RC)|
|Reverse Turn 1-3 (R1)||Reverse Turn 4-6 (R2)|
|Chasse to Right (CTR)||Chasse from Promenade (PC)|
|Outside Change (OC)||Basic Weave (Weave)|
|Double Reverse (DR)|
|Back Whisk (BW)|
|Back Lock (BL)|
Appendix B Uniform (Unbiased) Transition Probabilities
Table 3 describes the uniform (unbiased) transition probabilities between figures of the International Standard waltz.
Appendix C Detailed Results for Neural Network Configurations
The following table contains the detailed results and the number of model parameters for each of the architectures described in Fig. 4. Note that D refers to the number of hidden layers for the feed-forward neural network, but it refers to the number of LSTM/Dense layers in the other architectures (described in Fig. 4).