1 Introduction
Accurate control for human motion tracking is a key requirement in many applications, including human-robot interaction, teleoperation systems, exoskeletons and surveillance systems. For these applications, better motion prediction models enable better control. However, as system complexity increases, conventional methods that require handcrafted models become increasingly challenging to design. Deep learning architectures can provide such a model given enough data.
Using deep neural networks end-to-end for difficult robot manipulation tasks was proposed in [1], but this approach is too data-inefficient. In contrast, several recent approaches have used a Model Predictive Control (MPC) framework in which deep learning is used only for the part that is difficult to model. For example, [2] learns complex contact dynamics for robotic food-cutting. In [3], the mapping of actions to image pixel motion is learned for vision-based manipulation tasks. In [4], a model is learned for predicting forces in a robot-assisted dressing task. In [5], the dynamics of aggressive driving are learned for controlling an autonomous car.
This work is in the same area of research, where neural networks learn complex predictive models for use within an MPC framework. Specifically, we learn models to predict human motion. As the representative task, the robot has to write characters at the same time as a human, as shown in Fig. 1. We chose this task to leverage existing datasets on character writing such as [6]. In addition to being a new application area, we also design the MPC to be able to switch to a more conservative prediction. Furthermore, we explore online learning using the Dynamic Boltzmann Machine [7] neural network.
2 Model predictive control framework for tracking control
A general MPC formulation can be expressed as:
$$U^{*} = \underset{U}{\arg\min}\; J(X, U) \quad \text{subject to} \quad x_{k+1} = f(x_k, u_k),$$
where the resulting sequence of future control actions, $U$, is obtained by optimizing the objective function, $J$, over the corresponding sequence of future states, $X$, under the constraint of the system dynamics equation, $f$, where $x_{k+1}$ is the resulting next state when the action $u_k$ is applied while in state $x_k$.
The functions $J$ and/or $f$ can be fully or partially replaced by neural network models. This design choice leads to several different approaches. Here, we use a neural network only as a part of the objective function $J$, and design it such that the neural network is a model that predicts the future human motion.
For the dynamics $f$, we assume that we can freely control the end-effector and that the motion is smooth, so that the trajectory is three times differentiable. We can then define the end-effector state as the Cartesian position, velocity and acceleration, and use the jerk as the control input. For a single time step, $\Delta t$, and a single degree of freedom (DOF), the equation for $f$ is linear:
$$x_{k+1} = A x_k + B u_k,$$
where:
$$A = \begin{bmatrix} 1 & \Delta t & \Delta t^2/2 \\ 0 & 1 & \Delta t \\ 0 & 0 & 1 \end{bmatrix}, \qquad B = \begin{bmatrix} \Delta t^3/6 \\ \Delta t^2/2 \\ \Delta t \end{bmatrix}.$$
We can apply the same model independently for the three translations. To obtain a vector of future states, $X$, we can recursively apply $f$ to get an arbitrarily long sequence of future states. Doing so, $X$ has length $9N$ (position, velocity, acceleration for 3 DOF and $N$ time steps) while the control vector $U$ is a column vector with length $3N$ (jerk for 3 DOF and $N$ time steps). A linear model for $X$ can still be written as:
$$X = S_x x_0 + S_u U,$$
where $x_0$ is the initial state. The matrices $S_x$ and $S_u$ are built from $A$ and $B$ through a process known in the MPC literature as condensing.
For the objective function, we need to track the motion of the character being written. This can be done by minimizing $\| X - X^{\mathrm{ref}} \|$, the L2-norm of the difference between the state and a target state. The target $X^{\mathrm{ref}}$ requires future information, so we need a predictive model. Here, we propose switching between two models. The first is a conservative model, $X_c$, which predicts no motion: it simply copies the last position with zero velocity and acceleration. This simple model has significant tracking error, especially with quick motions, but produces slower, more conservative motions since it is similar to doing only feedback control without prediction. The other model is a neural network that takes as input a running history of the current state and produces the prediction $X_n$. This is explained in the next section. The last term of the objective smooths the control action. The final objective function is then built by adding gains and weights, written here as the scalars $w_c$, $w_n$ and $w_u$:
$$J = w_c \| X - X_c \|^2 + w_n \| X - X_n \|^2 + w_u \| U \|^2 .$$
The first two objectives are designed to achieve the same goal, so their weights are designed to form a homotopy with $w_c + w_n = 1$. Normally, only one of these objectives is active, so that $w_c = 1$ or $w_n = 1$. However, when switching, a short transition period is needed in which the weights are varied smoothly. Meanwhile, the last term is only needed for smoothing/regularization, so its weight $w_u$ is kept small relative to the tracking weights. The resulting control problem can be solved quickly and efficiently as a quadratic programming (QP) problem.
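As an illustration of the condensing step and the resulting QP, the following is a minimal sketch for a single DOF, not the authors' implementation. It assumes the standard discrete triple integrator (position, velocity, acceleration state with jerk input) and omits inequality constraints so that the QP reduces to a linear solve. The names `triple_integrator`, `condense`, `solve_tracking`, `X_c`, `X_n`, `w_c`, `w_n`, `w_u` and all numeric values are assumptions chosen for the example.

```python
import numpy as np

def triple_integrator(dt):
    """Single-DOF model x_{k+1} = A x_k + B u_k, state (pos, vel, acc), input jerk."""
    A = np.array([[1.0, dt, dt**2 / 2.0],
                  [0.0, 1.0, dt],
                  [0.0, 0.0, 1.0]])
    B = np.array([[dt**3 / 6.0], [dt**2 / 2.0], [dt]])
    return A, B

def condense(A, B, N):
    """Stack N steps so that X = Sx @ x0 + Su @ U, with X = [x_1; ...; x_N]."""
    n, m = B.shape
    Sx = np.vstack([np.linalg.matrix_power(A, k + 1) for k in range(N)])
    Su = np.zeros((N * n, N * m))
    for i in range(N):              # block row i holds x_{i+1}
        for j in range(i + 1):      # u_j reaches x_{i+1} through A^(i-j) B
            Su[i * n:(i + 1) * n, j * m:(j + 1) * m] = (
                np.linalg.matrix_power(A, i - j) @ B)
    return Sx, Su

def solve_tracking(Sx, Su, x0, X_c, X_n, w_c, w_n, w_u):
    """Minimize w_c||X - X_c||^2 + w_n||X - X_n||^2 + w_u||U||^2 (no inequalities)."""
    free = Sx @ x0                  # state evolution with zero control
    H = (w_c + w_n) * Su.T @ Su + w_u * np.eye(Su.shape[1])
    g = Su.T @ (w_c * (X_c - free) + w_n * (X_n - free))
    return np.linalg.solve(H, g)    # stationary point of the quadratic cost

# Illustrative usage (all values are assumptions).
dt, N = 0.01, 30
A, B = triple_integrator(dt)
Sx, Su = condense(A, B, N)
x0 = np.zeros(3)
X_c = np.tile([0.05, 0.0, 0.0], N)  # conservative target: hold the last observed position
X_n = X_c.copy()                    # placeholder for a neural-network prediction
U = solve_tracking(Sx, Su, x0, X_c, X_n, w_c=1.0, w_n=0.0, w_u=1e-6)
X = Sx @ x0 + Su @ U                # predicted end-effector states over the horizon
```

With joint or workspace limits, the same matrices would instead be passed to a QP solver; the closed-form solve above only covers the unconstrained case.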
3 Human motion prediction with neural networks
Human motion prediction with neural networks is also a topic of interest outside robot control, for example in [8, 9]. The implicit assumption is that there is an underlying motion pattern such that, given a sufficiently long history of the current motion, we can predict the future motion by learning the parameters of the neural network model. For example, in our task, once most of a letter is written, it should be clear which letter it is, and this provides enough context to predict the future motion. A well-known issue is that the first few predictions will be poor, since there is not yet enough history to provide a proper context. This is why we added the conservative model to our MPC, along with the functionality to switch between models.
The problem of human motion prediction is a well-studied subclass of sequence modeling where Recurrent Neural Networks (RNNs) have shown good results. The Long Short-Term Memory (LSTM) model [10] is the current standard for RNNs and is used in benchmarks, for example in [8, 9]. Although these RNN models have shown impressive results in several application areas, one concern is the training time, because all these models require backpropagation through time. This is clearly not suited for online learning. At testing time, the forward pass is fast enough for the robot control application we present. The disadvantage is that once the model is trained, it has to be kept as it is.
Another neural network model that is suitable for time-series prediction is the Dynamic Boltzmann Machine (DyBM) presented in [7]. It is an energy-based model designed for time-series prediction with training speed in mind, so it does not use backpropagation through time. It is also designed for online learning on edge devices. Recently, [11] compared a variation of the DyBM with the LSTM, and the results are comparable in terms of prediction error. The advantage is that the reported training time of the DyBM is a fraction of that of the LSTM. This is a significant advantage for our target applications.
Apart from the specific architecture of the neural network model, another design choice is the training method, which dictates the function learned.
We train the network to do a one-step prediction. To produce the required $N$-step prediction, we use the previous prediction result as the next input. A known issue of this technique is that the predictions progressively worsen. This does not affect the MPC, since it has a structure where later predictions have less weight in the optimization procedure. An advantage of this technique is that $N$ can be set arbitrarily, as the model is independent of it. Lastly, after the $N$-step prediction is created, the internal state of the LSTM and DyBM should be reset to the state just after the first prediction. This ensures continuity of the real input sequence inside the memory of the NNs.
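The recursive prediction and the state reset described above can be sketched as follows. `DampedAveragePredictor` is a hypothetical toy stand-in for a stateful one-step model such as an LSTM or DyBM, and `rollout` is an illustrative helper; neither is taken from the authors' code.

```python
import copy
import numpy as np

class DampedAveragePredictor:
    """Hypothetical toy one-step predictor with an internal memory of its inputs."""

    def __init__(self, alpha=0.5, damping=0.9):
        self.alpha = alpha
        self.damping = damping
        self.memory = 0.0   # internal state summarizing the input history

    def step(self, v):
        """Consume one sample and return the prediction of the next one."""
        self.memory = self.alpha * self.memory + (1.0 - self.alpha) * v
        return self.damping * self.memory

def rollout(model, v_observed, n_steps):
    """Feed one real sample, then recycle predictions to obtain n_steps samples.

    The internal state is restored to just after the real sample was consumed,
    so the model's memory only ever contains the real input sequence.
    """
    first = model.step(v_observed)               # prediction from real data
    saved = copy.deepcopy(model.__dict__)        # snapshot after the first prediction
    preds = [first]
    for _ in range(n_steps - 1):
        preds.append(model.step(preds[-1]))      # previous prediction is the next input
    model.__dict__.update(saved)                 # discard state built from predictions
    return np.array(preds)

model = DampedAveragePredictor()
horizon = rollout(model, v_observed=1.0, n_steps=4)
```

Because the horizon length is only an argument of `rollout`, it can be changed at run time without retraining the one-step model.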
4 Results and discussion
We evaluate our framework on the human handwriting dataset provided by [6]. The data is composed of the alphanumeric characters and basic math symbols written several times by 11 people. It is already divided into three sets: two training sets and a testing set. Here, we used only the “training1” set consisting of sequences for training. All the tests and validation are then done using the “testing” set which has sequences. The data itself is composed of a series of positions in a 2-DOF coordinate system. As a normalization step, the series of positions is converted to velocities by finite differencing. The pen-up and pen-down events are removed, so there is a large computed velocity during such an event. At the end, zeros are appended so that the networks learn the concept of stopping after the writing stroke. When training, the networks are reset before a new sequence is shown. Finally, we did not add any distinguishing mark for different characters, and we used all the characters to train a single model. This is because we wanted the neural networks to learn a general motion model that is suitable for all the character writing strokes.
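The preprocessing described above (finite differencing followed by appended zeros) can be sketched as below. The function name `positions_to_velocities`, the sampling period `dt` and the number of appended zero samples `n_stop` are assumptions for illustration; the source does not state these values.

```python
import numpy as np

def positions_to_velocities(positions, dt=1.0, n_stop=5):
    """Convert a (T, 2) pen trajectory into (T - 1 + n_stop, 2) velocities.

    dt and n_stop are illustrative assumptions, not values from the paper.
    """
    positions = np.asarray(positions, dtype=float)
    velocities = np.diff(positions, axis=0) / dt     # forward finite difference
    stop = np.zeros((n_stop, positions.shape[1]))    # zero velocity: the pen is at rest
    return np.vstack([velocities, stop])

stroke = [[0.0, 0.0], [1.0, 2.0], [3.0, 3.0]]
features = positions_to_velocities(stroke, dt=1.0, n_stop=2)
```

A jump between two strokes simply appears as one large velocity sample, which matches the description of the removed pen-up and pen-down events.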
For this test, we used one layer of LSTM, with cell state of size and the activation function. This is followed by a fully connected linear layer which produces the output. The Mean Squared Error (MSE) is used as a cost function for backpropagation. The model is trained for epochs, with a batch size of 16 sequences which are zero-padded for uniformity.
As for the DyBM (https://github.com/ibm-research-tokyo/dybm), we used the linear version as the base with three different variations. Firstly, we trained it only offline with the training data. This serves as a comparison with the LSTM, which can only be trained offline for our application. Secondly, we allowed the DyBM to use the testing data for online learning. This is the normal usage of the DyBM. Finally, we added an echo state network (ESN) [12], with size and leak parameter , to the DyBM. This should enhance the nonlinearities it can learn while still being fast enough for online learning.
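For readers unfamiliar with echo state networks, the following is a generic sketch of a leaky-integrator reservoir update in the style of [12]. It is not the DyBM library's API, and how the reservoir output is combined with the DyBM is not shown. The class name `LeakyReservoir`, the reservoir size, the leak value, the spectral radius and the random initialization are all assumptions.

```python
import numpy as np

class LeakyReservoir:
    """Generic leaky-integrator echo state reservoir with fixed random weights."""

    def __init__(self, n_in, n_res, leak, spectral_radius=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.uniform(-1.0, 1.0, size=(n_res, n_in))
        W = rng.uniform(-1.0, 1.0, size=(n_res, n_res))
        # Rescale the recurrent weights to the requested spectral radius.
        W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
        self.W = W
        self.leak = leak
        self.state = np.zeros(n_res)

    def step(self, u):
        """Blend the old state with a new nonlinear activation; no training is involved."""
        pre = self.W_in @ np.asarray(u, dtype=float) + self.W @ self.state
        self.state = (1.0 - self.leak) * self.state + self.leak * np.tanh(pre)
        return self.state

reservoir = LeakyReservoir(n_in=2, n_res=10, leak=0.5)
nonlinear_features = reservoir.step([0.1, -0.2])
```

Since the reservoir weights stay fixed, each update costs only two matrix-vector products, which is why such a layer remains compatible with online learning.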
4.1 Neural network inference results
To serve as a baseline for evaluating the results, we used the simplest sensible prediction, which is to assume that the velocity will remain constant. A similar model was used in [8] as a baseline for predicting human motion. Table 4.1 shows a summary of the results on the testing set. We use three metrics. The first is the Mean Squared Error (MSE) over the whole validation set. Next, we do Per-Sequence (PS) comparisons: PS-B is the percentage of sequences having an MSE better than the baseline, and PS-LSTM is similar but compared against the LSTM.
algorithm             MSE      PS-B   PS-LSTM
baseline              3.0875   -      33%
LSTM                  3.7132   67%    -
DyBM offline          3.2483   39%    31%
DyBM online           2.7151   79%    39%
DyBM online and ESN   2.2715   90%    42%
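The constant-velocity baseline and the per-sequence metrics can be sketched as follows. The function names are illustrative, and the sketch assumes that "better" means a strictly lower per-sequence MSE; the source does not say how ties are handled or whether the overall MSE is pooled over samples or averaged over sequences.

```python
import numpy as np

def constant_velocity_baseline(velocities):
    """Predict v[t+1] = v[t]; the result is aligned with velocities[1:]."""
    return np.asarray(velocities, dtype=float)[:-1]

def sequence_mse(pred, target):
    """Mean squared error of one sequence over all time steps and dimensions."""
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.mean(diff ** 2))

def per_sequence_win_rate(mse_model, mse_reference):
    """Percentage of sequences where the model's MSE is strictly lower."""
    mse_model = np.asarray(mse_model, dtype=float)
    mse_reference = np.asarray(mse_reference, dtype=float)
    return 100.0 * float(np.mean(mse_model < mse_reference))

v = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 1.0]])
baseline_mse = sequence_mse(constant_velocity_baseline(v), v[1:])
ps_b = per_sequence_win_rate([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

Under this strict comparison and without ties, the win rate of a model against a reference and that of the reference against the model sum to 100%, which agrees with the 67% and 33% entries for the LSTM and the baseline.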
We can see that, in terms of MSE, the LSTM model and the DyBM trained only offline are both worse than the baseline, whereas the DyBMs with online learning are both better. The MSE here is just an indicator of the general model. To investigate further, we did per-sequence comparisons. All the models except for the DyBM with offline training are better than the baseline in a majority of the 8136 validation sequences (67% to 90%, against 39% for the offline DyBM). The results are expected for the DyBMs but somewhat surprising for the LSTM, which had a high overall MSE. On checking this further, we observed that sequences for the same symbol exhibit similar results. The LSTM performed worse on simpler, straighter symbols such as “v”, “1”, “” but better on more complex, curvier symbols such as “p”, “b”, “0”, or those with discontinuities from parsing the pen-up/pen-down event, like “K”. Since the simple baseline should provide a good approximation for the simple symbols, it is better in these cases. Because the LSTM performed well on the more difficult characters, we added the PS-LSTM comparison, which is still per-sequence but against the LSTM. In this column, we see that the other methods outperformed the LSTM in fewer than half of the sequences, although the online DyBMs come closest at 39% and 42%.
In summary, the LSTM has learned a highly nonlinear model that generalizes to different character strokes, but at the cost of being much worse on simple character strokes, leading to a high overall MSE. The DyBM trained only offline performs poorly across all metrics, but it was not intended to be used in such a manner. The online DyBM has learned a general model (low MSE, high PS-B). It is better than the LSTM for simple characters but worse for complex characters. The online DyBM with ESN is the best in overall performance, but it is still slightly worse than the LSTM on complicated characters.
As for training speed, the LSTM was trained with a batch size of 10 and took around 215 seconds per epoch, while the plain DyBM took around 43 seconds per epoch and the DyBM with ESN around 54 seconds per epoch, i.e. roughly 5 and 4 times faster, respectively. Although the speedup is not as high as for the dataset reported in [11], it is still significant.
4.2 Results of the complete framework
This subsection reports the results of testing the complete framework in simulations of a UR5 robot. Fig. 1 shows some results of the task. For reference, the grid in Fig. 1 has a spacing of 0.1 m. To compare how much the tracking error can be improved, we used a sequence for the letter K as a representative case where the baseline performs poorly in terms of MSE. The sequence, taken from the validation set, is played online to represent the human writing the letter. The robot's task is to write the letter together with the human at the exact same time. To control the robot, the MPC is used to generate the writing motion, which is then used as an end-effector command. Joint trajectory commands are obtained from this by using another QP for inverse kinematics, which handles the joint limits.
Fig. 2 shows a comparison of writing the same sequence in three different ways. First, only the feedback component was used, to give a baseline. Secondly, we used the predictions of one of the trained NN models while using the feedforward term throughout. Lastly, a perfect prediction can be obtained by using the test sequence in the feedforward term. Although this is practically impossible when the system runs online, it provides an ideal comparison point for the tests here. We can see that the feedback-only case resulted in a tracking error going up to cm. The mean squared tracking error was about . In comparison, we can see a significant improvement in the “with prediction” case, which used the LSTM with the preview horizon of length as a feedforward network for the MPC. Its mean squared tracking error was about . This is an order of magnitude better than the feedback-only case. Finally, we compare this result to a perfect prediction, whose mean squared tracking error is about . In this ideal prediction case, the error comes from a combination of the preview horizon (optimizing only on a limited time horizon instead of giving the full trajectory at once), the low-level robot motion controllers and the smoothing term that minimizes the jerk. The important point here is that using the NN for the feedforward term can result in tracking errors of the same order of magnitude as the perfect prediction case.
The final test uses the weights to switch smoothly from feedback only to feedforward. The purpose of this test is to verify that there are no adverse effects due to the switching. The same sequence as that in Fig. 2 was used. The resulting tracking error is shown in Fig. 3. The weight was linearly decreased from to during time step until . Fig. 3 shows no irregularity during this period, where the error decreased as expected.
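A linear switching schedule of this kind can be sketched as below. The function name `homotopy_weights`, the parameter names and the transition steps are hypothetical. The sketch assumes that the weight of the conservative (feedback-only) objective ramps from 1 to 0 while the weight of the neural-network objective is its complement, so that the two always sum to one.

```python
def homotopy_weights(k, k_start, k_end):
    """Return (w_c, w_n) with w_c + w_n = 1, ramping w_c linearly from 1 to 0.

    k is the current time step; the ramp runs from k_start to k_end.
    """
    if k_end <= k_start:
        raise ValueError("k_end must be greater than k_start")
    t = (k - k_start) / (k_end - k_start)
    t = min(1.0, max(0.0, t))       # clamp outside the transition period
    return 1.0 - t, t

# Illustrative transition between steps 10 and 20 (assumed values).
schedule = [homotopy_weights(k, 10, 20) for k in range(0, 31, 5)]
```

Before `k_start` only the conservative objective is active, after `k_end` only the predictive one, and in between the optimizer sees a convex combination of the two targets.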
5 Conclusion
In this paper, we presented a framework that predicts human motion by using different memory-based neural network models and then uses these predictions to produce an anticipatory action through an MPC. Furthermore, separate feedback and feedforward terms were designed to cope with cases where the prediction is unreliable. Finally, we demonstrated that it is possible to switch between the feedback and feedforward objectives seamlessly. The results show that the presented framework is an effective control strategy for human motion tracking tasks. Future work on using the same framework for various applications is planned.
References
 [1] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-End Training of Deep Visuomotor Policies,” J. Mach. Learn. Res., vol. 17, no. 1, pp. 1334–1373, Jan. 2016.
 [2] I. Lenz, R. A. Knepper, and A. Saxena, “DeepMPC: Learning Deep Latent Features for Model Predictive Control,” in Robotics: Science and Systems, 2015.
 [3] C. Finn and S. Levine, “Deep visual foresight for planning robot motion,” in IEEE International Conference on Robotics and Automation, May 2017, pp. 2786–2793.
 [4] Z. Erickson, H. M. Clever, G. Turk, C. K. Liu, and C. C. Kemp, “Deep Haptic Model Predictive Control for Robot-Assisted Dressing,” ArXiv e-prints, Sep. 2017.
 [5] G. Williams, N. Wagener, B. Goldfain, P. Drews, J. M. Rehg, B. Boots, and E. A. Theodorou, “Information theoretic MPC for model-based reinforcement learning,” in IEEE International Conference on Robotics and Automation, May 2017, pp. 1714–1721.
 [6] J. J. LaViola Jr. and R. C. Zeleznik, “A Practical Approach for Writer-Dependent Symbol Recognition Using a Writer-Independent Symbol Recognizer,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 11, pp. 1917–1926, Nov. 2007.
 [7] T. Osogami and M. Otsuka, “Seven neurons memorizing sequences of alphabetical images via spike-timing dependent plasticity,” Scientific Reports, vol. 5, 2015.
 [8] J. Martinez, M. J. Black, and J. Romero, “On human motion prediction using recurrent neural networks,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Jul. 2017.
 [9] P. Vinayavekhin, S. Chaudhury, A. Munawar, D. J. Agravante, G. D. Magistris, D. Kimura, and R. Tachibana, “Focusing on What is Relevant: Time-Series Learning and Understanding using Attention,” in 2018 24th International Conference on Pattern Recognition (ICPR); to appear, Aug. 2018.
 [10] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
 [11] S. Dasgupta and T. Osogami, “Nonlinear Dynamic Boltzmann Machines for Time-Series Prediction,” in AAAI, 2017, pp. 1833–1839.
 [12] H. Jaeger and H. Haas, “Harnessing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication,” Science, vol. 304, no. 5667, pp. 78–80, 2004.