Robotics has positively transformed human lives and work, raising efficiency and safety and providing enhanced services. In restaurant and cooking environments especially, where safety and sanitation are essential, a cooking robot comes in handy. To ensure that the robot can achieve human accuracy in performing motions and can adjust to the environment, such as cooking materials and object states, much work has been done on manipulation motion taxonomy, different grasp taxonomies, and grasp types [2, 3, 4, 5] to allow a robot to "understand" and execute the proper motion.
Beyond motion taxonomy, cooking robots also need detailed knowledge of various manipulation tasks in order to perform kitchen activities successfully, and accuracy has a key effect on the resulting product. Based on the Functional Object-Oriented Network (FOON) video set, pick-and-place is the most frequently executed motion, and pouring is the second most frequent [6, 7]. For pick-and-place, researchers have used visual perception on a robot-arm system to carry out flexible pick-and-place behavior and have used a task-level robot system to carry out dozens of such operations involving various complex environments. This project focuses on the pouring motion. Several researchers have examined different pouring factors and approaches to increase robot pouring skill and accuracy, for instance using stereo vision to recognize liquid and particle flow during pouring, applying two-degrees-of-freedom control to control the liquid level, using parametric hidden Markov models for force-based robot learning of pouring skills, and using an RGB-D camera to track the liquid level.
Besides vision-based methods, recurrent neural networks (RNNs) have been increasing in popularity for sequence learning and generation. Several studies used RNNs to model liquid behavior and to model pouring behavior [9, 10, 11].
An RNN is a type of artificial neural network that is inherently suitable for sequential or time-series data. In an RNN, the connections between units form a directed graph along the sequence, which allows the network to exhibit dynamic temporal behavior over the time sequence. The core idea of an RNN is to use the information from the previous time step in the sequence to produce the current output, and the process continues until every element of the sequence has been given as input. Unlike other neural networks, in an RNN all the inputs are related to each other. Fig. 1 gives an illustration. At the last step, the RNN has all the information from the previous steps to produce predictive results. There are various types of RNN, and each has its advantages and disadvantages. Several studies on pouring motion used peephole Long Short-Term Memory (LSTM) [9, 10, 11]. Hence, for this project, three other common types of recurrent neural network, simple RNN, LSTM, and Gated Recurrent Units (GRU), are evaluated for modeling the pouring behavior. Their mechanisms are explained in Section III-A.
II Data and Preprocessing
The dataset contains a total of 688 pouring sequences and their corresponding weight measurements. Each motion sequence has seven feature dimensions at each timestamp. Only three of them change with time; the other four are constant throughout the entire sequence. Sequence lengths vary, so all sequences are padded with zeros at the end up to the maximum sequence length, which in our case is 700. The purpose of zero post-padding is so that all sequences in a batch share a standard length. Masking is used during training to exclude the padded zeros when computing the loss. Fig. 2 gives a simple illustration of the given dataset. The detailed data collection process can be found in .
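The zero post-padding and masking described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual pipeline: the helper names `pad_sequence` and `masked_mse` are hypothetical, sequences are assumed to be non-empty lists of feature rows, and only the maximum length of 700 comes from the text.

```python
MAX_LEN = 700  # maximum sequence length stated in the text


def pad_sequence(seq, max_len=MAX_LEN):
    """Zero post-pad one sequence of feature rows.

    Returns the padded sequence and a 0/1 mask marking the real time steps.
    Assumes `seq` is a non-empty list of equally sized feature rows.
    """
    if len(seq) > max_len:
        raise ValueError("sequence is longer than max_len")
    n_features = len(seq[0])
    n_pad = max_len - len(seq)
    padded = [list(row) for row in seq] + [[0.0] * n_features for _ in range(n_pad)]
    mask = [1] * len(seq) + [0] * n_pad
    return padded, mask


def masked_mse(y_true, y_pred, mask):
    """Mean squared error computed over the real (unmasked) time steps only."""
    total = sum(m * (t - p) ** 2 for t, p, m in zip(y_true, y_pred, mask))
    return total / sum(mask)
```

Because the mask zeroes out the padded positions and the sum is divided by the number of real steps, predictions made on padding never contribute to the loss, regardless of how long the padding is.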
II-A Data Preprocessing
The input of the network contains a total of six features, and the output of the network has one dimension. Although four of the input features stay constant over time, they still affect the target; this was confirmed in several experiments with different combinations of the input features.
Before the data are fed into a neural network, the input features are normalized to speed up learning, which leads to faster convergence. Two common data normalization methods are min-max normalization and standardization.
Min-max normalization retains the original distribution of values except for a scaling factor and transforms all values to the common range of 0 to 1. However, this technique is not robust, because it is highly sensitive to outliers and because the feature ranges of the test set are uncertain. Therefore, standard scaling is used to normalize the input features. Each input feature is standardized independently by removing the mean and scaling to unit variance, as shown in Eq. 3,
$$x' = \frac{x - \mu}{\sigma} \tag{3}$$
where $\mu$ is the mean of the training samples and $\sigma$ is the standard deviation of the training samples.
Only real data are normalized, whereas the zero-padding remains the same. Fig. 3 gives a simple illustration of the standardized input data.
The data are shuffled and randomly split into 80% (550 trials) for training and 20% (138 trials) for validation. The test dataset is reserved by the TA and instructor for model testing. The normalization scale computed from the training dataset is also applied to the validation and test sets.
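The standardization procedure can be illustrated with a short sketch: the statistics are fitted on the training sequences only and then reused unchanged for validation and test data. The function names are hypothetical, the sequences are assumed to be unpadded when the scaler is fitted (so that padding zeros do not bias the mean), and the population standard deviation is an assumption rather than a detail given in the text.

```python
import math


def fit_standard_scaler(train_sequences):
    """Per-feature mean and standard deviation over all real training steps.

    `train_sequences` is a list of unpadded sequences, each a list of feature rows.
    """
    n_features = len(train_sequences[0][0])
    rows = [row for seq in train_sequences for row in seq]
    means, stds = [], []
    for j in range(n_features):
        column = [row[j] for row in rows]
        mean = sum(column) / len(column)
        var = sum((v - mean) ** 2 for v in column) / len(column)
        means.append(mean)
        # Fall back to 1.0 for a feature with zero variance to avoid dividing by zero.
        stds.append(math.sqrt(var) or 1.0)
    return means, stds


def standardize(seq, means, stds):
    """Apply previously fitted training statistics to one unpadded sequence."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in seq]
```

Standardizing before padding keeps the padded steps at exactly zero, which matches the statement that only real data are normalized.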
III-A RNN Architectures
Simple Recurrent Neural Network (Simple RNN)
The simple RNN has a short-term memory problem due to vanishing and exploding gradients. In other words, the simple RNN has difficulty solving problems that require learning long-term temporal dependencies, which hampers learning on long data sequences as more steps are processed. The gradient is used to update the parameters in the network; when the gradient becomes smaller and smaller, the parameter updates become insignificant, and the network stops learning from the earlier inputs.
The mechanism of the simple RNN is illustrated in Fig. 4 (a). Its update involves the given input vector, the output vector, the output from the previous step, the output at the current step, and the weight parameter.
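For reference, one common textbook form of the simple RNN update is sketched below. The symbol names ($x_t$ for the input vector, $h_{t-1}$ and $h_t$ for the outputs at the previous and current steps, $y_t$ for the output vector, $W$ for the weight parameter, $b$ for a bias term) and the choice of $\tanh$ as the nonlinearity are assumptions made for illustration and may differ from the notation of Fig. 4 (a).

```latex
\begin{aligned}
h_t &= \tanh\left(W\,[h_{t-1},\, x_t] + b\right) \\
y_t &= h_t
\end{aligned}
```

Because the same $W$ is applied at every step, back-propagating through many steps multiplies many similar factors together, which is how gradients come to vanish or explode on long sequences.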
Long Short-Term Memory (LSTM)
LSTM is one of the most popular RNN variants and overcomes the vanishing gradient problem in back-propagation. The mechanism of LSTM is illustrated in Fig. 4 (b). Its update involves the input, output, and forget gates, the cell state, the candidate cell, the sigmoid activation function, and pointwise multiplication.
LSTM contains both cell states and hidden states, where the cell state can remove or add information and maintain it in memory for long periods of time. The gating mechanism in LSTM allows better control over gradient flow and better preservation of long-term dependencies. The three gates are:
Input gate: Updates the cell state.
Forget gate: Decides how much information from the previous state should be kept and what information can be forgotten.
Output gate: Determines the value of the next hidden state, which contains information on previous inputs.
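As a reference for the three gates, a widely used formulation of the LSTM update is sketched below. The notation is an assumption chosen for illustration: $i_t$, $f_t$, $o_t$ are the input, forget, and output gates, $c_t$ is the cell state, $\tilde{c}_t$ is the candidate cell, $h_t$ is the hidden state, $\sigma$ is the sigmoid function, $\odot$ is pointwise multiplication, and each $W$ and $b$ is a weight matrix and bias for the corresponding gate. The ordering and numbering of the equations may differ from the original.

```latex
\begin{aligned}
i_t &= \sigma\left(W_i\,[h_{t-1},\, x_t] + b_i\right) \\
f_t &= \sigma\left(W_f\,[h_{t-1},\, x_t] + b_f\right) \\
o_t &= \sigma\left(W_o\,[h_{t-1},\, x_t] + b_o\right) \\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1},\, x_t] + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The additive form of the cell-state update is what lets information, and gradients, pass across many time steps when the forget gate stays close to 1.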
Gated Recurrent Units (GRU)
GRU is another popular RNN variant intended to solve the vanishing gradient problem. It contains only two gates, a reset gate and an update gate, and is therefore less complex than LSTM. The mechanism of GRU is illustrated in Fig. 4 (c).
GRU shares many properties with LSTM; in particular, a gating mechanism is also used to control the memorization process. GRU contains two gates:
Update gate: Decides whether the cell state should be updated with the candidate state.
Reset gate: Decides whether the previous cell state is important or not.
Comparing GRU with LSTM, one can observe that GRU [Equ. 12-13] is similar to LSTM [Equ. 6-7]. However, GRU uses fewer gates and fewer training parameters, so it requires less memory and is significantly faster to compute than LSTM.
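To make the comparison concrete, one common formulation of the GRU update is sketched below. The notation is an assumption for illustration: $z_t$ is the update gate, $r_t$ is the reset gate, $\tilde{h}_t$ is the candidate state, and $h_t$ is the hidden state. Some references and libraries swap the roles of $z_t$ and $1 - z_t$ in the last line, or apply the reset gate after the recurrent matrix product, so this should be read as one convention among several.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z\,[h_{t-1},\, x_t] + b_z\right) \\
r_t &= \sigma\left(W_r\,[h_{t-1},\, x_t] + b_r\right) \\
\tilde{h}_t &= \tanh\left(W_h\,[r_t \odot h_{t-1},\, x_t] + b_h\right) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

In this form the GRU has three weight blocks where the LSTM has four, so for the same input size and number of units a GRU layer carries roughly three quarters of the parameters of an LSTM layer.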
III-B Proposed Architecture
The initial model was an architecture proposed by , which consists of 4 layers with 16 LSTM cells each. After different recurrent neural networks with varying numbers of layers and units were trained, GRU gave better results than simple RNN and LSTM. The final model architecture is composed of seven GRU layers and one fully connected layer, where each GRU layer returns a full sequence because the following layer needs a full sequence as its input. The mechanism of the GRU unit is explained in Section III-A. A visualization of the architecture can be found in Fig. 5.
The resultant model contains a total of 83,537 parameters, all of which are trainable. The first seven layers of the network are GRU layers: three layers of 64 GRU cells, two layers of 32 GRU cells, and two layers of 16 GRU cells. Sigmoid activation is applied to each of the gates in the GRU (update and reset), so gate values lie in the range of 0 to 1. This matters for updating and forgetting data, because any value multiplied by 0 is 0, which allows that data to be "forgotten", and any value multiplied by 1 is the value itself, which allows that data to be "kept". Therefore, the sigmoid function allows the network to learn only the necessary information. The hyperbolic tangent (tanh) is used as the activation; it is commonly used in RNNs to mitigate the vanishing gradient problem, where a function whose second derivative can sustain for a long range before going to zero is needed. The tanh function squashes values to between -1 and 1 to regulate the output of the neural network.
The last layer in the network is a fully connected layer, which reduces the output dimension to one. Dropout is not used; a detailed analysis is given in Section IV-E.
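The stated parameter total can be checked with a short calculation. The sketch below assumes six input features (as described in Section II-A) and a GRU layer that keeps separate input and recurrent bias vectors for each of its three weight blocks, as some deep learning libraries do by default; under a single-bias convention the total would be smaller. The function names are hypothetical.

```python
def gru_params(input_dim, units):
    """Trainable parameters of one GRU layer.

    Assumes three weight blocks (update gate, reset gate, candidate state),
    each with an input kernel, a recurrent kernel, and two bias vectors.
    """
    return 3 * (units * (input_dim + units) + 2 * units)


def dense_params(input_dim, units):
    """Trainable parameters of a fully connected layer with bias."""
    return input_dim * units + units


def count_model_params(n_features, gru_units, output_dim=1):
    """Total parameters of stacked GRU layers followed by one dense layer."""
    total, input_dim = 0, n_features
    for units in gru_units:
        total += gru_params(input_dim, units)
        input_dim = units  # the next layer consumes this layer's full sequence
    return total + dense_params(input_dim, output_dim)


LAYERS = [64, 64, 64, 32, 32, 16, 16]  # GRU cells per layer, from the text
print(count_model_params(6, LAYERS))  # 83537
```

Under these assumptions the count reproduces the 83,537 parameters reported for the final architecture, with the three 64-cell layers accounting for roughly three quarters of the total.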
III-C Loss Function
For regression problems, the Mean Squared Error (MSE) is commonly used as a loss function for evaluating performance. MSE is the mean, over the observed data, of the squared differences between true and predicted values. Squaring removes the complication of negative signs, so positive and negative errors do not cancel. MSE is defined as:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
where $n$ is the number of data points, $y_i$ is the observed value, and $\hat{y}_i$ is the predicted value.
Other loss functions, such as Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE), were also applied to the proposed models during training, but neither performed better than MSE when the results were compared in the same metric. The metric used for this paper is MSE.
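The three losses differ only in how each error is aggregated, which the following sketch makes explicit. It is a plain-Python illustration on flat lists of values, with hypothetical function names; it omits the masking of padded time steps that the training procedure uses.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the MSE, expressed in the same unit as the target."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean of the absolute differences between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

Comparing models "in the same metric" then means evaluating every trained model with `mse`, whichever of the three functions was used as its training loss.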
III-D Model Setting
The best setting for the model is listed in Table I. The model with the lowest validation loss was selected as the best pouring dynamics estimation model.
|Optimizer||adam (, , )|
|Batch Size||Default 32|
|Initial Learning Rate|
|Learning Schedule||Constant Learning Rate|
|Number of epochs|
IV Evaluation and Results
IV-A Simple RNN, LSTM, and GRU
To determine the best RNN for the pouring dataset, the initial architecture is trained on the simple RNN, LSTM, and GRU individually for 500 epochs. The training is done using the Adam optimizer with a learning rate of .
As a result, the initial architecture achieves an MSE of and on simple RNN and LSTM respectively, whereas GRU resulted in the best error rate of . From the three loss graphs shown in Fig. 6, it is observed that the loss for simple RNN converged right after the first 20 epochs and then stopped decreasing, whereas the loss for LSTM fluctuated frequently and was unstable. Finally, GRU has a slightly better loss graph: the validation loss decreased continuously over the first 100 epochs and stabilized right after. Aside from error rates, GRU also trains faster than simple RNN, which is much slower when training on this architecture.
GRU has been shown to perform better than LSTM on smaller and less frequent datasets, whereas LSTM surpasses GRU on larger datasets . In our case, only 550 sequences are used for training, which is quite small for a deep neural network. More experiments on GRU and LSTM are conducted in Section IV-B.
IV-B Number of Cells and Layers
Many studies used a constant number of internal cells throughout the RNN layers and kept the number of layers between 1 and 4 [8, 9, 10, 11]. To examine the effect of a multi-layer network whose layers contain different numbers of cells, several experiments are conducted in which the number of cells increases or decreases going down the RNN layers.
From Table II, it is observed that increasing the number of cells (Design 1-3) and the number of layers (Design 4-8) does yield a positive impact on the results, especially when using the GRU. Although the improvement is not significant, it shows that model complexity has an effect on the results. By comparing the loss value between GRU and LSTM, we see that GRU has a better performance in various architecture designs, except for Design 5 where the number of cells is increasing as it goes down the layers. More studies are needed to prove this notion.
Design 8, the proposed architecture, is the most complex of the architectures and resulted in the lowest loss when using GRU, although LSTM did not benefit from the additional layer with 64 units added to Design 7. Based on observation, the LSTM loss fluctuates frequently, which could be why that network performs worst within the fixed number of epochs.
As mentioned, when training on the same architecture, GRU is significantly faster than LSTM, and LSTM is less stable than GRU. These observations are confirmed again in the experiments. Therefore, considering both performance and computational cost, GRU is concluded to be the best recurrent neural network for this application.
|Design||Number of layers||Number of cells at each layer||GRU training loss||GRU validation loss||LSTM training loss||LSTM validation loss|
|4||4||16, 16, 16, 16|
|5||4||8, 16, 32, 64|
|6||4||64, 32, 16, 8|
|7||6||64, 64, 32, 32, 16, 16|
|8||7||64, 64, 64, 32, 32, 16, 16|
IV-C Optimizer
The optimizer plays an important role in a neural network. A good optimizer can significantly reduce the loss and provide more accurate results. By far the most popular optimization algorithms are SGD and Adam. In this experiment, the optimizers Adadelta, Adagrad, Adamax, and RMSprop are also evaluated. Training is run for a large number of epochs, 1000, in case some optimizers converge more slowly. The default settings for each optimizer (see Table III) are used with a learning rate of .
|Adagrad||, initial accumulator value =|
|RMSprop||, momentum = ,|
|SGD||momentum = , nesterov = False|
From Fig. 7, one can observe that Adam, Adamax, and RMSprop show roughly the same behavior and validation loss, although Adam and Adamax fluctuate much less than RMSprop. SGD and Adagrad show an interesting trend: the validation loss decreases visibly after remaining unchanged for several hundred epochs, around 300 epochs for SGD and around 600 epochs for Adagrad. Compared with the other optimizers, Adadelta converges much more slowly and has a much higher validation loss. Of course, with different initial learning rates and decay schedules, some optimizers might have behaved better than others. However, based on the current setting, Adam is chosen as the best optimizer for the pouring dynamics RNN.
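For readers unfamiliar with the chosen optimizer, the Adam update rule can be written as a single function on one scalar parameter. This is a minimal sketch, not the training code used in the experiments: the default values of `lr`, `beta1`, `beta2`, and `eps` below are commonly used defaults and are assumptions, not the settings of Table III, and the quadratic objective is a toy example.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


def minimize_quadratic(steps=2000, lr=0.01):
    """Toy example: minimize (theta - 3)^2 starting from theta = 5."""
    theta, m, v = 5.0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = 2.0 * (theta - 3.0)
        theta, m, v = adam_step(theta, grad, m, v, t, lr=lr)
    return theta
```

Dividing by the running root-mean-square of the gradient keeps the step size close to `lr` regardless of the gradient's scale, which is one plausible reason for the smoother loss curves compared with plain SGD.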
IV-D Learning Rate
The learning rate controls how quickly a model adapts to the problem. A smaller learning rate may allow the model to learn a more optimal set of weights but requires a longer training time, whereas a larger learning rate may cause the model to converge too quickly to a suboptimal solution. Three common learning rate schedules are evaluated: constant, step decay, and exponential decay. To isolate the effect of the learning rate, all other hyperparameters are kept at the same setting. The model is trained for 500 epochs, and the lowest validation loss is recorded in Table IV.
One can observe that exponential decay performs worst of the three schedules, while step decay performs better in general but does not gain much from lowering the learning rate. If the number of epochs is increased, the decayed learning rate becomes so small that it causes little to no update to the weights in the network. Finally, a constant learning schedule with a rate of surprisingly outperformed the others on both training and validation. Therefore, to allow the network to have a sustainable learning process, a constant learning rate is used.
|step decay 0.5 every 10 epochs|
|step decay 0.5 every 20 epochs|
|step decay 0.5 every 30 epochs|
|step decay 0.5 every 40 epochs|
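The three schedules compared in this section can each be written as a function of the epoch index. The sketch below is illustrative: the function names are hypothetical, the step decay factor of 0.5 and the drop intervals follow the rows of Table IV, and the exponential decay constant `k` is an assumed value, since the one used in the experiments is not stated.

```python
import math


def constant_lr(initial_lr, epoch):
    """Constant schedule: the learning rate never changes."""
    return initial_lr


def step_decay_lr(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Step decay: multiply the rate by `drop` once every `epochs_per_drop` epochs."""
    return initial_lr * drop ** (epoch // epochs_per_drop)


def exponential_decay_lr(initial_lr, epoch, k=0.01):
    """Exponential decay: the rate shrinks continuously; `k` is an assumed constant."""
    return initial_lr * math.exp(-k * epoch)
```

With a drop every 10 epochs, step decay has halved the rate 50 times by epoch 500, so the rate is then about 10^-15 of its initial value. This is consistent with the observation that decayed rates eventually produce almost no weight updates.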
IV-E Dropout
Dropout is commonly used to reduce overfitting. For RNNs, it is important not to apply dropout on the connections that convey time-related information . Experimentally, dropout rates of 0.2, 0.1, and 0.05 were applied right after the last GRU and LSTM layer of the proposed architecture. As a result, the network performed worse on both the training and validation sets. Therefore, no dropout is used in the final model. However, to avoid significant overfitting, the model is trained for a fixed number of epochs, 1500, and the model with the lowest loss is kept. More importantly, GRU is less prone to overfitting since it has only two gates while LSTM has three; thus, dropout becomes less necessary for GRU.
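Keeping the model with the lowest loss over a fixed number of epochs amounts to a simple checkpointing loop, sketched below. The function and its two callbacks, `train_one_epoch` and `evaluate`, are hypothetical placeholders for whatever the training framework provides, and the assumption that the tracked loss is the validation loss follows Section III-D.

```python
import copy


def train_keep_best(model_state, train_one_epoch, evaluate, n_epochs):
    """Train for a fixed number of epochs and return the best state seen.

    `train_one_epoch(state)` returns the updated state and `evaluate(state)`
    returns its validation loss; both are supplied by the caller.
    """
    best_loss = float("inf")
    best_state = copy.deepcopy(model_state)
    for _ in range(n_epochs):
        model_state = train_one_epoch(model_state)
        val_loss = evaluate(model_state)
        if val_loss < best_loss:
            # Snapshot the weights so later, worse epochs cannot overwrite them.
            best_loss, best_state = val_loss, copy.deepcopy(model_state)
    return best_state, best_loss
```

Because the best snapshot is retained, training past the point where validation loss starts to rise does not degrade the final model, which is why a fixed epoch budget can stand in for dropout here.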
IV-F Result and Observation
By training the model using the same setting multiple times, the proposed model is able to achieve MSE in the range of to on the validation dataset, which is relatively low and stable.
The prediction and ground truth comparison graphs on the validation dataset are shown in Fig. 8, where the x-axis is the time step and the y-axis is the weight with unit of . Based on the observation, it can be concluded that the model is able to learn and predict in advance the general patterns of the pouring motion, i.e., changing gradually with respect to time. More importantly, the model is able to predict the change in the amount of water in the pouring cup accurately: the blue line (ground truth) and the red line (network output) match almost perfectly. However, upon looking at the dataset, we observed that there is at least one outlier (last sample in the first row of Fig. 8), where the first 50 time steps show a small drop and rise in weight . This causes the model to make an incorrect prediction for such a pattern around those time steps, because the pattern is unseen or rarely seen in the training dataset.
In this pouring dynamics estimation project, various recurrent neural networks have been investigated with different hyperparameter settings to estimate the change in the amount of water in the pouring cup in response to sequences of pouring motion. Experimentally, it is found that GRU outperforms LSTM in both computational cost and performance on the pouring dataset. In addition, the evidence shows that when the exact same architecture is trained with the same setting, the network predictions show some variance, within the range of , even with no dropout applied.
The proposed model achieved an MSE as low as , which is a decent loss value. Given the dataset, the model may already have reached its limit. However, there are some restrictions in the experiment that may limit the model's potential to achieve a lower error. The given dataset is rather small for a deep neural network; with more trials, the model would be able to learn more features and patterns from the data. Additionally, the dataset should contain more variation, with pouring trials under different environment settings, materials, and other related features, such as rotation angle, velocity, and cup geometry. By doing so, the model can learn more variation directly from the dataset, which can result in a more robust network. Much research is ongoing on generating dynamic responses to motion sequences. A robust model would allow robots to achieve human accuracy in executing motions.
For future work, other types of RNN can be explored, such as continuous-time RNN (CTRNN), recurrent multi-layer perceptron network (RMLP), and multiple timescales RNN (MTRNN). Different RNN architectures could result in different behaviors.
-  Paulius, David, et al. “Manipulation Motion Taxonomy and Coding for Robots.” ArXiv.org, 31 July 2020, arxiv.org/abs/1910.00532.
-  M. R. Cutkosky, “On grasp choice, grasp models, and the design of hands for manufacturing tasks,” IEEE Transactions on robotics and automation, vol. 5, no. 3, pp. 269–279, 1989.
-  F. Wörgötter, E. E. Aksoy, N. Krüger, J. Piater, A. Ude, and M. Tamosiunaite, “A simple ontology of manipulation actions based on hand-object relations,” IEEE Transactions on Autonomous Mental Development, vol. 5, no. 2, pp. 117–134, 2013.
-  I. M. Bullock, R. R. Ma, and A. M. Dollar, “A hand-centric classification of human and robot dexterous manipulation,” IEEE transactions on Haptics, vol. 6, no. 2, pp. 129–144, 2013.
-  T. Feix, J. Romero, H.-B. Schmiedmayer, A. M. Dollar, and D. Kragic, “The GRASP taxonomy of human grasp types,” IEEE Transactions on Human-Machine Systems, vol. 46, no. 1, pp. 66–77, 2016.
-  David Paulius, Yongqiang Huang, Roger Milton, William D Buchanan, Jeanine Sam, and Yu Sun. Functional object-oriented network for manipulation learning. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2655–2662. IEEE, 2016.
-  David Paulius, Ahmad B Jelodar, and Yu Sun. Functional object-oriented network: Construction & expansion. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–7. IEEE, 2018.
-  Tianze Chen, Yongqiang Huang, and Yu Sun. Accurate pouring using model predictive control enabled by recurrent neural network. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019.
-  Yongqiang Huang and Yu Sun. Learning to pour. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7005-7010. IEEE, 2017.
-  Juan Wilches, Yongqiang Huang, and Yu Sun. Generalizing learned manipulation skills in practice. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 9322-9328, 2020.
-  Yongqiang Huang, Juan Wilches, and Yu Sun. Robot gaining accurate pouring skills through self-supervised learning and generalization. Robotics and Autonomous Systems, 136:103692, 2021.
-  “Task-Level Planning of Pick-and-Place Robot Motions.” IEEE Xplore, ieeexplore.ieee.org/abstract/document/16222.
-  S. Saravana Perumaal, N. Jawahar. “Automated Trajectory Planner of Industrial Robot for Pick-and-Place Task - S. Saravana Perumaal, N. Jawahar, 2013.” SAGE Journals, journals.sagepub.com/doi/full/10.5772/53940.
-  “Stereo Vision of Liquid and Particle Flow for Robot Pouring.” IEEE Xplore, ieeexplore.ieee.org/abstract/document/7803419.
-  Sugimoto, Yu, et al. “LIQUID LEVEL CONTROL OF AUTOMATIC POURING ROBOT BY TWO-DEGREES-OF-FREEDOM CONTROL.” IFAC Proceedings Volumes, Elsevier, 25 Apr. 2016, www.sciencedirect.com/science/article/pii/S1474667015395951.
-  “Force-Based Robot Learning of Pouring Skills Using Parametric Hidden Markov Models.” IEEE Xplore, ieeexplore.ieee.org/abstract/document/6614613.
-  Do, Chau. “Accurate Pouring with an Autonomous Robot Using an RGB-D Camera.” SpringerLink.
-  Yongqiang Huang, Yu Sun. “A Dataset of Daily Interactive Manipulation - Yongqiang Huang, Yu Sun, 2019.” SAGE Journals, journals.sagepub.com/doi/full/10.1177/0278364919849091.
-  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  “LSTM and GRU Neural Network Performance Comparison Study: Taking Yelp Review Dataset as an Example.” IEEE Xplore, ieeexplore.ieee.org/abstract/document/9221727.
-  Zaremba, Wojciech, et al. “Recurrent Neural Network Regularization.” ArXiv.org, 19 Feb. 2015, arxiv.org/abs/1409.2329.