Finger Grip Force Estimation from Video using Two Stream Approach

03/05/2018 ∙ by Andrey Sartison, et al. ∙ Skoltech

Estimation of hand grip force is essential for understanding the force pattern during the execution of assembly or disassembly operations. Human demonstration of the correct way to perform an operation is a powerful source of information which can be used for guided robot teaching. Typically, this problem is addressed with an instrumented approach, which requires hand-mounted or object-mounted devices and either inconveniences the operator or limits the scope of addressable objects. This work demonstrates that contact force may be estimated using a noninvasive contactless method with the help of a vision system alone. We propose a two-stream approach for video processing, which utilizes both the spatial information of each frame and the dynamic information of frame change. In this work, image processing and machine learning techniques are used along with dense optical flow for tracking frame change, and a Kalman filter is used for stream fusion. Our studies show that the proposed method can successfully estimate contact grip force (RMSE < 10 N); the performance of each stream and of the overall method is reported. The proposed method has a wide range of applications, including robot teaching through demonstration, haptic force feedback, and validation of human-performed operations.




I Introduction

Automated disassembly is one of the enabling technologies for sustainable development because it is a clean and energy-efficient way to recover valuable materials from e-waste. Limited landfill capacity, unstable material prices, and growing labor costs make automation feasible and raise interest in the field, as demonstrated by the development of Liam [1], an automated line for the thorough disassembly of the iPhone 6. It consists of more than 20 tailored robots, which execute the disassembly sequence step by step. However, this approach is fundamentally limited to large-scale mass-produced products. Our goal in RecyBot, a joint Skoltech-MIT project, is to develop a universal high-speed intelligent robotic system for electronics recycling. The RecyBot system consists of several robots, each tailored to perform a specific subset of tasks, whose joint target is to disassemble smartphones to the component level and enable material recovery.

With the current prevalence of collaborative robots, force feedback is increasingly used in automation applications where active compliance is essential, such as gentle object handling and insertion operations with tight margins. In RecyBot these include, for example, screen disassembly, clamp removal, manipulation of fragile and flexible components, and cable disconnection. The large variety of such operations encourages the use of machine learning methods to deduce the position and the compliance of the end effectors, but this approach is limited by the absence of a relevant dataset.

To create such a dataset, an invasive instrumented approach may be used [2, 3, 4], in which the disassembly tools are equipped with a force sensor and markers, and a skilled human is recorded disassembling a smartphone with these tools. The data collected with this approach usually has a small statistical error, but considerable effort is required to collect it, and systematic errors may be present due to invasiveness or other restrictions, such as the ability to measure force only at specific points of the manipulated object [5]. Other studies require markers to be applied to a hand or a part of a hand [6, 7].

On the other hand, a noninvasive approach, in which position and force are estimated from videos, benefits strongly from the vast number of smartphone disassembly videos available online. Position estimation in this approach is well studied [8, 9]. The state-of-the-art algorithm [9] uses a deep convolutional architecture combined with implicit spatial modeling. At the same time, gripping force measurement remains a challenging task. The gripping force is a combination of dynamic forces induced by the motion of the hand with an object and the force caused by the finger pressure pattern itself. In our work, the dynamic forces (see [5] for details) are assumed to be negligible because the interaction object remains fixed.

To fully exploit recent advances in robot learning by demonstration [10, 11], we propose to add force profile measurement to existing hand-tracking systems. Current studies describe methods of obtaining information on finger positions during the grasp [8, 9, 12], but only a very limited set of works focuses on the actual force information. A prominent class of works is based on the property of the human fingernail to change its color when force is applied to the fingertip [4, 7, 13]. In [13] the authors used a Convolutional Neural Network (CNN) to find the finger alignment and texture mapping to produce a properly aligned representation of a fingernail. A Gaussian Process based on the aligned images was used to estimate the force and torque applied to a fingertip. This method demonstrated good results. However, it places strict limitations on the camera position: the camera must always face the nail.

In our approach, only video information is used to estimate the gripping force at any interaction point. This reduces the limitations on the camera position relative to the hand in comparison with [4], [5], [6] and utilizes dynamic information of hand movement to enhance precision.

In this work, we concentrate only on estimating the normal gripping force of a two-finger grip. The proposed algorithm is a two-stream approach with a temporal and a spatial stream. First, this paper describes the hardware setup for collecting a dataset. Then we cover the implementation details of the algorithm: signal filtering, image preprocessing techniques, the machine learning model, and data fusion. Finally, we present the results of the algorithm.

II Methodology

In this section, we describe the hardware setup used to acquire the data and the approach used to process that data to estimate gripping force.

II-A Hardware setup

The setup for recording video and force measurements is shown in Fig. 1.

Fig. 1: Hardware setup

A special stand keeps the camera stationary and the rigid tweezers model fixed relative to the camera. The rigid tweezers model was 3D printed. It also includes a round marker of a distinctive yellow color. A Canon EOS Rebel SL2 camera is used for shooting videos. Videos have Full HD (1920x1080) resolution and a frame rate of 59.95 fps. A pair of FSR 402 Force Sensing Resistors by Interlink Electronics is used as the force sensor. A green LED is used to synchronize video and force measurements: it lights up when the recording of measurements starts and turns off after 100 seconds. An Arduino UNO is used for analog-to-digital conversion of the FSR signals and for transmission of the signal to a PC for further processing. Measurements are read at a frequency of 100 Hz. Each measurement consists of two raw 10-bit values (for the left and right sensor respectively) with a millisecond timestamp.
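Since the force sensors are sampled at 100 Hz and the video at 59.95 fps, the force readings have to be brought to frame times before they can serve as per-frame labels. The following is a minimal sketch of one way to do this, assuming linear interpolation and assuming that timestamp 0 coincides with the first frame in which the LED is lit; neither assumption is stated in the text, and the function name `force_at_frames` is hypothetical.

```python
import numpy as np

FPS = 59.95          # video frame rate reported above
FORCE_RATE_HZ = 100  # FSR sampling rate reported above

def force_at_frames(t_ms, left_raw, right_raw, n_frames, fps=FPS):
    """Linearly interpolate raw 10-bit FSR readings at video frame times.

    Assumption (not from the paper): t_ms = 0 coincides with the first
    video frame in which the synchronization LED is lit.
    """
    t_force = np.asarray(t_ms, dtype=float) / 1000.0   # ms -> s
    t_frames = np.arange(n_frames) / fps               # frame times in s
    left = np.interp(t_frames, t_force, left_raw)
    right = np.interp(t_frames, t_force, right_raw)
    return t_frames, left, right

# Synthetic example: 1 s of readings whose value equals the time in ms.
t_ms = np.arange(0, 1000, 1000 // FORCE_RATE_HZ)   # 0, 10, ..., 990 ms
raw = np.linspace(0, 990, t_ms.size)
t_frames, left, right = force_at_frames(t_ms, raw, raw, n_frames=30)
```

Interpolating the force signal onto frame times, rather than the reverse, keeps one label per frame, which matches the per-frame features used by both streams.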

II-B Two stream architecture for force estimation

Our architecture (Fig. 2) is based on the two-stream hypothesis of human visual perception [14]. According to this hypothesis, the human visual system consists of two broad streams of projections: a 'ventral stream', which is responsible for object recognition, and a 'dorsal stream', which processes action information. This approach has demonstrated good results in computer vision tasks related to action recognition [15].


Fig. 2: Two stream architecture

In our approach, we naturally divide the video into two separate components. One component is spatial, and it is represented as a set of independent sequential frames. Another element is temporal, and it relates to changing information on sequential frames. The method was constructed accordingly and has two corresponding signal paths or streams.

Using the static information in the form of separate frames is an obvious approach, and it has previously been used for a similar purpose [2]. It was shown that a spatial stream on its own can demonstrate good results for this task. However, as will be shown in Sect. III, the temporal stream may perform better than the spatial one and substantially increase the accuracy of estimation.

The overall method is divided into three steps: general image preprocessing, stream processing, and stream fusion. Each stream is implemented as a process consisting of two components: a preprocessing function and a machine learning model. The streams differ in their preprocessing but share a similar machine learning phase. A Kalman filter is used to combine the information from both streams into the final estimate of the applied force.

We assume that the absolute value of the current force can be estimated using the spatial stream, and that the dynamic information (the derivative of the applied force) can be estimated using the temporal stream. We also assume that merging the two streams makes the model more accurate and robust than either stream alone.

II-C General image preprocessing

General image preprocessing was done for every frame of a video (Fig. 3(a)) before feeding it to the spatial and temporal streams. The preprocessing phase aimed to mask all the irrelevant parts of a frame and to cancel large movements of the camera and the hand. The rotation, translation and scale components of the movements were calculated and canceled. These actions made it possible to represent frames uniformly, independently of small changes in hand position and orientation.

During the preprocessing phase the following actions were performed:

  1. Segmentation of skin

  2. Marker recognition

  3. Scale, translation and rotation calculation, and compensation

The segmentation of skin (Fig. 3(b)) was done using a naive computer vision algorithm. A threshold filter was applied to roughly separate skin from other objects and the background. Several iterations of successive morphological operations were then applied to the thresholded image to filter out noise and refine the edge.
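The threshold-plus-morphology idea can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the color space (HSV), the threshold bounds `lo` and `hi`, the 3x3 structuring element, and the order of the morphological operations (opening followed by closing) are all hypothetical choices, and the morphology is written in plain NumPy so that the block stays self-contained.

```python
import numpy as np

def _shifted_stack(mask):
    """Stack the nine 3x3-neighbourhood shifts of a boolean mask (zero padded)."""
    padded = np.pad(mask, 1, constant_values=False)
    h, w = mask.shape
    return np.stack([padded[dy:dy + h, dx:dx + w]
                     for dy in range(3) for dx in range(3)])

def erode(mask):
    """A pixel survives only if its whole 3x3 neighbourhood is set."""
    return _shifted_stack(mask).all(axis=0)

def dilate(mask):
    """A pixel is set if any pixel in its 3x3 neighbourhood is set."""
    return _shifted_stack(mask).any(axis=0)

def segment_skin(hsv, lo, hi, iterations=2):
    """Threshold an HSV image, then open and close the resulting mask."""
    lo, hi = np.asarray(lo), np.asarray(hi)
    mask = np.all((hsv >= lo) & (hsv <= hi), axis=-1)
    for _ in range(iterations):   # opening, part 1: remove isolated speckles
        mask = erode(mask)
    for _ in range(iterations):   # opening, part 2: restore the region size
        mask = dilate(mask)
    for _ in range(iterations):   # closing, part 1: fill small holes
        mask = dilate(mask)
    for _ in range(iterations):   # closing, part 2: restore the region size
        mask = erode(mask)
    return mask
```

In practice a library routine for morphological filtering would normally replace the hand-written `erode` and `dilate`; they are spelled out here only to make the operations explicit.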

Two markers were detected in each frame to ease image processing. The first marker was applied to the tool itself and had a distinct yellow color. It was located between the two Force Sensing Resistors and was segmented using threshold and morphology filters (Fig. 3(c)). The second marker was calculated as the center of mass of the skin region (Fig. 3(b)); for the grip pattern used, it was located approximately at the point between the metacarpophalangeal joints of the index finger and the thumb. The central marker was placed at the midpoint of the line connecting the first and second markers (Fig. 3(c)). These markers were calculated for a rough and fast estimation of frame scale and rotation.

Using the positions of marker one and marker two, scale, translation and rotation operations were applied to achieve the following conditions:

  • The distance between marker one and the central marker has unit length,

  • Axis X starts at marker one and points toward marker two

  • Marker one has coordinate.
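The conditions listed above define a similarity transform, which can be sketched as follows. The sketch assumes that marker one is mapped to the origin; the exact target coordinate is not recoverable from the text, so this is an assumption, and the function names are hypothetical.

```python
import numpy as np

def alignment_transform(marker_one, marker_two):
    """Return a 2x3 similarity matrix A such that A @ [x, y, 1] is aligned.

    Assumption (not from the paper): marker one is mapped to the origin.
    """
    m1 = np.asarray(marker_one, dtype=float)
    m2 = np.asarray(marker_two, dtype=float)
    central = (m1 + m2) / 2.0                     # midpoint of the marker line
    scale = 1.0 / np.linalg.norm(central - m1)    # marker one -> central = 1
    angle = np.arctan2(m2[1] - m1[1], m2[0] - m1[0])
    c, s = np.cos(-angle), np.sin(-angle)         # rotate marker line onto +X
    rot = scale * np.array([[c, -s], [s, c]])
    return np.hstack([rot, (-rot @ m1)[:, None]])

def apply_transform(A, points):
    """Apply the 2x3 transform to an (n, 2) array of points."""
    pts = np.asarray(points, dtype=float)
    return pts @ A[:, :2].T + A[:, 2]

# Example: a vertical marker line of length 100 px.
A = alignment_transform((100, 200), (100, 300))
aligned = apply_transform(A, [(100, 200), (100, 250), (100, 300)])
```

In this example marker one, the central marker and marker two land at x = 0, 1 and 2 on the X axis, so the first two conditions hold by construction.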

(a) Initial frame of a video
(b) The frame after skin segmentation. Only skin is shown on this frame
(c) The frame with canceled rotational and translational movements
Fig. 3: Markers parsing and motion canceling

II-D Spatial stream implementation

The spatial stream is used to estimate the absolute gripping force applied to an object. It may be considered a robust standalone method; however, as will be shown later, it does not reach sufficient accuracy on its own. Nevertheless, it is an essential part of the overall method: it makes it possible to eliminate the accumulated error that may be introduced by integrating the temporal stream output.

The overall scheme of the spatial stream is illustrated in Fig. 4.

Fig. 4: Spatial stream architecture

The basic idea behind the proposed method is similar to [2, 13]: we exploit the fact that force applied to a fingertip causes a change in skin color. However, the previously proposed methods are based on nail color change, which requires either an additional device installed on the nail or a nail that is clearly visible to the camera during the operation. We concentrate mostly on skin color change instead, as the skin is more likely to be visible.

The preprocessing function of the spatial stream uses the output of the general image preprocessing algorithm. The region with the fingertips is cropped from the obtained image (Fig. 5(a)) and the skin mask is applied (Fig. 5(b)).

(a) Region of the preprocessed frame with fingertips cropped
(b) Cropped image with skin mask applied
Fig. 5: Fingertips region of a frame

The region of interest is determined as a rectangular region around marker one with constant width and height. These dimensions were chosen so that the region includes only the distal phalanges of the index finger and the thumb. These phalanges mostly stayed immobile during the operation. Moreover, the main skin color change is confined to the chosen region.

A relative histogram in HSV space is computed from the cropped and masked image (Fig. 5(b)); it is shown in Fig. 6. A separate histogram with 20 bins is computed for each channel. The histogram limits were chosen to cover mainly the significant part of each channel's value range. Together, the three histograms form 60 features for further processing.
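The feature construction described above (three relative 20-bin histograms concatenated into 60 features) can be sketched as follows. The per-channel limits passed in `limits` are hypothetical, since the limits actually used are not given in the text, and the function name is illustrative.

```python
import numpy as np

N_BINS = 20  # bins per channel, as stated in the text

def hsv_histogram_features(hsv, mask, limits):
    """Concatenate relative H, S and V histograms of the masked pixels.

    hsv:    (h, w, 3) array with the cropped fingertip region in HSV space
    mask:   (h, w) boolean skin mask
    limits: three (low, high) pairs, one per channel (hypothetical values)
    """
    pixels = hsv[mask]                                  # (n_skin_pixels, 3)
    n = max(len(pixels), 1)                             # avoid division by zero
    features = []
    for channel, (lo, hi) in enumerate(limits):
        counts, _ = np.histogram(pixels[:, channel], bins=N_BINS, range=(lo, hi))
        features.append(counts / n)                     # relative frequencies
    return np.concatenate(features)                     # 3 * 20 = 60 features
```

Normalizing by the number of skin pixels makes the features insensitive to the size of the visible skin area, which is presumably the purpose of using a relative histogram.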

(a) Hue histogram
(b) Saturation histogram
(c) Value histogram
Fig. 6: Histogram for masked fingertips region

II-E Temporal stream implementation

The temporal stream is used to estimate the first derivative of the grip force. As will be shown later, this method demonstrates sufficient accuracy, but in some circumstances it may accumulate error due to integration during fusion. The scheme of the temporal stream implementation is shown in Fig. 7. The method is based on dense optical flow [16]. It relies on the assumption that the distribution pattern of the optical flow of a clenching hand correlates with the change in gripping force.
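The accumulated error mentioned above follows directly from integrating a derivative estimate. As a minimal illustration with synthetic numbers (the bias value is hypothetical and not taken from the experiments), a constant bias in the predicted per-frame difference grows linearly after cumulative summation:

```python
import numpy as np

def integrate_differences(d_force, f0=0.0):
    """Recover a force signal from per-frame differences by cumulative sum."""
    return f0 + np.cumsum(d_force)

true_diff = np.zeros(600)        # constant force held for about 10 s of video
biased = true_diff + 0.01        # hypothetical constant prediction bias
drift = integrate_differences(biased) - integrate_differences(true_diff)
# After 600 frames the integrated error is about 600 * 0.01 = 6 units.
```

This is why an absolute force estimate is needed alongside the derivative estimate: it anchors the integrated signal and bounds the drift.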

Fig. 7: Temporal stream architecture

During the preprocessing phase, the dense optical flow of the initial video is calculated using the implementation from [17] with a fixed set of parameters, including the number of pyramid levels. The optical flow is then processed further. The rotation, translation and scale operations obtained during the general preprocessing step are applied to the optical flow to cancel large movements and align the flow accordingly. The skin mask from the general preprocessing step is used to zero out non-hand movements, such as movements of background objects or of the camera. For the masked optical flow matrix, the mean over the whole image is computed for each component. The resulting two-component vector is subtracted from each non-masked component of the optical flow. This step further minimizes the common rigid movement of the hand. The output of this step is referred to as rectified optical flow.
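The masking and mean-subtraction step can be sketched as follows. The sketch assumes that the mean is taken over skin pixels only; the text leaves open whether zeroed background pixels also enter the mean, so this is an interpretation, and the function name is hypothetical.

```python
import numpy as np

def rectify_flow(flow, skin_mask):
    """Zero out non-skin flow and subtract the mean skin flow vector.

    flow:      (h, w, 2) dense optical flow, already aligned
    skin_mask: (h, w) boolean mask from the general preprocessing step
    Assumption: the mean is computed over skin pixels only.
    """
    rectified = np.zeros_like(flow, dtype=float)
    if skin_mask.any():
        mean_vec = flow[skin_mask].mean(axis=0)        # two-component mean
        rectified[skin_mask] = flow[skin_mask] - mean_vec
    return rectified
```

After this step the remaining flow describes how parts of the hand move relative to each other, which is the signal of interest for clenching and unclenching.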

Rectified optical flow vectors are converted to polar coordinates. The sign of each vector's magnitude is then set according to its direction: a vector that points toward the middle line gets a positive sign, otherwise a negative one. More formally, this can be written as (1):

$$\hat{m} = \begin{cases} m, & (y - y_c)\,\sin\theta < 0 \\ -m, & \text{otherwise} \end{cases} \tag{1}$$

where

$\hat{m}$ is the new magnitude of the vector,

$m$ is the old magnitude of the vector,

$y$ is the y position of the vector,

$\theta$ is the angle of the vector orientation,

$y_c$ is the y position of the central marker.

The motivation comes from the assumption that if an optical flow vector points toward the middle line, the hand is clenching and the vector adds to the overall force change. Otherwise, the hand is unclenching and the total force change decreases.

The optical flow vectors are combined into a two-dimensional histogram (Fig. 8(b), 8(d)), with one axis being the direction of the vectors and the other axis being their magnitude. The histogram limits are chosen so that a fixed share of all the optical flow vectors found in our measurements falls within the limits.
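The sign rule and the two-dimensional histogram can be sketched together as follows. Several details are assumptions made for illustration: the middle line is taken to be the horizontal line through the central marker, the bin layout is 10 x 10 (chosen only because it yields 100 values), and `mag_limit` is a hypothetical histogram limit.

```python
import numpy as np

def signed_flow_histogram(flow, ys, y_c, mag_limit, bins=(10, 10)):
    """2D histogram over (direction, signed magnitude) of flow vectors.

    flow: (n, 2) array of (dx, dy) vectors
    ys:   (n,) y positions of the vectors
    y_c:  y position of the middle line (assumed horizontal)
    """
    magnitude = np.hypot(flow[:, 0], flow[:, 1])
    theta = np.arctan2(flow[:, 1], flow[:, 0])
    # A vector points toward the middle line when its vertical component
    # has the opposite sign of its vertical offset from that line.
    toward = (ys - y_c) * np.sin(theta) < 0
    signed = np.where(toward, magnitude, -magnitude)
    hist, _, _ = np.histogram2d(
        theta, signed, bins=bins,
        range=[(-np.pi, np.pi), (-mag_limit, mag_limit)],
    )
    return signed, hist.ravel()

# Two vectors below the middle line: one moving toward it, one away from it.
flow = np.array([[0.0, -1.0], [0.0, 1.0]])
signed, features = signed_flow_histogram(flow, np.array([5.0, 5.0]), 0.0, 2.0)
```

Vectors outside the histogram limits are simply not counted by `np.histogram2d`, which corresponds to choosing limits that cover only part of the observed vectors.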

(a) Optical flow of clenching frame
(b) 2D histogram of clenching frame
(c) Optical flow of unclenching frame
(d) 2D histogram of unclenching frame
Fig. 8: Optical flow and histograms for clenching and unclenching hands. Green lines represent optical flow vectors with positive magnitude, blue lines - with negative

II-F Filtering

Both the histograms and the discrete difference of the force measurements were filtered with a Butterworth low-pass filter whose characteristics are shown in Table I. The cutoff frequency was chosen to be 3 Hz because, in our experiments, this was the maximum dominant frequency at which a human could clench and unclench the test object in a fully controlled way.

Parameter | Cut-off frequency | Order | Sample rate
Value | 3 Hz | 1 | 59.95 Hz
TABLE I: Butterworth filter parameters
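For a first-order Butterworth low-pass filter, the digital coefficients can be written in closed form using the bilinear transform with frequency prewarping. The sketch below uses the parameters from Table I; whether the filter was applied causally (as here) or forward and backward for zero phase is not stated in the text, so the causal form is an assumption. A library design routine such as SciPy's `scipy.signal.butter` is expected to produce equivalent coefficients for the same order, cutoff and sampling rate, although that correspondence is not checked here.

```python
import math

def butter1_lowpass_coeffs(cutoff_hz, fs_hz):
    """First-order Butterworth low-pass via the prewarped bilinear transform."""
    k = math.tan(math.pi * cutoff_hz / fs_hz)
    b = [k / (1.0 + k), k / (1.0 + k)]       # numerator coefficients
    a = [1.0, (k - 1.0) / (k + 1.0)]         # denominator coefficients
    return b, a

def lowpass(signal, cutoff_hz=3.0, fs_hz=59.95):
    """Causal filtering: y[n] = b0*x[n] + b1*x[n-1] - a1*y[n-1]."""
    b, a = butter1_lowpass_coeffs(cutoff_hz, fs_hz)
    out, x_prev, y_prev = [], 0.0, 0.0
    for x in signal:
        y = b[0] * x + b[1] * x_prev - a[1] * y_prev
        out.append(y)
        x_prev, y_prev = x, y
    return out
```

By construction the filter has unit gain at zero frequency and a gain of 1/sqrt(2) at the cutoff frequency, so slow force changes pass almost unchanged while frame-to-frame noise is attenuated.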

II-G Machine learning ensemble train and prediction

The derived histograms were used to train a machine learning ensemble. This step is the same for the spatial and temporal streams, but each stream has a separate model. We used auto-sklearn [18] to find and train the ensembles. Data points had 60 features for the spatial stream and 100 features for the temporal stream.

In total, the dataset consisted of two 100-second videos at 59.95 fps, which gave 11990 frames. Of these, 8990 frames were used for training and 3000 for testing. The test data were selected as sequential sets of frames, each consisting of 1500 images (about 25 seconds of video), and were never seen by the model during training. Successive sequences were chosen instead of random frames for two reasons:

  1. Successive frames may be very similar. If the train and test sets were chosen randomly, the model could simply memorize one frame and thereby give a reasonable prediction for the neighbouring frame in the test set.

  2. A successive sequence eases the visualization of results.

The temporal stream labels were the average of the left and right components of the filtered discrete-difference vectors. The spatial stream labels were the average of the left and right components of the ground truth force measurements.
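The frame counts above can be reproduced with a simple contiguous hold-out. In this sketch the position of the test block inside each video (`test_start`) is a hypothetical choice, since the text does not say where the held-out sequences were taken from.

```python
import numpy as np

def contiguous_split(n_frames, test_len, test_start):
    """Boolean test mask for one video with a single contiguous test block."""
    test = np.zeros(n_frames, dtype=bool)
    test[test_start:test_start + test_len] = True
    return test

# Two videos of 5995 frames each (100 s at 59.95 fps), 1500 test frames each.
masks = [contiguous_split(5995, 1500, test_start=4495) for _ in range(2)]
n_test = sum(int(m.sum()) for m in masks)
n_train = sum(int((~m).sum()) for m in masks)
```

With these numbers `n_test` is 3000 and `n_train` is 8990, matching the split described in the text.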

II-H Stream fusion

The predictions of the spatial and temporal streams are very noisy. Moreover, the data is redundant: the streams predict a variable and the derivative of the same variable. Thus, a Kalman filter was used to combine the results of the temporal and spatial streams and to filter out noise.

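The proposed filter has the magnitude of the force $F$ and the discrete difference of the force $\Delta F$ both as the state vector and as the observation vector (2):

$$\mathbf{x} = \begin{bmatrix} F \\ \Delta F \end{bmatrix}, \qquad \mathbf{z} = \begin{bmatrix} F_{\text{obs}} \\ \Delta F_{\text{obs}} \end{bmatrix} \tag{2}$$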


The state transition model reflects the fact that the force magnitude at the next step is the force magnitude at the current step plus the discrete difference at the current step. Thus the state-transition matrix can be written as (3):

$$\mathbf{A} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} \tag{3}$$


Because the observation vector and the state vector contain the same variables, the observation matrix is an identity matrix.
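A minimal sketch of such a filter is given below. It uses the state transition and observation model described in words above (force plus difference, identity observation). The noise covariances `Q` and `R`, the initial state and the initial covariance are inputs with no values taken from the paper, and the function name `kalman_fuse` is hypothetical; tuning the covariances is not shown.

```python
import numpy as np

A = np.array([[1.0, 1.0],    # next force = current force + current difference
              [0.0, 1.0]])   # difference carried over, perturbed by noise
H = np.eye(2)                # both state components are observed directly

def kalman_fuse(z_force, z_diff, Q, R, x0=None, P0=None):
    """Fuse per-frame force and force-difference predictions.

    z_force: spatial stream output (force), one value per frame
    z_diff:  temporal stream output (force difference), one value per frame
    Q, R:    2x2 process and observation noise covariances (assumed inputs)
    """
    if x0 is None:
        x = np.array([z_force[0], z_diff[0]], dtype=float)
    else:
        x = np.asarray(x0, dtype=float)
    P = np.eye(2) if P0 is None else np.asarray(P0, dtype=float)
    fused = []
    for z in zip(z_force, z_diff):
        # Predict step
        x = A @ x
        P = A @ P @ A.T + Q
        # Update step
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (np.asarray(z, dtype=float) - H @ x)
        P = (np.eye(2) - K @ H) @ P
        fused.append(x[0])
    return np.array(fused)
```

The relative sizes of the entries of `R` decide how much the fused estimate trusts the absolute force from the spatial stream versus the integrated difference from the temporal stream.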

The covariance matrix of the observation noise was composed from ensemble prediction analysis. Thus, since RMSE of spatial stream and RMSE of temporal stream are … and … accordingly, the initial covariance matrix of the observation had the following form (4):


The covariance matrix of the process noise was chosen to initially have the form of (5). This matrix reflects an assumption that the force on the next step can be calculated with high certainty using the force and discrete difference of force on the previous step. However, the evolution of the derivative of force cannot be predicted.


The Expectation-Maximization algorithm was applied to find better covariance matrices for the observation noise and the process noise. The final covariance matrix of the observation noise is (6):


And the final covariance matrix of the process noise is (7):


III Results

III-A Filtering

The time plot of filtered and unfiltered discrete difference is shown in Fig. 9.

Fig. 9: Filtered and not filtered discrete difference of ground truth

III-B Streams and fusion performance

The time plots of the ground truth values, the result of the overall algorithm and the single-stream results are shown in Fig. 10. The results of the temporal stream, the spatial stream and the overall algorithm were close for the first video. However, the cumulative sum of the temporal stream produced an invalid result for the second video because of a high accumulated error. The performance of each stream and of the overall algorithm for both videos is compared in Table II. The Kalman filter output had the lowest error and the highest r2 in both cases. On the first video, the temporal stream scored slightly worse than both the spatial stream and the overall algorithm, and the spatial stream scored slightly worse than the overall algorithm.

Stream | First video MSE | First video r2 | Second video MSE | Second video r2
Spatial stream | 7350.21 | 0.905 | 17597.73 | 0.878
Temporal stream | 8331.59 | 0.892 | 567813.82 | -2.931
Fused | 5795.25 | 0.925 | 15884.96 | 0.890
TABLE II: Each stream and overall results comparison
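The two scores in Table II can be computed as in the sketch below, which uses the standard definitions of mean squared error and the coefficient of determination. It also illustrates why r2 can be negative, as for the temporal stream on the second video: the score drops below zero whenever the prediction is worse than predicting the mean of the ground truth. The example numbers are synthetic.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Synthetic example: a prediction with a constant offset of 5 units.
y_true = np.array([0.0, 1.0, 2.0, 3.0])
y_pred = y_true + 5.0
offset_mse = mse(y_true, y_pred)   # 25.0
offset_r2 = r2(y_true, y_pred)     # 1 - 100 / 5 = -19.0
```

A constant offset of this kind is exactly what an integrated derivative estimate produces once drift has accumulated, which is consistent with the negative score reported for the cumulative sum of the temporal stream.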
(a) Output of spatial stream and ground truth data for the first video
(b) Output of spatial stream and ground truth data for the second video
(c) Cumulative sum of the output of temporal stream and ground truth data for the first video
(d) Cumulative sum of the output of temporal stream and ground truth data for the second video
(e) Output of Kalman filter and ground truth data for the first video
(f) Output of Kalman filter and ground truth data for the second video
Fig. 10: Single stream and overall algorithm outputs time plots

The temporal stream demonstrated reduced performance for measurements close to zero compared to other measurements in the second video (Fig. 11(d)). The spatial stream, however, showed no significant dependence of performance on the magnitude of the true value (Fig. 11(a), 11(b)). The filtered value did not show a notable dependence on the magnitude either (Fig. 11(e), 11(f)).

(a) Output of spatial stream and ground truth data for the first video
(b) Output of spatial stream and ground truth data for the second video
(c) Output of temporal stream and discrete difference of ground truth data for the first video
(d) Output of temporal stream and discrete difference of ground truth data for the second video
(e) Output of Kalman filter and ground truth data for the first video
(f) Output of Kalman filter and ground truth data for the second video
Fig. 11: Single stream and overall algorithm outputs comparison plots

IV Discussion and Conclusions

In this article, we proposed a two-stream algorithm for estimating gripping force from video. As expected, the temporal stream based on optical flow showed results almost as good as those of the overall method, which agrees with the conclusion of [15]. This is a very interesting result, as the temporal stream did not directly use the color information of the hand, only the information on the movements of the hand. This result may be due to the very low sensitivity of the method to lighting conditions. However, the method demonstrated reduced performance when the force changed slowly. A possible reason is that in this case the overall noise may be indistinguishable from the actual measurement. Another possible reason is the small amount of training data.

The precision of the spatial stream on its own, however, was lower than expected. The cause may be a high sensitivity of the method to lighting conditions. The delay between the impact and the actual change of the skin color may also be significant. It is also probable that the method is sensitive to uncompensated angular movements of the hand.

Combining the two streams, even with a straightforward Kalman filter, improved on the result of each separate stream. As expected, it also made the method more robust and less sensitive to the current output value.

However, the proposed method in its current implementation has several significant limitations. First, it was tested on a limited amount of data: 200 seconds of video in total were used to train and test the algorithm. A more extensive dataset with a wide variety of people and test object shapes should be used in the future for training and testing the method. Second, a relatively weak model was used for predictions. In future work, we plan to make the model more robust to changes in lighting conditions, changes in hand orientation, and camera and hand movements. Hand keypoint recognition methods [19], [20] for estimating and canceling rotation, translation and scale will also be considered.

Replacing the naive CV method for skin segmentation with CNN-based approaches [9] may also improve the results. Other adaptive [21] and more sophisticated [22] methods may also be used for this purpose. The use of these methods may increase the robustness of the system in real-world applications.

Overall, a noninvasive approach to measuring the grasp position and force from video is presented. The performance of the algorithm reaches RMSE < 10 N, which is sufficient for learning the forces of actions in automated smartphone disassembly and other robotic applications of similar scale. Applying such an algorithm to the videos available online can produce a large dataset of actions annotated not only with spatial trajectories but also with the required characteristics of the applied forces, which can be naturally executed by modern robots.


  • [1] Charissa Rujaevich, Joe Lessard, Sarah Chandler, Sean Shannon, Jeffrey Dahmus, and Rob Guzzo, “Liam - An Innovation Story.” [Online]. Available: _white_paper_Sept2016.pdf
  • [2] S. Mascaro and H. H. Asada, “Finger posture and shear force measurement using fingernail sensors: Initial experimentation,” in Robotics and Automation, 2001. Proceedings 2001 ICRA. IEEE International Conference on, vol. 2.   IEEE, 2001, pp. 1857–1862.
  • [3] S. A. Mascaro and H. H. Asada, “Measurement of finger posture and three-axis fingertip touch force using fingernail sensors,” IEEE Transactions on Robotics and Automation, vol. 20, no. 1, pp. 26–35, 2004.
  • [4] Y. Sun, J. M. Hollerbach, and S. A. Mascaro, “Estimation of fingertip force direction with computer vision,” IEEE Transactions on Robotics, vol. 25, no. 6, pp. 1356–1369, 2009.
  • [5] T.-H. Pham, N. Kyriazis, A. A. Argyros, and A. Kheddar, “Hand-object contact force estimation from markerless visual tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
  • [6] Y. Sun, J. M. Hollerbach, and S. A. Mascaro, “Predicting fingertip forces by imaging coloration changes in the fingernail and surrounding skin,” IEEE Transactions on Biomedical Engineering, vol. 55, no. 10, pp. 2363–2371, 2008.
  • [7] S. Urban, J. Bayer, C. Osendorfer, G. Westling, B. B. Edin, and P. Van Der Smagt, “Computing grip force and torque from finger nail images using gaussian processes,” in Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on.   IEEE, 2013, pp. 4034–4039.
  • [8] A. Erol, G. Bebis, M. Nicolescu, R. D. Boyle, and X. Twombly, “Vision-based hand pose estimation: A review,” Computer Vision and Image Understanding, vol. 108, no. 1-2, pp. 52–73, 2007.
  • [9] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convolutional pose machines,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4724–4732.
  • [10] Y. Maeda, N. Ishido, H. Kikuchi, and T. Arai, “Teaching of grasp/graspless manipulation for industrial robots by human demonstration,” in Intelligent Robots and Systems, 2002. IEEE/RSJ International Conference on, vol. 2.   IEEE, 2002, pp. 1523–1528.
  • [11] L. Zhao, X. Li, P. Liang, C. Yang, and R. Li, “Intuitive robot teaching by hand guided demonstration,” in Mechatronics and Automation (ICMA), 2016 IEEE International Conference on.   IEEE, 2016, pp. 1578–1583.
  • [12] S. S. Rautaray and A. Agrawal, “Vision based hand gesture recognition for human computer interaction: a survey,” Artificial Intelligence Review, vol. 43, no. 1, pp. 1–54, 2015.
  • [13] N. Chen, S. Urban, C. Osendorfer, J. Bayer, and P. Van Der Smagt, “Estimating finger grip force from an image of the hand using convolutional neural networks and gaussian processes,” in Robotics and Automation (ICRA), 2014 IEEE International Conference on.   IEEE, 2014, pp. 3137–3142.
  • [14] M. A. Goodale and A. D. Milner, “Separate visual pathways for perception and action,” Trends in neurosciences, vol. 15, no. 1, pp. 20–25, 1992.
  • [15] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in neural information processing systems, 2014, pp. 568–576.
  • [16] J. Adarve and R. Mahony, “A filter formulation for computing real time optical flow,” Robotics and Automation Letters, 2016.
  • [17] D. Pathak, R. Girshick, P. Dollár, T. Darrell, and B. Hariharan, “Learning features by watching objects move,” in Proc. CVPR, vol. 2, 2017.
  • [18] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter, “Efficient and robust automated machine learning,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds.   Curran Associates, Inc., 2015, pp. 2962–2970. [Online]. Available:
  • [19] C. Zimmermann and T. Brox, “Learning to estimate 3d hand pose from single rgb images,” in IEEE International Conference on Computer Vision (ICCV), 2017, [Online]. Available:
  • [20] T. Simon, H. Joo, I. Matthews, and Y. Sheikh, “Hand keypoint detection in single images using multiview bootstrapping,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, 2017.
  • [21] F. Dadgostar and A. Sarrafzadeh, “An adaptive real-time skin detector based on hue thresholding: A comparison on two motion tracking methods,” Pattern recognition letters, vol. 27, no. 12, pp. 1342–1352, 2006.
  • [22] H. Zuo, H. Fan, E. Blasch, and H. Ling, “Combining convolutional and recurrent neural networks for human skin detection,” IEEE Signal Processing Letters, vol. 24, no. 3, pp. 289–293, 2017.