Among the five human senses, touch is an important perceptual modality for understanding the relationship between humans and their surroundings. It offers complementary information for understanding the surrounding environment. From this viewpoint, touch, or tactile sensing, has been an attractive topic in the fields of robotics and haptic sensing for many years. The main physical property for grasping and interacting with objects is the interaction force. Specifically, when a robotic hand attempts to grasp an object, a contact-type haptic sensor is used to measure the interaction force between the device and the object; this improves the grip success rate and enables precise hand manipulations. In the case of a person, the visual information sensed by the eyes is utilized in addition to tactile sensing when grasping. Through visual information, we perceive the shape, appearance, and texture of objects and recall the tactile memory learned through past experiences before touching them. From the viewpoint of neuroscience and psychophysics, Ernst and Banks investigated how information is shared between vision and tactile sensing. Newell et al. showed that the human brain employs shared models of objects across multiple sensory modalities, e.g., vision and tactile sensing, so that knowledge can be transferred from one to another.
Inspired by the knowledge transfer from vision to tactile sensing, we propose a vision sensor-based method that simulates tactile sensing, which is a different modality, by using only visual sensing information. When humans try to touch an object, they can recall the feeling of the object through prior experience before touching it. Specifically, if we know what an object is, and we can observe how the appearance of the object changes when it is pressed by a finger, we can predict the interaction force between the object and the finger from prior experience. Another advantage of the proposed method is that, compared with contact-type haptic sensors, a noncontact-type sensing method can measure the haptic force consistently because the camera sensor does not wear out even when it is used for a long time. Moreover, as an additional touch sensor does not need to be attached to the instrument, the mechanism of the instrument can be miniaturized. In this paper, our computational approach is based on learning haptic information from previous human experiences. It involves two pivotal tasks: (1) to recognize what an object is from only images and (2) to predict what kind of interaction force is exerted on the object by using sequential images. For this purpose, we collected more than 300,000 images under various conditions and from different objects, and the corresponding databases were used for training and validating the proposed method.
To predict the haptic information from images, the basic deep learning architecture is built on a convolutional neural network (CNN)-based recurrent neural network (RNN), as shown in Fig. 1 (a). Similar to human perception processes, we first used a CNN to analyze the target object types and their appearance changes from images, and then fed the per-frame features over time to an RNN, which uses their temporal changes to estimate the interaction force. When composing the network, we expect that the attention mechanism, which focuses only on the important regions in the image and has been used for visual question answering (VQA), helps to improve the accuracy of the force prediction. The main difference between the proposed method and previous attention networks is that we used a temporal dynamics-based attention method that exploits sequential images for predicting the interaction forces.
In this paper, we propose a sequential image-based attention method consisting of a sequential spatial attention module (SSAM) and a sequential channel attention module (SCAM) for gaining better accuracy. By developing the attention module based on the sequential images independently of the RNN, as shown in Fig. 1 (b), the region to concentrate on can be inferred clearly for predicting the haptic force based on the shape changes of a target object. Moreover, although we used both spatial and channel attention modules as in prior work, the proposed attention modules were modified by using spatial pixel-wise weighted average pooling (WAP) and channel-wise WAP. Unlike Woo et al., we trained the SSAM and SCAM independently and finally merged them. The spatial and channel information play different roles, and the two modules are not easily trained under a unified framework for predicting the haptic forces from sequential images.
The main contributions of this paper are as follows: (1) We propose a computational method for predicting haptic information by using a vision camera instead of a haptic sensor. (2) We collected a large number of sequential images and their corresponding force information with an automatic mechanism under various conditions. (3) We propose a deep learning method based on sequential image-based attention modules for predicting the force accurately.
2 Related Work
Studies have also been conducted to measure interaction forces without a force sensor. Aviles et al. used a stereo camera to reconstruct a 3D artificial heart surface and applied a supervised learning method to predict the applied force. Zhu et al. proposed a video-based method for estimating the interaction force between a human body and an object by using 3D modeling information. Pham et al. used a single RGB-D camera to estimate the contact forces between a human hand and an object; the method makes use of only visual information, given geometrical and physical object properties. Fermüller et al. proposed a deep learning-based hand action prediction method that uses only visual information; it predicts the force of the fingertips by using the proposed networks. Hwang and Lim focused more specifically on how to predict the interaction force from visual changes of the target objects by using an RNN. Their work is the first to focus on interaction force prediction from only images without any additional sensor. However, their RNN-based method does not have deeper layers for effectively training on all the variations of the visual changes, such as illumination and pose changes. To overcome this issue, we employed the basic framework of the CNN-based RNN method, in which the CNN first analyzes the salient visual feature variations by using the proposed sequential image-based attention module, and then the RNN works on the serialized features to predict the final interaction force. As a result, our proposed method attains more robust accuracy under common image variations, such as different objects, illumination changes, and camera pose variations.
3 Proposed Method
In this section, we propose the dynamic attention module designed for modeling the interaction between objects by using multiple images. As shown in Fig. 1 (a), we adopted the CNN-based RNN module as the baseline for analyzing sequential images; here, the CNN first extracts the visual features of each frame, and the extracted features are passed to the RNN to predict the interaction forces from complex temporal dynamics. The sequential attention module is used to focus on salient regions and to consider temporal dynamic information simultaneously, as illustrated in Fig. 1 (b).
3.1 Baseline: CNN-based RNN method for sequential image description
Visual Feature Extraction
A CNN is an indispensable tool for the representation of images. In the case of sequential data, each frame is represented by its corresponding CNN feature. Each image frame is passed to the visual extractor as input, and the CNN then generates a fixed-length visual feature vector representation. To confirm the feasibility of our model, we adopted a variant of the VGG model, which is a common deep CNN architecture, as the encoder. We extracted feature maps from the last pooling layer. The features of each frame are considered as one chunk for one input step of the RNN. The resulting frame-level vector was fed into our long short-term memory (LSTM) architecture.
LSTM Sequential Model
Given the frame-level feature vectors of the sequential frames, we used them as the input to an LSTM, which has been proven to achieve great performance in many sequential problems. Thus, to extract sequential features, we applied an LSTM comprising self-recurrent units and a memory cell, which can store information from dozens of timesteps in the past. We adopted the bidirectional LSTM (BLSTM), which is derived from the LSTM; it considers all available information in the past and future. As the BLSTM processes the inputs in two ways, i.e., one from past to future and one from future to past, two hidden-state outputs exist. We combined these two outputs at the last timestep and then propagated them to a fully connected layer.
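The bidirectional pass described above can be sketched in plain NumPy. This is only an illustrative sketch under several assumptions that the text does not fix: the toy sizes, the random initialisation, the gate ordering, the interpretation of "combined" as concatenating the final hidden state of each direction, and the ReLU in the fully connected layer are all hypothetical choices.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_lstm(input_dim, hidden_dim, rng):
    """Random parameters for one LSTM direction (illustrative initialisation)."""
    return {
        "W": rng.normal(0.0, 0.1, (4 * hidden_dim, input_dim)),   # input weights
        "U": rng.normal(0.0, 0.1, (4 * hidden_dim, hidden_dim)),  # recurrent weights
        "b": np.zeros(4 * hidden_dim),                            # gate biases
    }


def lstm_final_state(features, params):
    """Run one LSTM direction over features (T, D); return the last hidden state."""
    hidden_dim = params["U"].shape[1]
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    for x in features:
        gates = params["W"] @ x + params["U"] @ h + params["b"]
        i, f, o, g = np.split(gates, 4)  # input, forget, output, candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h


def blstm_features(features, forward, backward):
    """Concatenate the final hidden states of both directions (assumed fusion)."""
    h_forward = lstm_final_state(features, forward)          # past -> future
    h_backward = lstm_final_state(features[::-1], backward)  # future -> past
    return np.concatenate([h_forward, h_backward])


rng = np.random.default_rng(0)
T, D, H, FC = 20, 32, 16, 64  # toy sizes, not the sizes used in the paper
frame_features = rng.normal(size=(T, D))  # stand-in for per-frame CNN features
forward = init_lstm(D, H, rng)
backward = init_lstm(D, H, rng)
combined = blstm_features(frame_features, forward, backward)

# Fully connected layer (ReLU is an assumption) followed by a linear regressor.
w_fc = rng.normal(0.0, 0.1, (FC, 2 * H))
w_out = rng.normal(0.0, 0.1, FC)
force = float(w_out @ np.maximum(0.0, w_fc @ combined))
```

The backward direction simply runs the same recurrence over the reversed frame order, so the combined vector summarises the sequence from both ends before the regression step.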
3.2 Weighted Average Pooling (WAP)
Recent works have used global average pooling for calculating the spatial average of the convolutional feature map; this type of pooling helps achieve better accuracy in visual recognition. Specifically, global average pooling is an efficient way to encode a set of convolutional feature maps into a vector of limited size. Therefore, many attention methods employ global average pooling to extract the feature vector used for predicting the attention regions. However, this approach uses equal-weight average pooling to reduce the dimensionality because of its simplicity and efficiency. In this paper, we question this simple assumption and propose the WAP method, which can be implemented with a 1×1 convolutional layer for both spatial and channel attention and whose weights are learned during training. Moreover, in this study, we exploited multiple images for developing the temporal dynamics-based attention mechanism for CNN feature extraction. Compared with a single image-based method, the convolutional feature map of the multiple image-based method is larger and generally contains more redundant information. Therefore, the proposed WAP encourages the network to emphasize more discriminative information.
As shown in Fig. 2 (a), to average the channel information by using different weights, the convolutional feature tensor is split into its individual channel maps. We calculated the weighted average by multiplying each element of the weight vector by the corresponding spatial map and summing the results. In practice, we simply implement this by applying a 1×1 convolutional operation. Furthermore, a similar approach can be used for averaging the spatial information with different weights, as shown in Fig. 2 (b). In this case, we reshaped the tensors of the convolutional feature maps to a flat shape and applied a 1×1 convolution to obtain different weight values for the spatial regions.
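As a minimal sketch of the two pooling variants, the NumPy snippet below writes the 1×1 convolutions as explicit weighted sums. The function names, the toy feature size, and the absence of a bias term are assumptions made for illustration. With uniform weights both functions reduce to ordinary average pooling, which shows that equal-weight pooling is a special case of WAP; learned weights need not sum to one.

```python
import numpy as np


def channel_wap(feature, channel_weights):
    """Weighted sum over channels: (C, H, W) -> (H, W).

    Equivalent to a 1x1 convolution with one output channel and no bias.
    """
    return np.tensordot(channel_weights, feature, axes=([0], [0]))


def spatial_wap(feature, position_weights):
    """Weighted sum over spatial positions: (C, H, W) -> (C,).

    The map is flattened to (C, H*W) and each position gets its own weight.
    """
    channels = feature.shape[0]
    flat = feature.reshape(channels, -1)
    return flat @ position_weights


rng = np.random.default_rng(0)
feature = rng.normal(size=(8, 4, 4))  # toy feature map: C=8, H=W=4

# Uniform weights recover plain (equal-weight) average pooling.
uniform_channels = np.full(8, 1.0 / 8)
uniform_positions = np.full(16, 1.0 / 16)
spatial_map = channel_wap(feature, uniform_channels)
channel_vector = spatial_wap(feature, uniform_positions)
```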
3.3 Sequential Spatial Attention Module (SSAM)
In general, an interaction between objects occurs in the region that is touched; therefore, applying a global image feature may lead to a suboptimal result because of the irrelevant regions. To solve this problem, spatial-attention mechanisms have been proposed in many previous works. Such a mechanism focuses on the key regions of information in an image by excluding less important regions, leading to performance improvement. However, most of the previous works are built on the assumption that only a single frame is used. As the purpose of this work is to predict the interaction force between objects in sequential images, considering the dynamic information across frames is also important. Therefore, instead of extracting an attention map from a single frame only, our attention module exploits multiple adjacent frames to produce an accurate attention map by considering dynamic information. The overall procedure is illustrated in Fig. 3 (a).
We represent the convolutional feature of the $t$-th frame as $F_t \in \mathbb{R}^{C \times H \times W}$. The set of convolutional features of the consecutive frames at time $t$ is denoted by $\mathcal{F}_t = \{F_{t-N}, \dots, F_t\}$. The overall process can be summarized as follows:

$$F'_t = M_s(\mathcal{F}_t) \otimes F_t,$$

where $\otimes$ denotes the element-wise multiplication operation, $M_s(\mathcal{F}_t)$ represents the sequential spatial attention map, and $F'_t$ is the final refined feature map.

$$P_t = W_p * [F_{t-N}; \dots; F_t],$$

where $*$ denotes the convolution operation and $[F_{t-N}; \dots; F_t]$ represents the concatenated convolutional features from the $(t-N)$-th image to the $t$-th image. To squeeze the concatenated feature map by using the proposed WAP, we used a 1×1 convolution kernel $W_p$, generating the projection tensor $P_t \in \mathbb{R}^{H \times W}$. Each element $P_t(i, j)$ represents a linear combination of all channels at spatial location $(i, j)$. Next, to generate an attention map, the projected map $P_t$ passes through a convolution layer and the sigmoid function is applied as follows:

$$M_s(\mathcal{F}_t) = \sigma(W_c * P_t + b),$$

where $\sigma$ is the sigmoid function, $W_c$ represents the convolution filter, and $b$ is the bias parameter.
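The spatial attention steps described above (concatenate adjacent frames, squeeze the channels with a 1×1 projection, apply a convolution and a sigmoid, then rescale the current frame) can be sketched in NumPy as follows. This is a hedged illustration rather than the authors' implementation: the 3×3 kernel size, the use of two frames, the random weights, and all function names are assumptions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def conv2d_same(image, kernel, bias=0.0):
    """Single-channel 2-D cross-correlation with zero padding (odd kernel sizes)."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.empty_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel) + bias
    return out


def ssam(frames, projection_weights, kernel, bias):
    """Sequential spatial attention over a list of (C, H, W) feature maps.

    The last entry of `frames` is treated as the current frame.
    """
    stacked = np.concatenate(frames, axis=0)               # (K, H, W), K = len(frames) * C
    projected = np.tensordot(projection_weights, stacked,  # channel-wise WAP (1x1 conv)
                             axes=([0], [0]))              # -> (H, W)
    attention = sigmoid(conv2d_same(projected, kernel, bias))
    refined = frames[-1] * attention[None, :, :]           # broadcast over channels
    return refined, attention


rng = np.random.default_rng(0)
previous_frame = rng.normal(size=(4, 6, 6))  # toy features: C=4, H=W=6
current_frame = rng.normal(size=(4, 6, 6))
projection_weights = rng.normal(size=8)      # one weight per concatenated channel
kernel = rng.normal(size=(3, 3))             # assumed 3x3 attention filter

refined, attention = ssam([previous_frame, current_frame],
                          projection_weights, kernel, bias=0.0)
```

Because the attention map has a single channel, every channel of the current frame is scaled by the same per-pixel weight between 0 and 1.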
3.4 Sequential Channel Attention Module (SCAM)
Similar to the SSAM, the proposed SCAM also generates salient features by exploiting the channel information of adjacent frames. As the amount of channel information increases because of the multiple images, redundant channel information also increases. In this case, as pointed out in prior work, non-salient channel information causes the problem of distraction. To overcome this issue, we adopted the self-gating attention module based on channel dependence together with the proposed WAP method. Fig. 3 (b) describes the overall block architecture of the SCAM.
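The set of visual features of the sequential frames, $\mathcal{F}_t = \{F_{t-N}, \dots, F_t\}$, is given as input:

$$F''_t = M_c(\mathcal{F}_t) \otimes F_t,$$

where $M_c(\mathcal{F}_t)$ represents the sequential channel attention map, and $F''_t$ is the final refined feature map.

To squeeze the concatenated feature map along the channel axis, we used a 1×1 convolution kernel after reshaping to obtain the squeezed vector $z_t$. Each element $z_t(c)$ represents the linear combination of all spatial positions in channel $c$. Next, the output passes through two MLP layers to provide nonlinear dependencies, and then the sigmoid function is applied as follows:

$$M_c(\mathcal{F}_t) = \sigma(\mathrm{MLP}(z_t)),$$

where $\sigma$ is the sigmoid function, $W_1$ and $W_2$ are the parameter weights of the two-layer multilayer perceptron $\mathrm{MLP}$, and $r$ is the reduction ratio that sets the width of its hidden layer.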
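A comparable NumPy sketch of the channel attention steps is given below: the concatenated features are squeezed over the spatial positions with learned weights, passed through a two-layer MLP with a reduction ratio, and turned into per-channel gates by a sigmoid. The ReLU between the two layers, the choice to output one gate per channel of the current frame, the toy sizes, and all names are assumptions made for illustration only.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def scam(frames, position_weights, w1, w2):
    """Sequential channel attention over a list of (C, H, W) feature maps.

    The last entry of `frames` is treated as the current frame.
    """
    stacked = np.concatenate(frames, axis=0)                # (K, H, W), K = len(frames) * C
    k = stacked.shape[0]
    squeezed = stacked.reshape(k, -1) @ position_weights    # spatial WAP -> (K,)
    hidden = np.maximum(0.0, w1 @ squeezed)                 # (K // r,), ReLU assumed
    attention = sigmoid(w2 @ hidden)                        # (C,), one gate per channel
    refined = frames[-1] * attention[:, None, None]         # broadcast over H and W
    return refined, attention


rng = np.random.default_rng(0)
C, H, W, r = 4, 6, 6, 2                      # toy sizes and reduction ratio
previous_frame = rng.normal(size=(C, H, W))
current_frame = rng.normal(size=(C, H, W))
K = 2 * C                                    # channels after concatenating two frames

position_weights = rng.normal(size=H * W)    # one weight per spatial position
w1 = rng.normal(size=(K // r, K))            # first MLP layer (reduces by r)
w2 = rng.normal(size=(C, K // r))            # second MLP layer (one output per channel)

refined, attention = scam([previous_frame, current_frame], position_weights, w1, w2)
```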
3.5 Ensemble Module
Ensemble networks have shown better accuracy in many applications. To combine individual attention networks, Woo et al. designed serialized spatial and channel-wise attention modules under a single network. In this study, however, we trained the SSAM and SCAM independently and then averaged the two individual results based on the late fusion rule. One of the main reasons for merging by late fusion is that the two proposed attention mechanisms play different roles and focus on different characteristics for inferring the forces. Specifically, the SSAM focuses on specific spatial regions in the images, while the SCAM is responsible for evaluating which channels of the convolution layer are important. Learning two attention methods with such different characteristics under a single network is not an easy task. Moreover, we used multiple images to learn more temporal dynamics for better performance. The amount of information to be judged by the proposed method is therefore larger than that in a single image-based attention method, and training the SSAM and SCAM individually is a better choice for learning their individual purposes efficiently.
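The late fusion rule amounts to averaging the force predictions of the two independently trained networks. The short sketch below assumes equal weights for the two branches; the function name and the example values are hypothetical.

```python
import numpy as np


def late_fusion(ssam_forces, scam_forces):
    """Average the per-sample force predictions of the two attention networks."""
    ssam_forces = np.asarray(ssam_forces, dtype=float)
    scam_forces = np.asarray(scam_forces, dtype=float)
    if ssam_forces.shape != scam_forces.shape:
        raise ValueError("both networks must predict the same number of samples")
    return (ssam_forces + scam_forces) / 2.0


# Hypothetical predictions for three frames.
fused = late_fusion([1.0, 2.0, 4.0], [3.0, 2.0, 5.0])
```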
4 Dataset and Implementation
4.1 Experimental Setup and Database
To build a fair experimental training and validation protocol, we built a data-collecting system consisting of a motorized probe system, and captured images during the interaction between the probe and the object while recording the interaction forces. Fig. 4 (c) shows a schematic description of this equipment setting. We used an RC servo motor and a cam structure, which are attached to the translation stage to generate translational movement. The end of the tool mounted on the motor moved up and down automatically to apply force to the object. We measured the interaction force between the tool tip and the object through a load cell (model BCL-1L, CAS). We captured the images by using a 149-Hz camera (Chameleon3, CM3-U3-13Y3C-CS, Point Grey) and stored the collected information as 1280×1024×3 (RGB) data. We carefully synchronized the collected images with the interaction forces. During the interaction between the tool tip and the object, the magnitude and duration of the pressing force were varied randomly to collect data on various force magnitudes and durations.
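Table 1. Training and test sets for the four objects: number of image sets (number of images).
|Material|Training set|Test set|
|---|---|---|
|Sponge|144 sets (71,350 images)|36 sets (17,729 images)|
|Papercup|144 sets (71,481 images)|36 sets (18,070 images)|
|Stapler|144 sets (72,325 images)|36 sets (18,129 images)|
|Tube|144 sets (72,253 images)|36 sets (18,076 images)|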
To infer the interaction force from images, we selected four objects made of different materials, as shown in Figs. 4 (a) and (b). Each object has a different material and rigidity. In this paper, we collected images of a sponge, a paper cup, a tube, and a stapler. The visual images and their corresponding interaction forces were collected in synchronization. To vary the environment around the objects, each object was recorded under four pressing angles and three levels of light intensity (350, 550, and 750 lux), as shown in Figs. 4 (a) and (b). One image set contains four contacts with the material, and a total of 15 sets were collected for each environment. To build the training and test protocols, we collected approximately 360,000 sequential images (= 15 sets × 500 images × 4 objects × 3 lights × 4 angles) by using an RGB camera, and the corresponding interaction forces were captured through the load cell along a single direction. We selected three test sets from each environment of each material (36 sets per material), and the other sets were used for training the deep learning models. Table 1 summarizes the detailed information about the training and test sets for the four objects.
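The dataset bookkeeping can be checked with a few lines of arithmetic. The snippet below reproduces the nominal image count from the collection protocol and sums the per-object counts reported in Table 1; the variable names are illustrative, and 500 images per set is the approximate figure quoted in the text.

```python
# Collection protocol as stated in the text.
sets_per_environment = 15
images_per_set = 500          # approximate
objects = 4
light_levels = 3
pressing_angles = 4

environments = light_levels * pressing_angles                   # 12 per object
nominal_images = (sets_per_environment * images_per_set
                  * objects * environments)                      # 360,000
sets_per_object = sets_per_environment * environments           # 180
test_sets_per_object = 3 * environments                          # 36
train_sets_per_object = sets_per_object - test_sets_per_object  # 144

# Image counts per object from Table 1: (training, test).
table_1 = {
    "sponge": (71_350, 17_729),
    "papercup": (71_481, 18_070),
    "stapler": (72_325, 18_129),
    "tube": (72_253, 18_076),
}
train_images = sum(train for train, _ in table_1.values())
test_images = sum(test for _, test in table_1.values())
total_images = train_images + test_images
```

The nominal count (360,000) is slightly above the actual total because the sets do not contain exactly 500 images each.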
4.2 Implementation Detail
Table 2. Convolutional layers of the VGG-variant feature extractor.
|Layer|Operation|Output channels|
|---|---|---|
|conv 1/1|3x3 conv|16|
|conv 1/2|3x3 conv|16|
|conv 2/1|3x3 conv|32|
|conv 2/2|3x3 conv|32|
|conv 3/1|3x3 conv|64|
|conv 3/2|3x3 conv|64|
|conv 4/1|3x3 conv|128|
|conv 4/2|3x3 conv|128|
|conv 5/1|3x3 conv|256|
|conv 5/2|3x3 conv|256|
We learned the network weights through mini-batch stochastic gradient descent with the Adam optimizer for 120 epochs. The initial learning rate was 1e-4 and was multiplied by 1/10 every 30 epochs. At each iteration, a mini-batch of 64 samples was constructed by sampling 20 sequential training frames, and from each frame, an object was randomly selected. The image was then cropped and resized to a gray-scale 128×128-pixel image. In the experiment, a variant of the VGG network was used as the baseline to extract visual features. As described in Table 2, the network is composed of 10 layers and outputs 256-channel feature vectors. We also experimented with an 18-layer ResNet to verify that our proposed model works well with other CNNs. To exploit temporal dynamics, we used the BLSTM network with 256 hidden units and 20 timesteps. The concatenated feature of the last hidden units was fed to a fully connected layer with 1024 units. Finally, to predict the 1-dimensional interaction force, we adopted a linear-regression model. We trained all models from scratch and measured the performance by using the root mean squared error (RMSE) and mean absolute error (MAE). In this paper, we used the MAE as the standard measurement for performance comparisons.
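As a small illustration of the training schedule and the two error measures, the sketch below implements a step decay that starts at 1e-4 and is multiplied by 1/10 every 30 epochs, together with RMSE and MAE. The function names and the zero-based epoch index are assumptions; the example forces are hypothetical.

```python
import numpy as np


def learning_rate(epoch, base_lr=1e-4, decay=0.1, step=30):
    """Step decay: multiply the learning rate by `decay` every `step` epochs.

    `epoch` is assumed to be zero-based (0 .. 119 for 120 epochs).
    """
    return base_lr * decay ** (epoch // step)


def rmse(predicted, target):
    """Root mean squared error between predicted and measured forces."""
    predicted = np.asarray(predicted, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean((predicted - target) ** 2)))


def mae(predicted, target):
    """Mean absolute error between predicted and measured forces."""
    predicted = np.asarray(predicted, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean(np.abs(predicted - target)))


# Hypothetical predictions and load-cell measurements.
predicted_forces = [1.0, 2.0, 3.0]
measured_forces = [1.0, 2.0, 5.0]
example_rmse = rmse(predicted_forces, measured_forces)
example_mae = mae(predicted_forces, measured_forces)
```

RMSE penalises large individual errors more strongly than MAE, which is one reason the two measures can rank methods differently.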
5 Experimental Results and Discussion
5.1 Experimental Results on Proposed Sequential Attention Module
Table 3 shows experimentally that the spatial and channel attention methods improve the performance of the baseline, the CNN-based LSTM, by more than 9% for predicting the interaction forces from images. The channel attention method (SCAM) is always better than the spatial attention method (SSAM) in this paper because, in the higher layers of the CNN, the high-level features are found in the channel maps rather than in the spatial maps. Moreover, the proposed ensemble method, which merges the results of the spatial and channel attention methods, leads to more than 27% improvement over each individual attention method. Compared with the single frame-based method (e.g., 0.034 MAE), the proposed multi-frame method always shows better results (e.g., 0.032 MAE for the ensemble). This means that the attention map for inferring forces can be generated effectively by exploiting the temporal dynamics of the target object. A quantitative evaluation was conducted to find the optimal multi-frame bounds. Fig. 5 shows that the best performance was achieved by using only the previous frame.
We empirically verified that our proposed pooling method is effective for squeezing the information of sequential frames. We compared two methods of averaging the feature maps: our weighted average pooling and global average pooling. From Table 4, we conclude that the proposed method is superior in handling the concatenated sequential information for predicting the forces.
5.2 Experimental Results on Different Network Architectures
To validate the generality of our method, we applied our model to ResNet, one of the well-known deep learning architectures. Table 5 shows the comparative results between the VGG-like network and ResNet, which indicate that the proposed method works successfully regardless of the architecture type used. For example, the ResNet-based method also achieves a 30% better MAE than the baseline.
5.3 Comparative Evaluation with Well-Known Methods
We conducted a comparative analysis with other well-known attention methods. In Table 6, we provide a summary of the comparative evaluation results for inferring the interaction forces on our dataset, obtained by our proposed attention module and recent state-of-the-art techniques, including approaches based on the attention mechanism. The proposed method outperforms the previous works. Note that these previous works are not designed to build the attention map from sequential images, which results in their performance degradation.
5.4 Performance Analysis According to Force Intensity Changes
To better understand why the proposed method improves on the baseline method, we divide the force magnitude into 11 bins, each of which spans a force interval, as shown in Fig. 6. We measured the MAE for each force interval to examine how the other methods, e.g., the single frame-based attention method and the proposed method, improve on the baseline method. From Fig. 6, we can confirm once again that generating the attention by using sequential images improves performance in most of the force intervals. In particular, in the relatively strong force intervals, the proposed method achieves on average a 16% better improvement than the single frame-based attention model. Since the shape changes of the target object become large when the external force is strong, the proposed method effectively makes use of the pixel differences between the sequential images for generating the attention maps. In the intervals where the tip of the tool begins to touch the target object, the image differences are also large, which helps to make better attention maps. On the other hand, in the intervals where the external force is constantly applied to the object, the shape changes of the target are relatively small. As a result, the attention map is not generated precisely, and the performance is slightly worse than that of the baseline method. Even so, the proposed method performs better than the single frame-based attention method.
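The per-interval analysis can be sketched as a binned MAE computation. The snippet below assumes equal-width bins over a given force range, which is only one possible choice since the exact interval boundaries are not reproduced here; the function name and the example values are hypothetical.

```python
import numpy as np


def binned_mae(predicted, target, n_bins=11, f_min=None, f_max=None):
    """MAE per force interval, with bins defined on the measured (target) force.

    Returns the bin edges and an array of length n_bins; empty bins are NaN.
    """
    predicted = np.asarray(predicted, dtype=float)
    target = np.asarray(target, dtype=float)
    low = target.min() if f_min is None else f_min
    high = target.max() if f_max is None else f_max
    edges = np.linspace(low, high, n_bins + 1)
    # digitize returns 1-based bin indices; clip so the maximum falls in the last bin.
    index = np.clip(np.digitize(target, edges) - 1, 0, n_bins - 1)
    errors = np.full(n_bins, np.nan)
    for b in range(n_bins):
        mask = index == b
        if mask.any():
            errors[b] = np.abs(predicted[mask] - target[mask]).mean()
    return edges, errors


# Hypothetical forces on a 0..11 range, so each bin is one unit wide.
measured = [0.0, 0.5, 10.0]
predicted = [0.1, 0.5, 9.0]
edges, per_bin_mae = binned_mae(predicted, measured, n_bins=11, f_min=0.0, f_max=11.0)
```

Comparing such per-bin errors between two models shows in which force ranges one model gains over the other, rather than only on average.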
5.5 Performance Analysis on Various Materials
Fig. 7 shows that the proposed method successfully predicts the interaction forces from only images even when the interaction forces change randomly. This good performance is observed regardless of which object is used in the experiments. In more detail, Fig. 8 shows how the proposed method is better than the baseline method when the external force reaches its peak points. The baseline method estimates the peak point of the interaction force well at first, but its predictions are not stable, whereas the predictions of the proposed method are both closer to the ground truth and more stable. In this respect, we conclude that the temporal dynamics are useful for generating the attention map in the CNN, even though the LSTM also analyzes the temporal information.
Table 7 describes the performance improvements for the different target objects, and Fig. 9 illustrates the spatial attention maps generated by the proposed method. The sponge is an object with good elasticity. Compared with the other objects, the shape change of the sponge under an external force is the most apparent, which leads to good results. The proposed method shows the best result on the papercup because its complex surface textures provide rich visual information. For that reason, it has a high estimation accuracy compared with the other rigid object. As shown in the second row of Fig. 9, the network focuses mainly on the top and bottom textures of the papercup, which are the parts changed most significantly by the external forces. The tube is composed of plastic rubber. It is relatively softer than the others, and the surface change is not clearly observed when the touch starts. For this reason, the proposed method shows a slightly lower improvement on the tube. In the case of the stapler, because it is made of solid materials, the shape change pattern is always constant when the external force is applied. In this respect, the temporal dynamics play a pivotal role in predicting the interaction forces, and we can confirm this through the experimental results in Table 7. The improvements of the single image-based attention method and the proposed method are 129% and 141%, respectively. Compared with the other objects, this gain of 12 percentage points is unique and significant.
6 Conclusion
For predicting the interaction force from images, we have presented a sequential image-based attention module that learns a salient model from temporal dynamics. We also proposed a weighted average pooling layer for both the spatial and channel attention modules, and the final result is obtained from the ensemble of these modules. To verify our method, we collected 359,413 images and the corresponding interaction forces by using an electric motor-based device. Extensive experiments show the effectiveness of our method, which achieves better performance than well-known single frame-based methods. We observed that our proposed method encourages the network to concentrate on the interaction region and thus to infer the interaction forces successfully. Based on this result, we hope that our proposed method serves as a good starting point for research on predicting force by using a single vision sensor.
References
-  J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Neural module networks. 2016.
-  A. Aviles, S. Alsaleh, J. Hahn, and A. Casals. Towards retrieving force feedback in robotic-assisted surgery: A supervised neuro-recurrent-vision approach. IEEE Transactions on Haptics, 10:431–443, 2016.
-  W. H. Beluch, T. Genewein, A. Nurnberger, and J. M. Kohler. The power of ensembles for active learning in image classification. IEEE Conference on Computer Vision and Pattern Recognition, 2018.
-  A. Cirillo, F. Ficuciello, L. Sabattini, C. Secchi, and C. Fantuzzi. A conformable force/tactile skin for physical human-robot interaction. IEEE Robotics and Automation Letters, 1:41–48, 2016.
-  J. Donahue, L. A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. IEEE Trans. Pattern Analysis and Machine Intelligence, 39(4), Apr. 2017.
-  M. O. Ernst and M. S. Banks. Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415(6870):429, 2002.
-  C. Fermüller, F. Wang, Y. Yang, K. Zampogiannis, Y. Zhang, F. Barranco, and M. Pfeiffer. Prediction of manipulation actions. International Journal of Computer Vision, 126(2–4):358–374, Apr. 2018.
-  A. Graves, A. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. International Conference on Acoustics, Speech, and Signal Processing, 2013.
-  V. Grosu, S. Grosu, B. Vanderborght, D. Lefeber, and C. Rodriguez-Guerrero. Multi-axis force sensor for human–robot interaction sensing in a rehabilitation robotic device. Sensors, 17:1294, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
-  J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. IEEE Conference on Computer Vision and Pattern Recognition, 2018.
-  W. Hwang and S. Lim. Inferring interaction force from visual information without using physical force sensors. Sensors, 17(11), Oct. 2017.
-  C. T. Landi, F. Ferraguti, L. Sabattini, C. Secchi, and C. Fantuzzi. Admittance control parameter adaptation for physical human-robot interaction. IEEE International Conference on Robotics and Automation, 2017.
-  S. Lim, H. Lee, and J. Park. Role of combined tactile and kinesthetic feedback in minimally invasive surgery. International Journal of Medical Robotics and Computer Assisted Surgery, 11(3):360–374, 2015.
-  Y. Liu, H. Han, T. Liu, J. Yi, Q. Li, and Y. Inoue. A novel tactile sensor with electromagnetic induction and its application on stick-slip interaction detection. Sensors, 16:430, 2016.
-  F. N. Newell, M. O. Ernst, B. S. Tjan, and H. H. Bulthoff. Viewpoint dependence in visual and haptic object recognition. Psychological Science, 12(1):37–42, 2001.
-  T. Pham, A. Kheddar, A. Qammaz, and A. Argyros. Towards force sensing from vision: Observing hand-object interactions to infer manipulation forces. IEEE Conference on Computer Vision and Pattern Recognition, pages 2810–2819, Jun. 2015.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations, 2015.
-  Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. IEEE Conference on Computer Vision and Pattern Recognition, 2014.
-  W. M. B. Tiest and A. M. Kappers. Physical aspects of softness perception. Luca MD (ed) Multisensory Softness, Springer, pages 3–15, 2014.
-  F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang. Residual attention network for image classification. IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  S. Woo, J. Park, J. Lee, and I. Kweon. CBAM: Convolutional block attention module. European Conference on Computer Vision, Sept. 2018.
-  Z. Xu, J. Hu, and W. Deng. Recurrent convolutional neural network for video classification. IEEE International Conference on Multimedia and Expo, Jul. 2016.
-  H. Zhang, R. Wu, C. Li, X. Zang, X. Zhang, H. Jin, and J. Zhao. A force-sensing system on legs for biomimetic hexapod robots interacting with unstructured terrain. Sensors, 17:1514, 2017.
-  X. Zhang, T. Wang, J. Qi, H. Lu, and G. Wang. Progressive attention guided recurrent network for salient object detection. IEEE Conference on Computer Vision and Pattern Recognition, 2018.
-  Y. Zhu, C. Jiang, Y. Zhao, D. Terzopoulos, and S. Zhu. Inferring forces and learning human utilities from videos. IEEE Conference on Computer Vision and Pattern Recognition, pages 3823–3833, Jun. 2016.