Deep Learning for Surface Material Classification Using Haptic And Visual Information

12/21/2015 ∙ by Haitian Zheng, et al. ∙ USTC

When a user scratches a hand-held rigid tool across an object surface, an acceleration signal can be captured, which carries relevant information about the surface. More importantly, such a haptic signal is complementary to the visual appearance of the surface, which suggests the combination of both modalities for the recognition of the surface material. In this paper, we present a novel deep learning method dealing with the surface material classification problem based on a Fully Convolutional Network (FCN), which takes as input the aforementioned acceleration signal and a corresponding image of the surface texture. Compared to previous surface material classification solutions, which rely on a careful design of hand-crafted domain-specific features, our method automatically extracts discriminative features utilizing the advanced deep learning methodologies. Experiments performed on the TUM surface material database demonstrate that our method achieves state-of-the-art classification accuracy robustly and efficiently.




I Introduction

With today’s sensor technology, a wide variety of data types can be captured. Understanding the sensory data and recognizing objects is becoming an important research topic. Recent work in object recognition [1] and indoor scene recognition [2], for instance, demonstrates that combining visual information with depth sensory input improves the classification performance. Taking material surface classification as another example, besides camera sensors, an easily accessible acceleration sensor is able to record vibration signals when it slides over a surface. Such vibration signals capture information about the material properties of the surface, and also reveal rich haptic attributes. These haptic signals are complementary to visual input, providing an opportunity for a better material classification scheme. In this paper, we investigate this multimodal material surface classification problem, which involves both visual and haptic inputs.

Surface material classification has recently gained increasing interest. When a rigid tool slides on a surface, the resulting vibrations of the tool contain information about characteristic properties of the surface [7], such as the hardness and roughness of the material. There has been increasing interest in recognizing surface materials using robots [7][12][14][15] and in recreating the haptic feel of real surfaces. However, surface material classification from tool/surface interaction data becomes particularly challenging if free-hand movements are considered.

Fig. 1: Surface texture data trace, reproduced from [7]. It can be observed that higher velocities (left to right) increase the signal power and variance, while the exerted normal force is held constant.

When a human strokes a rigid tool over an object surface, the exerted normal force, the tangential scan velocity, and the angle between the tool and the surface may vary during the surface exploration and between subsequent exploration sessions. These scan-time parameters strongly influence the nature of the recorded acceleration signals [9]. Fig. 1 shows an example of an acceleration data trace in which the scan velocity increases linearly, and reveals how this change influences the signal power and variance of the trace. The variability of the acceleration signals thus complicates the texture classification process.

Before surface classification using acceleration data (captured while interacting with the surface) emerged [7][12][13][14][15], a significant number of works focused on using photos of material surfaces to classify the material types. These approaches mainly rely on hand-crafted image features, including local binary pattern (LBP) features [4], filter bank features [3][6], and co-occurrence matrix based features [5], in combination with appropriate machine learning tools to distinguish the different texture types.

Designing specific features requires specific domain knowledge. Recently, Convolutional Neural Networks (CNNs) have become a popular tool for pattern recognition, allowing better features to be extracted automatically. In the context of surface classification, [24][27] aim to classify texture image patches by training CNNs, [28] designs Fisher-vector descriptors of texture images using an ImageNet-pretrained CNN, [29] uses a CNN to learn texture image filter banks and [31] designs dynamic texture descriptors utilizing an ImageNet-pretrained CNN. Meanwhile, in the context of haptic classification, our previous work [12] proposes an auto-encoder pre-trained CNN for classifying texture haptic segments (footnote 1: [12], a preliminary version of this work, appeared at MLSP 2015. It mainly focuses on directly dealing with the one-dimensional haptic signal without any preprocessing procedures. As an extension, this work explores a more efficient solution with hybrid inputs, i.e., haptic as well as image signals). However,

  • compared to the texture signal (image or haptic), which can be of arbitrary size, a CNN has a relatively small receptive field. To reconcile these two sizes, an inefficient sliding window based approach needs to be adopted, which might jeopardize the efficiency of a real-time application.

  • regardless of the significant progress in haptic (acceleration)-only or image-only classification, there are few approaches dealing with hybrid data input. For instance, two types of surface materials with similar image appearance can lead to completely different acceleration data, and vice versa. In such cases, better classification performance can be achieved by utilizing both haptic (acceleration) data and image data.

To overcome the inefficiency of the ‘CNN + sliding windows’ scheme, a Fully Convolutional Network (FCN) approach is adopted in our work. FCN [20] is a special type of convolutional neural network which replaces fully connected layers with convolutional layers. Without fully connected layers, the FCN is able to take input of arbitrary size, and outputs label predictions at every receptive field. More importantly, compared to the ‘CNN + sliding windows’ approach, the FCN can be trained and tested more efficiently.

Different from previous approaches that adapt FCNs to vision tasks, we propose a systematic FCN scheme for recognizing surface materials from hybrid data (haptic signals and images). The FCN for haptic data recognition is trained using concepts developed for speech recognition, as haptic data share similar characteristics with speech data [7], [13]; the FCN for image-based texture recognition is trained by fine-tuning the network weights from [17], inspired by transfer learning [43]. Afterwards, additional hybrid networks further integrate the haptic/visual features for better classification performance. Experiments conducted on publicly available surface material datasets [7][25][26] demonstrate the superior performance of our scheme in terms of both efficiency and accuracy for surface material classification. While the concurrent work [16] also applies deep learning to haptic/visual hybrid input-based surface classification, we would like to point out three differences. 1) We handle a very different dataset from [16]. In [16], the material object is assumed to be at the center of a fixed-size image, and the haptic trace is a fixed-length signal recorded while a robot gripper executes specific exploration procedures such as squeeze, hold, etc. In contrast, the samples of the TUM image/haptic texture dataset contain repetitive patterns and can be of arbitrary size. 2) Given the fixed-size dataset, [16] developed a one-shot CNN classification framework. To deal with a dataset with repetitive patterns and arbitrary size, we develop an ‘FCN + max-voting’ framework, where the FCN improves on the naive ‘CNN + sliding windows’ approach in speed, and the max-voting further boosts the FCN prediction accuracy. 3) Although [16] tested image/haptic fusion, they did not perform joint training of the fusion network. In this paper, with the help of the larger dataset, joint training is possible and is performed.

The remainder of the paper is structured as follows. In Section II, we discuss the related work and present CNN and FCN. In Section III, the details of our method are elaborated. Section IV describes the surface classification experiment on the TUM dataset and discusses the additional results. Finally, Section V concludes the paper.

II Related Work

Convolutional neural networks (CNN), a type of trainable multistage feed-forward artificial neural network that extracts a hierarchical feature representation, are a powerful tool for both image and speech recognition. A typical CNN consists of convolution layers, pooling layers, and fully connected layers. The main ingredients of a CNN can briefly be summarized as follows:

  • Convolution layers extract feature maps from the input by applying consecutive convolution operations between the input and trainable kernels, followed by a non-linear activation function.

  • Pooling layers usually follow convolution layers, aiming to reduce both the dimensionality and translation sensitivity of the input feature maps. For image recognition, pooling layers significantly boost spatial translation invariance, while for spectrogram recognition pooling layers lead to temporal-frequency invariance.

  • Fully-connected layers are usually used at the ending stage of a CNN, providing more flexible feature mapping. In a fully-connected layer, the input feature vector is linearly converted into a new feature vector before being fed into a non-linear activation function.

  • Softmax layer is usually used for handling the multiple label regression problem. At the end of the neural network, the softmax layer outputs the normalized exponential of the input vector, which indicates the probability of each label.
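As a small illustration of the softmax layer described in the last item, the following sketch computes the normalized exponential of a score vector with NumPy. The function name and the example scores are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def softmax(scores):
    """Normalized exponential of a 1-D score vector."""
    shifted = scores - np.max(scores)  # subtracting the max avoids overflow
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical scores for three labels.
probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())
```

The outputs are non-negative and sum to one, so they can be read as a probability for each label; the largest score always receives the largest probability.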

The activation function of the CNN can be chosen among the sigmoid function, the tanh function, and the rectified linear (ReLU) function [21], among which ReLU has become more and more popular due to its efficiency in training and its effectiveness in improving the classification performance. Additionally, dropout regularization [22] is commonly used after fully-connected layers; it significantly reduces co-adaptation between features, and hence prevents over-fitting and boosts the classification performance.

CNNs have experienced great success in a wide range of vision tasks, including image recognition [17], [18], detection [49], [50], segmentation [20] and many specific applications such as image aesthetics assessment [32] and eye fixation prediction [33]. In image recognition specifically, with the help of large datasets (e.g., ImageNet), CNN methods [17] [18] [19] have taken the lead in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) since 2012. For a small dataset, learning the millions of parameters of a CNN is usually impractical and may lead to over-fitting. CNNs, however, can still be applied successfully: studies on transfer learning [43] [44] [45] show that a CNN model trained for one specific vision task usually learns a good representation of natural images, which works for other visual recognition tasks as well. Inspired by the success of transfer learning, we investigate in this paper how to handle the surface texture recognition task using the relatively small TUM image dataset [7].

CNNs have also been shown to be a powerful tool for speech recognition. Recent works [34] [35] [36] [37] [38] show that CNNs notably outperform fully-connected deep neural networks (DNNs). The superior performance is attributed to the temporal-frequency translation invariance inherent in CNNs [36]. In addition to speech recognition, CNNs are applied to acoustic recognition tasks such as music genre classification [40], music onset detection [41] and music adversaries [39]. Motivated by the previous work on 1-D speech signal recognition, and by the evidence in [7][13] that the acceleration signals captured during the interaction with an object surface share certain characteristics with speech signals, we investigate a CNN method for our haptic acceleration signal based recognition task.

Though CNNs provide powerful recognition abilities, additional design work is required to handle the special properties of the acceleration signals which describe the surface material. Specifically, the input can be of arbitrary size and contains frequently repeated local patterns. It is thus more desirable to take local signal segments as the input, rather than the entire signal. Intuitively, the simplest texture prediction pipeline could be: 1) convert the acceleration input into segments using sliding windows; 2) predict every segment with a trained CNN; 3) perform max-voting among the multiple CNN predictions. Our experiments show that with a carefully trained CNN, this workflow achieves high prediction accuracy. However, a sliding window based approach is inappropriate for real-world applications, as dense prediction with the CNN is computationally expensive.

In our work, to alleviate this inefficiency, the naive sliding windows approach (steps 1 and 2) is replaced by the FCN. An FCN is a special type of convolutional neural network in which the fully connected layers are replaced with convolutional layers that produce the same output features. The key observation behind the FCN is that the fully-connected layers in a CNN are special convolutional layers: convolutional layers which take a feature map and output another feature map. In contrast to a fully-connected layer, which only takes fixed-length input, the corresponding convolutional layer generalizes to input of arbitrary size, which is extremely beneficial for extracting dense features at every spatial location.

Taking this key observation into account, the FCN replaces the fully-connected layers of a CNN by convolutional layers with the same weights. Given input data of arbitrary size, the FCN first performs convolution and max-pooling operations alternately as in a usual CNN, then performs multiple convolution operations (with dropout), and finally provides label predictions at every receptive field. Compared to the sliding windows approach, which heavily re-computes feature maps within overlapping “windows”, the FCN is significantly less computationally expensive. As further noted in [20], both training and inference of the FCN can be performed by standard neural network approaches, leading to an efficient and systematic scheme.
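The equivalence between a fully-connected layer applied to every sliding window and a convolutional layer with the same weights can be checked numerically. The sketch below does this for a 1-D signal with NumPy; the window length, the number of outputs and the random weights are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
window = 8      # hypothetical receptive field length
n_out = 3       # hypothetical number of output features
signal = rng.standard_normal(20)                 # 1-D input of arbitrary length
weights = rng.standard_normal((n_out, window))   # fully-connected weights
bias = rng.standard_normal(n_out)

def sliding_window_fc(x):
    """Apply the fully-connected layer to every window separately."""
    outputs = [weights @ x[i:i + window] + bias
               for i in range(len(x) - window + 1)]
    return np.stack(outputs)                     # shape: (positions, n_out)

def convolutional_fc(x):
    """Reuse the same weights as n_out convolution kernels."""
    # np.correlate in 'valid' mode slides each kernel without flipping it.
    columns = [np.correlate(x, weights[k], mode="valid") + bias[k]
               for k in range(n_out)]
    return np.stack(columns, axis=1)             # shape: (positions, n_out)

dense_a = sliding_window_fc(signal)
dense_b = convolutional_fc(signal)
print(dense_a.shape, np.allclose(dense_a, dense_b))
```

For a single layer both routes cost about the same. The saving described above appears in a full network, where the convolutional form computes the lower-layer feature maps once instead of once per overlapping window.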

III Proposed CNN Scheme with Hybrid Input

In this section, the proposed CNN scheme with hybrid input will be elaborated by starting with the explanation of hybrid data recording (Section III-A), followed by the surface texture classification schemes (Section III-B) via HapticNet, VisualNet and FusionNet.

III-A Hybrid Data Recording

Fig. 2: Haptic stylus used for texture analysis.

Haptic Acceleration Data In our work, we use the haptic stylus from [13], which is a free-to-wield object with a stainless steel tool-tip, shown in Fig. 2. In [13], a three-axis LIS344ALH accelerometer (ST Electronics) with a range of was applied to collect the raw acceleration data traces. All three axes were combined into one using DFT321 (see [11]). This approach, which preserves the spectral characteristics of the three axes, was adopted in order to reduce the computational effort of feature calculation.

Image Data Unlike the acceleration modality, images are recorded in a static manner. However, differences in distance, rotation, lighting conditions and focus also complicate the collection of uniform images for the surface classification task.

Fig. 3: Materials included in the TUM haptic texture database, freely accessible at

Fig. 3 shows all the used surfaces and their abbreviated names, where the images of the surfaces are taken by the rear camera of a common smartphone (Samsung S4 Mini) and have a resolution of 8 Mega-Pixels. The illumination and viewing conditions differ within the same class of surfaces. For each surface, there are 5 images under daylight and 5 images under ambient light conditions. Viewing direction and camera distance are chosen arbitrarily for each picture, resulting in variations within each individual class. As an example, we choose three textured surfaces and plot the raw acceleration and image data in Fig. 4.

Fig. 4: Example signal traces of the used image and acceleration data recordings. Arbitrary variations during the recording of the acceleration data traces have been applied. Also, the ten images per textured surface were captured under varying distances, light conditions, focus conditions as well as different camera inclinations towards the surface.

III-B Surface Material Classification

Fig. 5: The pipeline of the classification scheme. (a) classification of haptic input (HapticNet); (b) classification of image input (VisualNet); (c) classification of hybrid input (FusionNet).

Haptic Trace/Image Preprocessing As shown in Fig. 1, the haptic acceleration signal usually starts with a short initial impulse signal, produced when the rigid device first touches the surface. It is followed by a much longer movement signal, recorded while the device moves over the surface. As demonstrated by many speech recognition works [36][37][38][40], converting a 1-D signal into the spectral domain helps the CNN achieve translation invariance in both the temporal domain and the frequency domain. Therefore, the 1-D raw acceleration data of the steady-state movement signal is transformed into its spectrogram.

A Hamming window in the time domain is used for framing the haptic signals, where the Hamming window length is set to 500 samples and the window shift is set to 100 samples. At a sampling rate of 10 kHz, this is equivalent to a window size of 50 ms. Following [8], acceleration segments recorded during unconstrained exploration procedures generally stay stationary within such frame sizes. We select the first 50 low-frequency channels from the spectrogram, which preserve most of the energy of the haptic signal. Finally, the spectrogram is normalized such that the response in each channel spans a fixed range between a minimum and a maximum value.
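The preprocessing steps above can be sketched with NumPy as follows. The window length (500), shift (100), number of channels (50) and sampling rate (10 kHz) come from the text. The use of a magnitude spectrum and the normalization of each channel to [0, 1] are assumptions made here for illustration, as is the synthetic test trace.

```python
import numpy as np

def haptic_spectrogram(signal, win_len=500, hop=100, n_channels=50):
    """Hamming-windowed magnitude spectrogram, low-frequency channels only."""
    window = np.hamming(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    frames = np.stack([signal[i * hop:i * hop + win_len] * window
                       for i in range(n_frames)])
    # Keep the first n_channels frequency bins; result is (channels, frames).
    spec = np.abs(np.fft.rfft(frames, axis=1))[:, :n_channels].T
    # Assumption: rescale every channel to [0, 1].
    lo = spec.min(axis=1, keepdims=True)
    hi = spec.max(axis=1, keepdims=True)
    return (spec - lo) / np.maximum(hi - lo, 1e-12)

fs = 10_000                                   # sampling rate from the text
t = np.arange(fs) / fs                        # one second of synthetic data
noise = 0.1 * np.random.default_rng(0).standard_normal(fs)
trace = np.sin(2 * np.pi * 200 * t) + noise   # hypothetical acceleration trace
spec = haptic_spectrogram(trace)
print(spec.shape)
```

With these settings, one second of signal yields 96 frames, and each frequency bin is 20 Hz wide (10 kHz divided by 500 samples).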

Image inputs are resized to half their size during preprocessing. In this way, most of the texture patterns in an image fit within the receptive field of AlexNet [17] (please refer to the VisualNet paragraph for more detail on our visual classification approach).

HapticNet Unlike previous work which applies CNN for haptic surface material recognition, we use a trained FCN to achieve dense prediction for the spectrogram input. FCN replaces the CNN’s fully-connected layers by convolutional layers. Without fully-connected layers, FCN produces dense prediction for the input with arbitrary length. Usually adjacent predictions share overlapping receptive field and intermediate features. Since FCN gets rid of the computational redundancy for the overlapping intermediate features, compared to the approach of ‘CNN + sliding windows’, FCN is much more efficient.

Our proposed FCN for the haptic (acceleration) data, denoted as HapticNet, takes input of arbitrary length, and the output is a sequence of softmax vectors of length 69, representing the categorical probability distributions at different temporal locations (illustrated in Fig. 5(a)). The HapticNet consists of three normal convolution max-pooling layers, and two convolutional layers which replace the CNN’s fully connected layers. The final convolutional layer is followed by the softmax layer, which gives categorical probability distributions at every temporal location. The detailed structure and layer configurations of the HapticNet are shown in Table I.

layer type   patch size / stride   channel size
convolution 3*3 / 1 50
pooling 2*2 / 2 50
normalization - 50
convolution 3*3 / 1 100
pooling 2*2 / 2 100
convolution 3*3 / 1 150
pooling 2*2 / 2 150
convolution 3*3 / 1 200
pooling 2*2 / 2 200
convolution 4*12 / 1 400
dropout - 400
convolution 1*1 / 1 250
dropout - 250
convolution 1*1 / 1 69
softmax - 69
TABLE I: Detailed structure and layer configurations of the proposed HapticNet.

With the output distribution vectors $p_t$ predicted by the FCN at each temporal location $t$, the corresponding class labels $\hat{y}_t$ are obtained by finding the label with maximum probability, i.e.,

$$\hat{y}_t = \arg\max_{c} \; p_t(c), \tag{1}$$

where $p_t(c)$ represents the $c$-th element of vector $p_t$, and $c$ enumerates all texture categories. In order to obtain the class label $\hat{y}$ of the entire haptic signal from the multiple predictions $\hat{y}_t$, a max-voting procedure that selects the class label with the maximum vote is adopted,

$$\hat{y} = \arg\max_{c} \sum_{t} \mathbb{1}\left[\hat{y}_t = c\right]. \tag{2}$$
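The two steps just described, a per-location argmax followed by max-voting over all locations, can be sketched in NumPy as follows. The function names and the toy probabilities are illustrative assumptions; the example uses 3 classes instead of the 69 in the dataset.

```python
import numpy as np

def fragment_labels(probs):
    """Per-location labels: index of the largest probability in each row."""
    return np.argmax(probs, axis=1)

def max_vote(labels, n_classes):
    """Signal-level label: the class that receives the most fragment votes."""
    votes = np.bincount(labels, minlength=n_classes)
    return int(np.argmax(votes))   # ties resolve to the lowest class index

# Toy example: 5 temporal locations, 3 classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.6, 0.3, 0.1],
                  [0.1, 0.6, 0.3]])
labels = fragment_labels(probs)
print(labels, max_vote(labels, n_classes=3))
```

How ties are broken is not stated in the text; picking the lowest class index is simply what `np.argmax` does.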


VisualNet The classification pipeline for image data is similar to the one used for haptic data, i.e., an input image of arbitrary size is fed into an FCN (denoted as VisualNet), which outputs categorical probability distributions of size 69 at every spatial position. As illustrated in Fig. 5(b), the class labels are obtained by applying Eqn. (1) at every spatial position. Then, the class label of the entire image is obtained by applying Eqn. (2) accordingly.

The structure of VisualNet is motivated by AlexNet, which was proposed in [17]. In order to provide dense prediction, the last three fully connected layers are replaced by convolutional layers. In addition, the numbers of output features of these three layers are changed from {4096, 4096, 1000} to {300, 250, 69}, respectively. The output sizes of the first two of these layers are reduced to prevent overfitting. We use the learned convolutional layers from AlexNet to initialize our corresponding top convolution layers, and random weights are used to initialize the remaining convolution layers. The overall pipeline for image classification (VisualNet) is shown in Fig. 5(b); it predicts a distribution vector at every spatial location of the image input.

FusionNet Suppose that two material surfaces A and B are similar in their visual appearance, but completely different with regard to their haptic perception. In such a case, the haptic features complement the visual features and better indicate the surface type, and vice versa. Generally, in a scenario where both haptic and image data can be easily captured, a fusion framework can be used in which the two inputs complement each other, providing richer features for the material surface classification task.

To fuse predictions using both haptic and image data, the FusionNet structure is proposed: image and haptic signals are fed to VisualNet and HapticNet, respectively; then haptic/visual features are randomly sampled a number of times from the convolutional feature maps of HapticNet/VisualNet; finally, each pair of sampled haptic and visual features is concatenated and fed into a convolutional layer with 69 outputs for the final prediction. Given the multiple predictions obtained in this way, max-voting is applied to obtain the final output class label. The structure of the proposed FusionNet is depicted in Fig. 5(c).
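The test-time fusion procedure described above (random sampling of feature pairs, concatenation, a final layer with 69 outputs, then max-voting) can be sketched with NumPy. This is a minimal sketch under stated assumptions: the feature dimensions, the number of locations, and the random weights standing in for the trained final layer are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes = 69
# Hypothetical feature maps, one row per temporal/spatial location.
haptic_feats = rng.standard_normal((40, 250))
visual_feats = rng.standard_normal((120, 250))
# Random stand-in for the trained final layer with 69 outputs.
w = rng.standard_normal((n_classes, 500)) * 0.01
b = np.zeros(n_classes)

def fusion_predict(haptic, visual, n_samples=1000):
    """Sample feature pairs, classify each pair, then max-vote."""
    h_idx = rng.integers(0, len(haptic), size=n_samples)
    v_idx = rng.integers(0, len(visual), size=n_samples)
    fused = np.concatenate([haptic[h_idx], visual[v_idx]], axis=1)
    scores = fused @ w.T + b            # one score vector per sampled pair
    labels = scores.argmax(axis=1)      # same argmax as after a softmax
    votes = np.bincount(labels, minlength=n_classes)
    return int(votes.argmax()), labels

final_label, pair_labels = fusion_predict(haptic_feats, visual_feats)
print(final_label, pair_labels.shape)
```

A layer applied to a single concatenated feature vector is written here as a matrix product, which matches a convolutional layer evaluated at one location.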

Training the FusionNet end-to-end is non-trivial, because the described random sampling procedure is not a common routine for standard deep neural networks. However, note that when the input sizes of FusionNet are equal to the respective receptive field sizes of HapticNet/VisualNet, only one haptic/visual feature can be sampled. In such a case, the random sampling procedure can be ignored, and FusionNet becomes a conventional fusion network. Exploiting this property, FusionNet training can be simplified considerably. During testing, however, the described FusionNet structure is retained. Overall, training of FusionNet is simple, while testing of FusionNet can be as efficient as for our proposed HapticNet and VisualNet.

When training FusionNet, the FusionNet weights are initialized from the pretrained weights of VisualNet and HapticNet. Also during training, a much larger learning rate (10 times to 50 times) can be employed for the final convolution layer to speed up the training. More details about the training of FusionNet can be found in the experimental results section of the paper.

IV Experimental Results and Discussions

Dataset The proposed HapticNet, VisualNet and FusionNet are evaluated using the TUM Haptic Texture dataset [7], which contains 69 texture classes, each consisting of 10 sampled haptic traces and 10 images. For the haptic data, these free-hand sampled traces vary in force and velocity. The image data contain various samples under different lighting conditions.

The dataset is separated into a training set and a testing set using ten-fold cross validation. In each fold, the training set contains 9 haptic traces and 9 images for each texture class, and the testing set contains the one remaining haptic data trace and image of each class. The training set is used to train the proposed network, and the testing set is used to evaluate the performance of the network and the subsequent max-voting scheme. Before the ten-fold cross validation, a train/validation/test split is adopted for tuning the network hyper-parameters.
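The fold construction can be sketched in plain Python as follows. The text does not say how samples are assigned to folds, so holding out the k-th sample of every class in fold k is an assumption made here, and the function name is hypothetical.

```python
def ten_fold_splits(n_classes=69, n_samples=10):
    """Yield (train, test) lists of (class_id, sample_id) pairs, one per fold."""
    for fold in range(n_samples):
        # Assumption: fold k holds out sample k of every class.
        test = [(c, fold) for c in range(n_classes)]
        train = [(c, s) for c in range(n_classes)
                 for s in range(n_samples) if s != fold]
        yield train, test

folds = list(ten_fold_splits())
print(len(folds), len(folds[0][0]), len(folds[0][1]))
```

Each fold then has 69 × 9 = 621 training samples and 69 test samples per modality.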

Training Details Our implementation is based on the Caffe deep learning framework [46], using a computer with 8 GB of RAM and an Nvidia GTX-860M GPU. The training of HapticNet, VisualNet and FusionNet is conducted by applying Adam [48], a gradient-based stochastic optimization algorithm. For training each network, the learning rate is initialized to a base learning rate and then dropped by a constant factor every fixed number of iterations. The choices of these three hyper-parameters, as well as the total number of training iterations for each network, are listed in Table II. Weight decay is also applied. The parameters of Adam are set to the default values proposed by [48], specifically $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$.
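The learning rate schedule described above, a base learning rate dropped by a constant factor every fixed number of iterations, corresponds to a step policy and can be sketched as below. The numeric values are hypothetical and chosen only to show the behaviour; the actual settings are those of Table II.

```python
def step_learning_rate(iteration, base_lr, gamma, step_size):
    """Learning rate after dropping by `gamma` every `step_size` iterations."""
    return base_lr * gamma ** (iteration // step_size)

# Hypothetical values for illustration only.
for it in (0, 999, 1000, 2500):
    print(it, step_learning_rate(it, base_lr=1e-3, gamma=0.1, step_size=1000))
```

The rate stays constant within each block of `step_size` iterations and is multiplied by `gamma` at every block boundary.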

During the testing stage of HapticNet, the preprocessing procedures described in Section III-B are first applied to the haptic traces. The preprocessed data are then input to HapticNet. In the training stage, however, the haptic spectrograms are subsampled to a fixed length before being fed into the network for training. By enforcing subsamples of consistent length when training HapticNet, we are able to increase the mini-batch size and thus speed up training. The sizes of the training subsample and mini-batch are listed in Table II.

Similarly, in the testing stage of VisualNet, the preprocessed image is fed into the network for prediction. In the training stage, the preprocessed images are first subsampled into fixed-size patches, then fed into the network for training. Data augmentation using random rotation is also applied during VisualNet training. The sizes of the training subsample and mini-batch are also listed in Table II.

To train FusionNet, we restrict the size of the haptic and image input to be small enough that HapticNet generates a single prediction for the haptic input and VisualNet generates a single prediction for the image input (as also described in the FusionNet paragraph of Section III-B). The input size and mini-batch settings for FusionNet are listed in Table II. During testing, we set the number of random samples described in the FusionNet paragraph of Section III-B to 1000. We did not observe that other values significantly influence the classification performance.

TABLE II: The hyper-parameters for training HapticNet, VisualNet and FusionNet.
Haptic Classification Fragment Max voting
MFCC + GMM [7] 80.23%
Modified MFCC Decreasing + Naive Bayes
ACNN [12] 81.8%
HapticNet 85.3% 91.0%

Visual Classification Fragment Max voting
TCNN [29] 87.1%
VGG-M-FV [28] 73.7%
AlexNet-FV [28] 78.7%
VisualNet 85.6% 93.3%
VisualNet-TCNN 87.1% 95.5%

Hybrid Classification Fragment Max voting
FusionNet- 95.0% 98.1%
FusionNet- 96.2% 98.4%
FusionNet--TCNN 96.6% 98.4%
FusionNet--TCNN 96.6% 98.8%

TABLE III: The surface classification results using haptic data and / or visual data.

HapticNet The classification accuracy is shown in the Haptic Classification part of Table III. Specifically, we compare our results with several existing methods [7][13][12]. [7] proposes to combine MFCC features with a Gaussian Mixture Model (GMM) for movement phase recognition, which is denoted as ‘MFCC + GMM’ here. The follow-up work [13] carefully discusses various features for representing the movement signal, and various discriminative models for giving predictions. Since [7] uses an “averaging feature” to test the prediction accuracy, which implicitly performs model averaging, its results are listed in the max-voting column. [12] is the first work which uses a CNN to classify the raw haptic data. By comparison, HapticNet achieves superior performance, with a fragment accuracy of 85.3% and a max-voting accuracy of 91.0% under ten-fold cross validation.

VisualNet The classification accuracy of VisualNet is shown in the Visual Classification part of Table III. Specifically, we compare our approach with the T-CNN-3 (denoted as TCNN) in [29] and the descriptor (denoted as VGG-M-FV) in [28], as well as the descriptor obtained by replacing the VGG-M network with AlexNet (denoted as AlexNet-FV). We observe that VisualNet achieves fragment accuracy comparable to TCNN [29], and higher accuracy than VGG-M-FV [28] and AlexNet-FV [28]. To demonstrate that the VisualNet framework is easily adaptable to other neural network designs, we adapt the TCNN into a fully convolutional version, which is denoted as VisualNet-TCNN. VisualNet-TCNN is obtained by retaining the first three convolutional layers of VisualNet, followed by an average pooling layer (similar to TCNN) and three convolutional layers. VisualNet-TCNN achieves the highest accuracy for classifying images of surface materials.

Additionally, we test VisualNet and VisualNet-TCNN on other texture image datasets, including the Kylberg Texture Dataset [25] and KTH-TIPS-2b [26]. Our experimental results are reported in Table IV. Note that the train/test split of the Kylberg Texture Dataset and KTH-TIPS-2b follows [29]. On the Kylberg Texture Dataset, the T-CNN-3 (denoted as TCNN) in [29] and the D descriptor [30] are compared with our approach. In this comparison, VisualNet achieves the best performance, with accuracy similar to, yet slightly higher than, that of VisualNet-TCNN. This is consistent with the conclusion in [29], which suggests that AlexNet slightly outperforms TCNN on the Kylberg dataset. On the KTH-TIPS-2b dataset, TCNN and the VGG-M-FV descriptor [28] are compared with our approach. From the comparison we can conclude that VisualNet achieves competitive performance. We notice that max-voting yields only a slight gain over the fragment accuracy. This is because the test images in the KTH-TIPS-2b dataset are resized to a very small size relative to the receptive field sizes of VisualNet and VisualNet-TCNN.

Kylberg Texture Dataset Fragment Max voting
VisualNet 96.9% 97.8%
VisualNet-TCNN 96.3% 96.9%
TCNN [29] 96.00%
D [30] 82.0%
KTH-TIPS-2b Dataset Fragment Max voting
VisualNet 72.0% 72.1%
VisualNet-TCNN 72.4% 72.4%
TCNN [29] 72.36%
VGG-M-FV [28] 73.3%
TABLE IV: The VisualNet classification results on the Kylberg Texture Dataset and the KTH-TIPS-2b dataset.

FusionNet For the experiments on classifying hybrid (haptic + visual) input, FusionNet is tested with two different settings, which differ in which of the last three convolutional layers of HapticNet/VisualNet supplies the haptic and visual features to be fused (the two resulting networks are listed as the FusionNet rows in the Hybrid Classification part of Table III). As expected, the accuracy is significantly higher with the fusion framework than with haptic-only or image-only predictions. By replacing the visual submodule in FusionNet with VisualNet-TCNN, we obtain two further variants (the FusionNet-TCNN rows of Table III), which achieve accuracy comparable to each other and the highest accuracy overall for classifying surface materials.

Experiment Analysis To better understand the performance of our proposed scheme, we show the fragment classification confusion matrices of HapticNet, VisualNet and FusionNet- in Fig. 6, and the fragment classification accuracy of each material type in Fig. 8. In the confusion matrices in Fig. 6, values between 0 and 1 are represented by colors varying from blue to red. Each row of a confusion matrix gives the probability of a material type being classified into each of the 69 classes. Examining Fig. 6(a) and Fig. 6(b), we make the following observations:

  • The off-diagonal misclassification patterns follow different distributions, implying that HapticNet and VisualNet have quite different behaviors when classifying surface materials.

  • There are some material types which HapticNet usually cannot distinguish well. For example: type 12 (RoofTile) and type 13 (StoneTileVersion1), type 39 (FineArtificialGrass) and type 40 (IsolatingFoilVersion1), type 48 (FoamFoilVersion1) and type 64 (Leather). However, they are more distinguishable using VisualNet. Haptic/image samples of these materials are shown in Fig. 7(a)-(c).

  • Similarly, there are samples which VisualNet cannot distinguish but HapticNet can distinguish well. For example: type 13 (StoneTileVersion1) and type 14 (StoneTileVersion2), type 18 (CeramicPlate) and type 19 (CeramicTile), type 40 (IsolatingFoilVersion1) and type 52 (StyroporVersion1). Haptic/image samples of these materials are shown in Fig. 7(d)-(f).
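The observations above are read off row-normalized confusion matrices. A minimal sketch of how such a matrix can be built and how the most-confused class pairs can be listed is given below; the helper names and the three-class toy labels are hypothetical, and classes are indexed from 0 here rather than by the paper's type numbers.

```python
def confusion_matrix(true_labels, pred_labels, num_classes):
    """Row-normalized confusion matrix.

    Entry [i][j] is the fraction of class-i fragments predicted as class j.
    """
    counts = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, pred_labels):
        counts[t][p] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def most_confused_pairs(cm, top_k=3):
    """Largest off-diagonal entries as (true class, predicted class, rate)."""
    n = len(cm)
    pairs = [(cm[i][j], i, j)
             for i in range(n) for j in range(n)
             if i != j and cm[i][j] > 0]
    pairs.sort(reverse=True)
    return [(i, j, rate) for rate, i, j in pairs[:top_k]]


# Toy labels (made up): classes 0 and 1 are confused, class 2 is not.
true_labels = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
pred_labels = [0, 0, 1, 1, 1, 1, 1, 0, 2, 2, 2, 2]
cm = confusion_matrix(true_labels, pred_labels, 3)
print(most_confused_pairs(cm))
```

Running the same listing on the haptic and visual matrices separately would surface pairs that only one modality confuses, which is the pattern described in the bullets.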

(a) HapticNet
(b) VisualNet
(c) FusionNet-
Fig. 6: The fragment classification confusion matrix (10-fold averages) of (a) HapticNet; (b) VisualNet; (c) FusionNet-.
(a) Type 12 VS 13
(b) Type 39 VS 40
(c) Type 48 VS 64
(d) Type 13 VS 14
(e) Type 18 VS 19
(f) Type 40 VS 52
Fig. 7: (a)-(c) show the material types that HapticNet cannot distinguish well, but VisualNet can distinguish, where (a) is Type 12 (RoofTile) VS 13 (StoneTileVersion1), (b) is Type 39 (FineArtificialGrass) VS 40 (IsolatingFoilVersion1), (c) is Type 48 (FoamFoilVersion1) VS 64 (Leather). (d)-(f) depict the material types that VisualNet cannot distinguish well, but HapticNet can distinguish, where (d) is Type 13 (StoneTileVersion1) VS 14 (StoneTileVersion2), (e) is Type 18 (CeramicPlate) VS 19 (CeramicTile), (f) is Type 40 (IsolatingFoilVersion1) VS 52 (StyroporVersion1).

Fig. 6(c) shows that the above-mentioned samples, which were confused when working with a single modality, are mostly classified correctly for hybrid input. The diagonal values are closer to 1 than in Fig. 6(a) and (b), and fewer off-diagonal values exist. FusionNet- hence effectively removes most misclassifications. The fragment classification accuracies of every material type in Fig. 8 also demonstrate the effectiveness of haptic/visual fusion: FusionNet- achieves consistently high accuracy (vertical axis) across all 69 material types (horizontal axis).

(a) HapticNet
(b) VisualNet
(c) FusionNet-
Fig. 8: The fragment classification histogram (10-fold averages) of (a) HapticNet; (b) VisualNet; (c) FusionNet-.
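The histograms in Fig. 8 show one accuracy value per material type. As a sketch of how such per-class values, and a simple summary of how evenly they are spread, might be computed, consider the following; the helper names, the summary statistics and the toy labels are assumptions made for illustration.

```python
import statistics


def per_class_accuracy(true_labels, pred_labels, num_classes):
    """Fraction of correctly classified fragments for each class."""
    correct = [0] * num_classes
    total = [0] * num_classes
    for t, p in zip(true_labels, pred_labels):
        total[t] += 1
        correct[t] += int(t == p)
    # Classes without any test fragment are reported as None rather than 0.
    return [c / n if n else None for c, n in zip(correct, total)]


def summarize(accuracies):
    """Worst class, mean, and spread over the classes that were observed."""
    seen = [a for a in accuracies if a is not None]
    return {
        "min": min(seen),
        "mean": statistics.mean(seen),
        "spread": statistics.pstdev(seen),
    }


# Toy labels (made up for illustration only).
true_labels = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
pred_labels = [0, 0, 1, 1, 1, 1, 1, 0, 2, 2, 2, 2]
acc = per_class_accuracy(true_labels, pred_labels, 3)
summary = summarize(acc)
print(acc, summary)
```

A high minimum together with a small spread is one way to quantify accuracy that is high across all classes rather than only on average.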

Time profiling To demonstrate the efficiency of our algorithm, we further compare the running time of HapticNet/VisualNet with two related ‘CNN + sliding window’ approaches. For the sliding window approaches, we transfer the weights of HapticNet/VisualNet to two Convolutional Neural Networks which share the same layer settings as HapticNet/VisualNet, except that the convolutional layer is replaced by a fully-connected layer. With exactly the same weights, the sliding window approaches achieve the same fragment/max-voting accuracy as HapticNet and VisualNet. However, because an FCN avoids the redundant feature computation that exists in CNN + sliding window approaches, it should be faster. To validate this, we profile the average running time of HapticNet/VisualNet and of the CNN + sliding window approaches on an Nvidia GTX 860M graphics card. As shown in Table V, for haptic classification, the sliding window-based approach takes 195.4 ms to run, while HapticNet takes only 20.5 ms. For visual classification, the sliding window approach takes 7903.4 ms, while VisualNet takes only 154.3 ms.
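The redundancy argument can be illustrated on a toy one-dimensional network. The sketch below assumes two stride-1 layers with a ReLU in between, which is far smaller than HapticNet or VisualNet; all names, kernels and the input signal are hypothetical. It runs the same weights once over the whole signal and once per sliding window, and counts multiplications in both cases.

```python
def conv1d(x, k):
    """Valid stride-1 cross-correlation; returns (outputs, multiply count)."""
    n = len(x) - len(k) + 1
    out = [sum(x[i + j] * k[j] for j in range(len(k))) for i in range(n)]
    return out, n * len(k)


def relu(values):
    return [max(0, v) for v in values]


def fcn_forward(signal, k1, k2):
    """Fully convolutional pass: each layer runs once over the whole signal."""
    hidden, m1 = conv1d(signal, k1)
    scores, m2 = conv1d(relu(hidden), k2)
    return scores, m1 + m2


def sliding_window_forward(signal, k1, k2):
    """Same weights, but the network is re-run on every overlapping window."""
    window = len(k1) + len(k2) - 1  # receptive field of one output value
    scores, mults = [], 0
    for start in range(len(signal) - window + 1):
        hidden, m1 = conv1d(signal[start:start + window], k1)
        out, m2 = conv1d(relu(hidden), k2)  # exactly one output per window
        scores.append(out[0])
        mults += m1 + m2
    return scores, mults


# Toy signal and kernels (made up for illustration only).
signal = [1, -2, 3, 0, 2, -1, 4, 1, -3, 2]
k1, k2 = [1, 0, -1], [2, 1]

fcn_out, fcn_mults = fcn_forward(signal, k1, k2)
sw_out, sw_mults = sliding_window_forward(signal, k1, k2)
print(fcn_out == sw_out, fcn_mults, sw_mults)  # identical outputs, 38 vs 56
```

Both passes give identical outputs, but the sliding-window version recomputes first-layer features wherever windows overlap. The gap grows with deeper networks and larger receptive fields, which is in line with the larger gain reported for the visual case.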

Haptic classification Average running time
Haptic CNN + sliding window 195.4 ms
HapticNet 20.5 ms

Visual classification Average running time
Visual CNN + sliding window 7903.4 ms
VisualNet 154.3 ms
TABLE V: Profiling comparison between the proposed HapticNet/VisualNet and the CNN + sliding window approaches.
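For reference, the speed-up factors implied by Table V can be worked out directly from the reported running times. The short snippet below does only that arithmetic; the dictionary layout and variable names are illustrative.

```python
# Average running times in milliseconds, copied from Table V.
timings_ms = {
    "haptic": {"sliding_window": 195.4, "fcn": 20.5},
    "visual": {"sliding_window": 7903.4, "fcn": 154.3},
}

# Speed-up = sliding-window time divided by FCN time.
speedups = {
    name: t["sliding_window"] / t["fcn"] for name, t in timings_ms.items()
}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x faster with the FCN")
```

This gives roughly a 9.5x speed-up for haptic classification and roughly a 51.2x speed-up for visual classification.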

V Conclusion and Future Work

We introduce a surface material classification method based on Fully Convolutional Neural Networks. To classify individual haptic or image input, we apply an FCN with a max-voting framework. We then design a fusion network that handles both haptic and image input. Experiments on the TUM Haptic Texture Database demonstrate that our proposed system achieves competitive classification accuracy compared to existing schemes at reduced complexity. In future work, we aim to extend the current FCN + max-voting and hybrid classification schemes to more input types (images, haptic acceleration signals, sound signals, and signals from other modalities) to further improve the flexibility and robustness of our system.


  • [1] A. Wang, J. Lu, J. Cai, T. Cham and G. Wang, Large-margin multi-modal deep learning for RGB-D object recognition, in Multimedia, IEEE Transactions on 17(11), (2015): 1887-1898.
  • [2] J. Tang, L. Jin, Z. Li, S. Gao, RGB-D object recognition via incorporating latent data structure and prior knowledge, in Multimedia, IEEE Transactions on 17(11), (2015): 1899-1908.
  • [3] L. Liu, and P. Fieguth, Texture classification from random features, in Pattern Analysis and Machine Intelligence, IEEE Transactions on 34, no. 3 (2012): 574-586.
  • [4] T. Ojala, M. Pietikäinen, and T. Mäenpää, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, in Pattern Analysis and Machine Intelligence, IEEE Transactions on 24, no. 7 (2002): 971-987.
  • [5] V. Arvis, C. Debain, M. Berducat, and A. Benassi, Generalization of the cooccurrence matrix for colour images: application to colour texture classification, in Image Analysis and Stereology 23, no. 1 (2011): 63-72.
  • [6] M. Varma and A. Zisserman, Texture classification: Are filter banks necessary, in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 691-698, 2003.
  • [7] M. Strese, J.-Y. Lee, C. Schuwerk, Q. Han, H.-G. Kim and E. Steinbach, A haptic texture database for tool-mediated texture recognition and classification, in Proc. of IEEE HAVE, Dallas, Texas, USA, October 2014.
  • [8] H. Culbertson, J. Unwin, B. E. Goodman and K. J. Kuchenbecker, Generating haptic texture models from unconstrained tool-surface interactions, in World Haptics Conference (WHC), pp. 295–300, 2013.
  • [9] J. M. Romano and K. J. Kuchenbecker, Creating realistic virtual textures from contact acceleration data, in IEEE Transactions on Haptics, pp. 109–119, April 2012.
  • [10] J. M. Romano and K. J. Kuchenbecker, Should haptic texture vibrations respond to user force and speed, in IEEE World Haptics Conference, June 2015.
  • [11] N. Landin, J. M. Romano, W. McMahan, and K. J. Kuchenbecker, Dimensional reduction of high-frequency accelerations for haptic rendering, in Haptics: Generating and Perceiving Tangible Sensations, Springer, pp. 79-86, 2010.
  • [12] M. Ji, L. Fang, H. Zheng, M. Strese and E. Steinbach, Preprocessing-free surface material classification using convolutional neural networks pretrained by sparse autoencoder (ACNN), in Proc. of Machine Learning for Signal Processing (MLSP), Boston, USA, September 2015.
  • [13] M. Strese, C. Schuwerk and E. Steinbach, Surface classification using acceleration signals recorded during human free hand movement, in Proc. of IEEE World Haptics Conference, Chicago, USA, June 2015.
  • [14] J. M. Romano and K. J. Kuchenbecker, Methods for robotic tool-mediated haptic surface recognition, in IEEE Haptics Symposium (HAPTICS), Houston, Texas, USA, February 2014.
  • [15] J. A. Fishel and G. E. Loeb, Bayesian exploration for intelligent identification of textures, in Frontiers in neurorobotics, vol. 6, no. 4, pp. 1-20, June 2012.
  • [16] Y. Gao, L. A. Hendricks, K. J. Kuchenbecker, T. Darrell, Deep Learning for Tactile Understanding From Visual and Haptic Data, arXiv preprint arXiv:1511.06065.
  • [17] A. Krizhevsky, I. Sutskever and G. E. Hinton, Imagenet classification with deep convolutional neural networks, in Advances in neural information processing systems (NIPS), pp. 1097-1105. 2012.
  • [18] K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, in arXiv preprint 1409.1556 (2014).
  • [19] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich, Going deeper with convolutions, In arXiv preprint 1409.4842 (2014).
  • [20] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in Computer Vision and Pattern Recognition (CVPR), 2015.
  • [21] V. Nair, and G. E. Hinton, Rectified linear units improve restricted boltzmann machines, in Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.
  • [22] S. Wager, S. Wang, and P. Liang, Dropout training as adaptive regularization, in Advances in Neural Information Processing Systems, pp. 351-359. 2013.
  • [23] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.
  • [24] F. H. C. Tivive, and A. Bouzerdoum, Texture classification using convolutional neural network, in TENCON, 2006 IEEE Region 10 Conference, pp. 1-4. IEEE, 2006.
  • [25] G. Kylberg, Kylberg Texture Dataset v. 1.0., in External report (Blue series), 2011
  • [26] E. Hayman, B. Caputo, M. Fritz, JO. Eklundh, On the significance of real-world conditions for material classification, in Computer Vision-ECCV, 2004.
  • [27] L. G. Hafemann, An analysis of deep neural networks for texture classification, M.Sc. Dissertation, Retrieved from, 2014.
  • [28] M. Cimpoi, S. Maji, A. Vedaldi, Deep filter banks for texture recognition and segmentation, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015: 3828-3836.
  • [29] V. Andrearczyk, P. F. Whelan, Using filter banks in convolutional neural networks for texture classification, arXiv preprint arXiv:1601.02919, 2016.
  • [30] N. Liu, G. Gimel’farb, P. Delmas, High-order MGRF models for contrast/offset invariant texture retrieval, in Proceedings of the 29th International Conference on Image and Vision Computing, 2014.
  • [31] X. Qi, C. G. Li, G. Zhao, X. Hong, M. Pietikäinen, Dynamic texture and scene classification by transferring deep image features, in Neurocomputing, 2016, 171: 1230-1241.
  • [32] X. Lu, Z. Lin, H. Jin, J. Yang and J. Z. Wang, Rating Image Aesthetics Using Deep Learning, in Multimedia, IEEE Transactions on 17(11), (2015): 2021–2034.
  • [33] C. Shen, X. Huang and Q. Zhao, Predicting eye fixations on webpage with an ensemble of early features and high-level representations from deep network, in Multimedia, IEEE Transactions on 17(11), (2015): 2084–2093.
  • [34] O. Abdel-Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, Convolutional neural networks for speech recognition, in Audio, Speech, and Language Processing, IEEE/ACM Transactions on 22, no. 10 (2014): 1533-1545.
  • [35] O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition, in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pp. 4277-4280. IEEE, 2012.
  • [36] T. N. Sainath, A. Mohamed, B. Kingsbury and B. Ramabhadran, Deep Convolutional Neural Networks for LVCSR, In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 8614 - 8618. IEEE, 2013.
  • [37] P. Swietojanski, A. Ghoshal and S. Renals, Convolutional Neural Networks for Distant Speech Recognition, in Hands-free Speech Communication and Microphone Arrays (HSCMA), 2014 4th Joint Workshop on, pp. 172-176. IEEE, 2014.
  • [38] O. Abdel-Hamid, L. Deng and D. Yu, Exploring convolutional neural network structures and optimization techniques for speech recognition, in INTERSPEECH, 2013, ISCA.
  • [39] C. Kereliuk, B. L. Sturm and J. Larsen, Deep learning and music adversaries, in Multimedia, IEEE Transactions on 17(11), (2015): 2059–2071.
  • [40] Q. Kong, X. Feng and Y. Li, Music genre classification using convolutional neural network, Retrieved from
  • [41] J. Schlüter and S. Böck, Improved musical onset detection with Convolutional Neural Networks, in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 6979-6983. IEEE, 2014.
  • [42] J. Schlüter and S. Böck, Improved musical onset detection with convolutional neural networks, in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 6979-6983. IEEE, 2014.
  • [43] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, Learning and transferring mid-level image representations using convolutional neural networks, in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pp. 1717-1724. IEEE, 2014.
  • [44] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng and T. Darrell, DeCAF: A deep convolutional activation feature for generic visual recognition, in arXiv preprint 1310.1531 (2013).
  • [45] M. D. Zeiler and R. Fergus, Visualizing and understanding convolutional networks, in European Conference on Computer Vision (ECCV), 2014, Springer International Publishing, pp. 818-833.
  • [46] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick et al., Caffe: Convolutional architecture for fast feature embedding, in Proceedings of the ACM International Conference on Multimedia, ACM, pp. 675-678, 2014.
  • [47] J. Duchi, E. Hazan, and Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, in Journal of Machine Learning Research, 2011.
  • [48] D. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.
  • [49] R. Girshick, J. Donahue, T. Darrell and J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in Computer Vision and Pattern Recognition, 2014 IEEE Conference on, pp. 580–587. IEEE, 2014.
  • [50] S. Ren, K. He, R. Girshick and J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, in Advances in Neural Information Processing Systems, 2015 Conference on, pp. 91–99. 2015.