Object clustering, i.e., grouping a set of objects in an unsupervised way so that objects in the same group (called a cluster) are more similar to each other than to those in other groups, has attracted a lot of attention in both academic and industrial communities over the past decades. Most current object clustering works [1, 29, 27, 28, 24, 4] aim at recognizing “similar behavior” based on visual information captured by a visual camera (e.g., an RGB or depth camera) or represented by different description methods (e.g., SURF, LBP or deep features). These methods have been successfully applied in statistics, computer vision, biology and psychology [23, 10, 11, 18, 20].
However, most existing object clustering works ignore an important sensing modality, i.e., tactile information (e.g., hardness, force, and temperature), which can compensate for visual information in many practical manipulation tasks [16, 26]. For example, when a robot grasps an apple, the visual information of the apple becomes unobservable due to occlusion by the robot hand, while the tactile information can be easily obtained. Some objects whose appearances are visually similar can hardly be distinguished using visual information alone (e.g., ripe versus unripe fruits), yet they can be easily distinguished by tactile properties (e.g., hardness). Besides, some objects cannot be well distinguished by either visual or tactile information alone. For instance, it is hard to differentiate three visually similar bottles, where two bottles are empty and the remaining one is full of water. Hence, it is beneficial to perform object clustering by fusing the visual and tactile modalities, which complement each other.
To integrate visual with tactile information, a naive solution is to treat the visual and tactile data as single-view data and directly apply existing multi-view clustering methods to the visual-tactile object clustering task. However, the gap between the visual and tactile modalities is very large. On the one hand, the devices used to collect tactile and visual data are different: a tactile sensor obtains data through constant physical contact, while a visual sensor can simultaneously capture multiple different features of an object at a distance. On the other hand, the format, frequency and receptive field differ, since a visual sensor usually perceives color, global shape and rough texture, while a touch sensor is usually used to acquire detailed texture, hardness and temperature. Therefore, how to establish a novel visual-tactile fusion object clustering model that can tackle the intrinsic gap between visual and tactile data is our focus in this work.
To address the challenges mentioned above, we propose a deep Auto-Encoder-like Non-negative Matrix Factorization (NMF) framework for visual-tactile fused object clustering. More specifically, deep NMF constrained by an under-complete Auto-Encoder-like structure is adopted to learn the hierarchical semantics while preserving the local data structure of visual and tactile data in a layer-wise manner. Then, we introduce a graph regularizer to reduce the differences between similar points inside each modality. Furthermore, as a non-trivial contribution, we carefully design a sparse consensus regularizer to tackle the intrinsic gap between visual and tactile data. Specifically, a consensus constraint ties the individual component of each modality to the final consensus representation in order to align the two modalities. It thus acts as a modality-level constraint that supervises the generation of a common subspace, in which the mutual information between visual and tactile data is maximized. To optimize our proposed framework, an efficient alternating minimization strategy is presented. Finally, we conduct extensive experiments on public datasets to evaluate the effectiveness of our framework, which outperforms state-of-the-art methods. The contributions are summarized as follows:
We propose a deep Auto-Encoder-like Non-negative Matrix Factorization framework for visual-tactile fusion object clustering. To the best of our knowledge, this is a pioneering work that combines the visual modality with the tactile modality in the object clustering task.
We develop an under-complete Auto-Encoder-like structure to jointly learn the hierarchical semantics and preserve the local data structure. Meanwhile, we design a sparse consensus regularization to seek a common subspace, in which the gap between the visual and tactile modalities is mitigated and the mutual information is maximized.
To solve our proposed framework, an efficient solution based on an alternating direction minimization method is provided. Extensive experimental results verify the effectiveness of our proposed framework.
This work lies at the intersection of visual-tactile sensing and multi-view clustering. We therefore review related work on both topics in this section.
Vision and touch are the most important sensing modalities for both robots and humans, and they are widely applied in robot tasks [6, 16, 26, 3]. Generally, visual-tactile sensing research can be divided into three main categories: object recognition, 3D reconstruction and cross-modal matching.
Among these, Liu et al. propose a visual-tactile fusion framework to recognize household objects based on a kernel sparse coding method [16]. Yuan and Luo et al. propose a deep learning framework for clothing material perception by fusing visual and tactile information [25]. Ilonen et al. reconstruct 3D models of unknown symmetric objects by fusing visual and tactile information [6]. Wang et al. perceive accurate 3D object shape with a monocular camera and a high-resolution tactile sensor [22]. Yuan et al. propose a multi-input network to connect the visual and tactile properties of fabrics [26]. Li et al. introduce a prediction model based on a conditional generative adversarial network to connect visual and tactile measurements [13]. Although these models have been successfully applied to supervised learning in visual-tactile sensing, their application to object clustering remains insufficiently explored.
Multi-view clustering has shown remarkable success in many real-world applications. Based on standard spectral clustering, co-training [7] and co-regularization [8] are used to enforce consistency across different views. Based on the subspace clustering strategy, Cao and Zhang et al. try to capture complementary information from different views by means of subspace representations [1, 27]. Based on the framework of non-negative matrix factorization and its variants, Li et al. propose a consensus clustering and semi-supervised clustering method based on Semi-NMF [12]. Zhao et al. propose a deep Semi-NMF method for multi-view clustering [29].
The Proposed Method
where is the input feature matrix, is the basis matrix and is the compact representation, respectively. We can obtain the final clustering result by performing standard spectral clustering [19] on . However, in real-world applications, single-layer NMF is not enough to learn the intrinsic data structure, owing to complex data structure and data noise. Zhao et al. show that a deep NMF model has appealing performance in data representation [29]. The deep NMF can be formulated as:
where and represent the basis matrix and representation for the -th layer, respectively. Inspired by this idea, we incorporate the deep NMF architecture into our visual-tactile object clustering framework.
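Because the displayed formulas are not reproduced in this copy, the two factorizations described above can be sketched in their standard textbook form. This is not necessarily the authors' exact notation: the symbols $X$ (input feature matrix), $Z$ (basis matrix), $H$ (compact representation), $Z_i$ and $H_i$ (basis and representation of layer $i$) and $m$ (number of layers) are all assumed here.

```latex
% Single-layer NMF (standard form; notation assumed)
\min_{Z \ge 0,\; H \ge 0} \; \| X - Z H \|_F^2

% Deep NMF with m layers: each representation is factorized again
X \approx Z_1 H_1, \quad H_1 \approx Z_2 H_2, \quad \dots, \quad
H_{m-1} \approx Z_m H_m
\;\;\Longrightarrow\;\;
X \approx Z_1 Z_2 \cdots Z_m H_m
```

Under this reading, the last representation $H_m$ is the low-dimensional code on which clustering would be performed.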
The Proposed Framework
In the visual-tactile fusion object clustering setting, we use as the input data, where is the number of modalities ( is defined as for the visual-tactile clustering task in this work), and represents the -th modality. denotes the feature matrix for the -th modality, represents the dimension of the feature, and denotes the number of data samples. Then, we propose our deep visual-tactile fused object clustering model as follows:
where is the number of layers, and are the regularization parameters. represents the high-level hierarchical semantics of the -th modality.
Moreover, the first and second terms denote the NMF constrained by an under-complete Auto-Encoder-like structure, which is designed to learn the hierarchical semantics while preserving the local structure of the input visual and tactile data. The first term is an under-complete decoder process that keeps the dimension of lower than and thereby forces NMF to learn a more salient feature representation of . The second term is an encoder process that implicitly maintains the local data structure by recovering from . We make the following remarks on the regularization terms.
The graph regularization in the third term is designed to pull nearby points inside each modality closer together. denotes the graph Laplacian matrix for the -th modality, constructed in a -nearest-neighbor manner. By using the eigen-decomposition technique on , i.e., , we obtain: , where . However, the process of collecting tactile or visual data is easily contaminated by environmental change, which leads to noise and outliers in the source data. Meanwhile, the Frobenius norm is sensitive to noise and outliers. We thus replace the Frobenius norm by the -norm, which can jointly remove outliers and uncover more shared representation across the nearby points inside each modality.
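The graph term can be illustrated with a minimal NumPy sketch, under stated assumptions: a symmetric k-nearest-neighbor graph with binary edge weights, the unnormalized Laplacian `L = D - W`, and a representation `H` of shape (components, samples). The function names are hypothetical. The last helper shows why an eigen-decomposition lets the trace form `tr(H L H^T)` be rewritten as a squared Frobenius norm.

```python
import numpy as np

def knn_graph_laplacian(features, k=5):
    """Return the unnormalized Laplacian L = D - W of a symmetric k-NN graph.

    `features` has shape (n_samples, n_dims); binary edge weights are an
    assumption made here for simplicity.
    """
    n = features.shape[0]
    # Pairwise squared Euclidean distances.
    sq_norms = np.sum(features ** 2, axis=1)
    dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    np.fill_diagonal(dist, np.inf)  # exclude self-loops

    # Connect each sample to its k nearest neighbours.
    neighbours = np.argsort(dist, axis=1)[:, :k]
    W = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    W[rows, neighbours.ravel()] = 1.0
    W = np.maximum(W, W.T)  # symmetrize

    D = np.diag(W.sum(axis=1))
    return D - W

def graph_smoothness(H, L):
    """tr(H L H^T) for H of shape (n_components, n_samples)."""
    return float(np.trace(H @ L @ H.T))

def laplacian_sqrt_factor(L):
    """Return B with L = B B^T, from the eigen-decomposition L = U diag(s) U^T."""
    s, U = np.linalg.eigh(L)
    s = np.clip(s, 0.0, None)  # clear tiny negative round-off
    return U * np.sqrt(s)      # scales column j of U by sqrt(s[j])
```

With `B = laplacian_sqrt_factor(L)`, the identity `tr(H L H^T) = ||H B||_F^2` holds, which is the form in which a Frobenius norm can later be swapped for a more robust norm.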
The last term is the consensus regularization, which is designed to tackle the intrinsic gap between visual and tactile data. This term directly measures the similarity between and , where is the best mapping matrix to align to . After aligning to , the -norm constraint calculates the dissimilarity between and in an efficient way. Therefore, this term acts as a modality-level constraint and learns a projection matrix , which projects into the common subspace . In this subspace, the mutual information on each modality is maximized, which ultimately contributes to the object clustering.
Then the objective function Eq. (3) is further reformulated as:
To efficiently solve the optimization problem in Eq. (4), we propose a solution based on the alternating direction minimization algorithm. To reduce the training time, we pre-train each layer to approximate the factor matrices and . In the pre-training process, we first decompose the input data matrix by minimizing , where and . Then we decompose as , where and . is the dimension of layer and is the dimension of layer (the layer size for layer to is denoted as in this paper). This process is repeated until all layers have been pre-trained. Then each layer is fine-tuned by alternating minimization of the proposed framework in Eq. (4). Specifically, the update rules for each variable are as follows.
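The layer-wise pre-training described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a non-negative input matrix of shape (features, samples), uses the classic Lee-Seung multiplicative updates for each single-layer factorization, and the names `nmf` and `pretrain_layers` are hypothetical.

```python
import numpy as np

def nmf(X, rank, n_iter=200, eps=1e-10, seed=0):
    """Factorize non-negative X (d x n) as Z (d x rank) @ H (rank x n)."""
    rng = np.random.default_rng(seed)
    Z = rng.random((X.shape[0], rank))
    H = rng.random((rank, X.shape[1]))
    for _ in range(n_iter):
        # Lee-Seung multiplicative updates keep both factors non-negative.
        H *= (Z.T @ X) / (Z.T @ Z @ H + eps)
        Z *= (X @ H.T) / (Z @ H @ H.T + eps)
    return Z, H

def pretrain_layers(X, layer_sizes):
    """Greedy layer-wise pre-training: X ~ Z1 H1, H1 ~ Z2 H2, ..."""
    bases, current = [], X
    for size in layer_sizes:
        Z, H = nmf(current, size)
        bases.append(Z)
        current = H  # the next layer factorizes this representation
    return bases, current
```

For example, `pretrain_layers(X, [100, 50])` would return two basis matrices and the final 50-dimensional representation, which then serve as the starting point for fine-tuning.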
Update rule for :
With other variables fixed, we have the following Lagrangian objective function:
where , and is set as when . Setting the derivative to zero and applying the Karush-Kuhn-Tucker (KKT) conditions, we have:
This process converges because this is a fixed point equation. Then we obtain the update rule as:
where represents the element-wise product.
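To make the role of the KKT conditions and the element-wise product concrete, here is a small sketch for plain single-layer NMF with the basis held fixed. It is not the update rule of the proposed framework, which is not reproduced in this copy; it uses the classic Lee-Seung rule, and the function names are hypothetical. At a KKT point, complementary slackness requires `H * gradient = 0` element-wise, so the norm of that product is a convenient convergence diagnostic.

```python
import numpy as np

def update_H(X, Z, H, eps=1e-10):
    """One multiplicative update of H for min ||X - Z H||_F^2, H >= 0."""
    return H * (Z.T @ X) / (Z.T @ Z @ H + eps)

def kkt_residual(X, Z, H):
    """Norm of H * gradient; zero at a KKT point (complementary slackness)."""
    grad = 2.0 * (Z.T @ Z @ H - Z.T @ X)
    return float(np.linalg.norm(H * grad))

rng = np.random.default_rng(0)
X = rng.random((12, 20))
Z = rng.random((12, 4))  # basis held fixed, as in one alternating step
H = rng.random((4, 20))

residuals = [kkt_residual(X, Z, H)]
for _ in range(500):
    H = update_H(X, Z, H)
    residuals.append(kkt_residual(X, Z, H))
```

Because the update only multiplies `H` by a non-negative ratio, non-negativity is preserved without any explicit projection step.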
Update rule for :
Following a derivation similar to that in [29], we can formulate the update rule for as follows:
Update rule for and :
Solving for these variables is challenging, since it is hard to obtain explicit solutions directly. We thus introduce two auxiliary variables and to transform the optimization problem in Eq. (4), and obtain the following objective function:
After converting Eq. (9) to an augmented Lagrangian function, we obtain the following expression:
where , and are the Lagrangian multipliers, initialized as zero matrices; , and are the penalty parameters; is the slack variable used to satisfy the non-negative constraint for . We then employ the alternating direction method of multipliers to solve this problem, and the update rules are as follows.
Update rule for : With other variables fixed, we can have the following Lagrangian objective function:
Setting the derivative with respect to to zero, we obtain:
Since Eq. (12) is a standard Sylvester equation, it can be efficiently solved by the Bartels-Stewart algorithm.
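As an illustration of what solving a Sylvester equation involves, the sketch below solves the generic form `A X + X B = C` by vectorization. The coefficient matrices of Eq. (12) are not reproduced in this copy, so `A`, `B` and `C` are placeholders. Note that this Kronecker-product approach is not the Bartels-Stewart algorithm and scales poorly; it is used here only because it needs nothing beyond NumPy. In practice a library routine such as `scipy.linalg.solve_sylvester`, which is documented as using Bartels-Stewart, would be the natural choice.

```python
import numpy as np

def solve_sylvester_kron(A, B, C):
    """Solve A X + X B = C for X by vectorization (small problems only)."""
    m, n = C.shape
    # vec(A X) = (I_n kron A) vec(X) and vec(X B) = (B^T kron I_m) vec(X),
    # where vec() stacks columns (Fortran order).
    K = np.kron(np.eye(n), A) + np.kron(B.T, np.eye(m))
    x = np.linalg.solve(K, C.reshape(-1, order="F"))
    return x.reshape((m, n), order="F")
```

A unique solution exists when `A` and `-B` share no eigenvalue, which is what makes the linear system above non-singular.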
Update rule for and : With other variables fixed except for , we can have the following Lagrangian objective function:
Setting the derivative to zero, we obtain the following update rule:
where denotes the Moore-Penrose pseudo-inverse.
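To show how a pseudo-inverse arises in such an update, consider the simplest related subproblem: minimizing `||X - Z H||_F^2` over `Z` with `H` fixed and no constraint on `Z`. This is only an assumed stand-in for the paper's update, whose exact expression is not reproduced here and which contains additional terms; the function name is hypothetical.

```python
import numpy as np

def update_basis(X, H):
    """Closed-form minimizer of ||X - Z H||_F^2 over Z, i.e. Z = X H^+."""
    return X @ np.linalg.pinv(H)
```

Setting the gradient `2 (Z H - X) H^T` to zero gives `Z H H^T = X H^T`, and `X H^+` is the minimum-norm solution of that equation even when `H H^T` is singular.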
Similarly, can be updated with the following rule:
Update rule for and : and are solved in a way similar to , and we thus obtain the following update rules. The update rule for is written as follows:
where is a diagonal matrix with the i-th diagonal element as , is the -th row of the matrix , and is the identity matrix.
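A diagonal reweighting matrix built from row norms is the usual device for handling a row-sparse norm inside an otherwise quadratic update. The sketch below assumes that the norm in question is the ℓ2,1-norm and that the elided diagonal entry is the customary `1 / (2 * ||e_i||_2)`; both are assumptions, since the expressions are missing from this copy, and the names are hypothetical.

```python
import numpy as np

def l21_norm(E):
    """Sum of the Euclidean norms of the rows of E."""
    return float(np.sum(np.linalg.norm(E, axis=1)))

def reweighting_matrix(E, eps=1e-8):
    """Diagonal D with D_ii = 1 / (2 * ||e_i||_2), e_i the i-th row of E."""
    row_norms = np.linalg.norm(E, axis=1)
    # eps guards against division by zero for all-zero rows.
    return np.diag(1.0 / (2.0 * np.maximum(row_norms, eps)))
```

With `D` recomputed from the current `E`, the quadratic surrogate `tr(E^T D E)` equals half of the ℓ2,1-norm of `E`, which is why alternating between updating `D` and solving the quadratic problem can be used to minimize the non-smooth norm.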
The update rule for can be written as follows:
We have now obtained all the update rules. We summarize the overall update process of the proposed framework in Algorithm 1.
After obtaining the optimized , we obtain the final clustering result by performing standard spectral clustering on .
For the computational complexity, our proposed model consists of two stages, i.e., the pre-training stage and the fine-tuning stage. To simplify the analysis, we suppose that all the layers have the same number of hidden units. In the pre-training stage, the computational complexity is , where is the number of modalities, is the number of layers, is the layer size, is the feature dimension, is the number of samples and is the number of iterations needed to achieve convergence in the pre-training process. In the fine-tuning stage, the computational complexity is , where is the number of iterations. Thus, the total time complexity is .
In this section, we evaluate the performance of our proposed model via several empirical comparisons. We first provide the used datasets and experiment results, followed by some analyses about our model.
Extensive experiments are conducted on two visual-tactile fusion datasets and one benchmark dataset to evaluate our proposed model: 1) PHAC-2 dataset (http://people.eecs.berkeley.edu/ yg/icra2016): it contains color images and tactile signals of household objects. In this paper, we utilize all images and the first 8 tactile signals. 4096-D visual and 2048-D tactile features are extracted in a similar way as [5]. 2) GelFoldFabric dataset (http://people.csail.mit.edu/yuan_wz/fabric-perception.htm): it contains color images and tactile images of kinds of fabrics. More details about this dataset can be found in [26]. In this paper, we use the pre-trained VGG-19 net to extract 4096-D features for both tactile and visual images. 3) Yale dataset (http://vision.ucsd.edu/content/yale-face-database): it contains images of subjects and is employed to evaluate the performance of the proposed framework when the number of modalities of the input data is more than 2. Similar to [29], three kinds of features (i.e., 3304-D LBP, 4096-D intensity, 6750-D Gabor) are extracted as different views.
Comparison Models and Evaluation
We compare our proposed framework with 7 multi-view baselines and 4 related single-view baselines. Related single-view clustering competitors: Vision (Touch) performs standard spectral clustering [19] on the visual (tactile) features; ConcatFea concatenates all features first and then carries out standard spectral clustering; ConcatPCA concatenates all the features, applies PCA to project the concatenated features into a low-dimensional subspace, and then performs standard spectral clustering on the projected features. Multi-view clustering competitors: Co-Reg [8] enforces agreement between different views by co-regularizing the clustering hypotheses; Co-Training [7] works on the hypothesis that the true underlying clustering would assign a point to the same cluster irrespective of the view; Min-D [2] creates a bipartite graph based on the “minimizing-disagreement” idea; Multi-NMF [17] utilizes non-negative matrix factorization to seek the common latent subspace for multi-view input data; DiMsc [1] utilizes a diversity term to explore the complementary information of multi-view data; DNMF-MVC [29] proposes a deep non-negative matrix factorization framework to capture the mutual information of multi-view data; GLMSC [27] simultaneously seeks the underlying representation and explores complementary information of multi-view data.
Similar to [1, 29], six different metrics, i.e., accuracy (ACC), normalized mutual information (NMI), Precision, F-score, Recall and adjusted Rand index (AR), are adopted to evaluate the clustering performance. A higher value indicates better performance for all metrics. We run all algorithms times and report the mean values along with standard deviations. Table 1 and Table 2 show the object clustering results on the PHAC-2 dataset and the GelFabric dataset, respectively. Table 3 shows the results on the Yale dataset. BestSV performs standard spectral clustering on the features in each view and reports the best performance. To avoid overfitting, the maximum number of iterations is set to 150 for all experiments.
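Clustering accuracy differs from classification accuracy in that cluster ids are arbitrary, so predicted clusters must first be matched to ground-truth classes. The following sketch illustrates the idea by brute force over all one-to-one mappings. The function name is hypothetical, and the exhaustive search is an assumption made for brevity; the Hungarian algorithm is the usual choice when the number of clusters is not tiny.

```python
from itertools import permutations

def clustering_accuracy(true_labels, pred_labels):
    """Best-match accuracy between cluster ids and class labels.

    Tries every one-to-one mapping of clusters to classes, so it is only
    practical for a handful of clusters.
    """
    classes = sorted(set(true_labels))
    clusters = sorted(set(pred_labels))
    if len(clusters) > len(classes):
        raise ValueError("more clusters than classes is not handled here")
    best = 0
    for assigned in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        hits = sum(mapping[p] == t for t, p in zip(true_labels, pred_labels))
        best = max(best, hits)
    return best / len(true_labels)
```

A clustering that recovers the true groups under different ids therefore scores 1.0, for instance `clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 2, 2, 0, 0])`.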
From the presented results, we make the following observations. Our framework achieves very competitive performance compared with all the competing models, which demonstrates its effectiveness in the object clustering task. Specifically, the results in Table 1 and Table 2 reveal the importance of fusing visual and tactile information, compared with the models using visual (or tactile) information alone. They also show that our framework utilizes the visual and tactile information more effectively than state-of-the-art methods. The results in Table 3 further show that our framework is not limited to the two-modality (i.e., visual-tactile fusion) case and can be applied to other applications with more than two modalities.
Ablation Study and Convergence Analysis
In this subsection, we analyze the proposed framework from three perspectives. Firstly, we analyze the effectiveness of the proposed Auto-Encoder-like structure, graph regularization and the consensus regularization. Then, we analyze the parameter setting, followed by the convergence analysis.
Effectiveness of Auto-Encoder-like Structure, Graph Regularization and Consensus Regularization: Figure 2 presents the effectiveness of each component, from which we draw the following conclusions. Overall, “Ours” achieves the best performance, revealing that all the regularization terms and the Auto-Encoder-like structure proposed in this paper contribute to learning the rich information shared by multi-modality data, which further boosts the clustering performance. Specifically, “AE” performs better than “None”, indicating that the proposed Auto-Encoder-like structure, which takes the preservation of local data structure into account, yields a better representation of the source data. “GR” performs better than “None”, revealing the effectiveness of the graph regularization, which pulls nearby points closer together and removes outliers inside each modality. “CR” performs better than “None”, revealing that the proposed consensus regularization can bridge the gap between visual and tactile data and ultimately boost the clustering performance.
Parameter Analysis: To explore the effect of the parameters, i.e., the control parameters and and the layer size , we use the PHAC-2 dataset in this subsection. Specifically, Figure 3 shows how the ACC and NMI results vary w.r.t. the parameter under different layer sizes. As can be seen, under three different layer sizes, the framework performs best in both ACC and NMI when is set as . We thus set as the default in this paper. Figure 4 explores the sensitivity of the proposed framework w.r.t. the parameter under different layer sizes. In this experiment, is set as . Notice that the framework performs best in both ACC and NMI when is set as , so is set as the default. Figure 3 and Figure 4 also show how model performance varies with the layer sizes. We find that the setting of always leads to the best performance. When the layer size is small, the framework is insufficient to learn the rich information behind the input data; when the layer size is too large, it might introduce undesirable noise. This may be why the red curves (i.e., layer size is ) perform better than the blue curves (i.e., ) and the green curves (i.e., ).
Convergence Analysis: Although we have not theoretically proved that the proposed framework converges, we demonstrate its convergence empirically in Figure 5. The objective value and ACC are plotted with the default parameters, i.e., , and layer size = , in this experiment. Notice that the objective value gradually decreases until it converges after iterations. ACC has two stages: in the first stage, ACC increases rapidly; in the second stage, ACC grows slowly and fluctuates slightly until reaching the best performance.
In this paper, we propose a deep Auto-Encoder-like NMF framework for visual-tactile fusion object clustering. By constraining the deep NMF architecture with an under-complete Auto-Encoder-like structure, our framework can jointly learn the hierarchical semantics of visual-tactile data and maintain the local structure of the source data. For each modality, a graph regularization is adopted to pull nearby points closer together and remove outliers. To create a common subspace in which the gap between visual and tactile data is bridged and the mutual information between the two modalities is maximized, a sparse consensus regularization is developed. Extensive experimental results on two visual-tactile fusion datasets and one benchmark dataset confirm the effectiveness of our framework compared with existing state-of-the-art works.
[1] (2015) Diversity-induced multi-view subspace clustering. In CVPR, pp. 586–594.
[2] (2005) Spectral clustering with two views. In ICML Workshop, pp. 20–27.
[3] (2019) Semantic-transferable weakly-supervised endoscopic lesions segmentation. In ICCV, pp. 2304–2310.
[4] (2020) Lifelong spectral clustering. In AAAI.
[5] (2016) Deep learning for tactile understanding from visual and haptic data. In ICRA, pp. 536–543.
[6] (2014) Three-dimensional object reconstruction of symmetric objects by fusing visual and tactile sensing. IJRR 33 (2), pp. 321–341.
[7] (2011) A co-training approach for multi-view spectral clustering. In ICML, pp. 393–400.
[8] (2011) Co-regularized multi-view spectral clustering. In NeurIPS, pp. 1413–1421.
[9] (2001) Algorithms for non-negative matrix factorization. In NeurIPS, pp. 556–562.
[10] (2017) Sparse subspace clustering by learning approximation ℓ0 codes. In AAAI.
[11] (2017) Projective low-rank subspace clustering via learning deep encoder. In IJCAI.
[12] (2007) Solving consensus and semi-supervised clustering problems using nonnegative matrix factorization. In ICDM, pp. 577–582.
[13] (2019) Connecting touch and vision via cross-modal prediction. In CVPR, pp. 10609–10618.
[14] (2011) Constrained nonnegative matrix factorization for image representation. TPAMI 34 (7), pp. 1299–1311.
[15] (2018) Robotic tactile perception and understanding: a sparse coding method. Springer.
[16] (2016) Visual–tactile fusion for object recognition. TASE 14 (2), pp. 996–1008.
[17] (2013) Multi-view clustering via joint nonnegative matrix factorization. In ICDM, pp. 252–260.
[18] (2018) Multi-modal joint clustering with application for unsupervised attribute discovery. TIP 27 (9), pp. 4345–4356.
[19] On spectral clustering: analysis and an algorithm. In NeurIPS, pp. 849–856.
[20] (2019) Representative task self-selection for flexible clustered lifelong learning. arXiv.
[21] A deep semi-nmf model for learning hidden representations. In ICML, pp. 1692–1700.
[22] (2018) 3d shape perception from monocular vision, touch, and shape priors. In IROS, pp. 1606–1613.
[23] (2013) Constrained clustering and its application to face clustering in videos. In CVPR, pp. 3507–3514.
[24] Deep spectral clustering using dual autoencoder network. In CVPR, pp. 4066–4075.
[25] (2018) Active clothing material perception using tactile sensing and deep learning. In ICRA, pp. 1–8.
[26] (2017) Connecting look and feel: associating the visual and tactile properties of physical materials. In CVPR, pp. 5580–5588.
[27] (2018) Generalized latent multi-view subspace clustering. TPAMI.
[28] (2018) Binary multi-view clustering. TPAMI 41 (7), pp. 1774–1782.
[29] (2017) Multi-view clustering via deep matrix factorization. In AAAI, pp. 11108–1113.