With the rise of deep learning, the community has witnessed significant strides in 3D object recognition over the last decade. Most existing 3D object recognition models were trained on datasets that consist of dense, clutter-free, canonicalized 3D data. A recent study [taghanaki2020robustpointset] showed that most of the existing STate-Of-the-Art (STOA) models, including the ones proposed in [qi2017pointnet, qi2017pointnet++, uy2019revisiting, taghanaki2020pointmask, liu2019densepoint, li2018pointcnn, wu2019pointconv, liu2019relation, poulenard2019effective, you2018pointwise], perform significantly worse on more challenging sets, where the data contain noise, missing parts, sparser points, etc. In stark contrast to those models, humans are capable of making reliable decisions with a limited sequence of exploratory movements that provide the most information for the task, regarding what the object is and what to expect from possible exploratory movements based on prior knowledge [klatzky1995identifying]. This capability of recognizing objects from a limited number of local tactile cues is defined as ‘haptic glance’ [rouhafzay2019object, klatzky1999haptic].
Robots are envisioned to replace humans in dangerous or inaccessible tasks [takayama2008beyond], and are applied in many real-life scenarios. They are of great practical value, especially for exploring visually unreachable areas, where only limited visual information can be obtained. Specifically, robots could be applied to locate, identify, and manipulate/interact with objects under the ground, in rivers/drains (for under-river salvage), etc., with different well-chosen movement schemes, as depicted in Fig. (a). This type of robotic application is known as ‘haptic exploration’. The fundamental bottlenecks within the haptic exploration framework are that 1) only a limited number of points can be collected, especially in our case where only one tactile sensor is utilized, which results in sparse 3D representations; 2) not all exploration trials successfully reach a target object, which introduces noise into the collected data; 3) exploration may only provide partial information about the object, resulting in missing object parts. In this study, to overcome the aforementioned obstacles and further endow the robot with the capability of ‘haptic glance’, we propose a novel reinforcement learning based framework to learn a sparse yet efficient 3D representation.
2 The Proposed Framework
2.1 Simulation of Target Use Case Scenario
Similar to the setup designed in [Fleer2020], an in-house simulator was developed to simulate the real-life robot exploration scenario and to facilitate the training and testing procedures of the proposed framework. Concretely, a fixed robot was simulated to conduct haptic exploration via sequential tactile probes for 3D recognition, mimicking a real-life robotic scenario where only limited visual data can be obtained. An example is depicted in Fig. (a). Within the simulated environment, 3D objects can be placed and accessed by a robotic hand via haptic probes. The robot hand is equipped with a tactile sensor that measures , as illustrated in Fig. (a), and sends back pressure-sensation information indicating whether the current exploration touches an object or not.
Each 3D point is represented by a three-dimensional coordinate , with the origin of the corresponding coordinate system set as shown in Fig. (b). Since we aim at recognizing possibly invisible objects positioned under the ground, in a river/drain, etc., each exploration point is constrained to reach the surface of the ground/river, i.e., . As such, for each exploration, the starting point is represented by this surface position and the corresponding orientation . Each is defined with the coordinate , where , and with 3 components .
In most practical cases, robots scan unexplored areas line by line. To ease the transition between simulation and real-life robotic operation, haptic exploration in this study was conducted in a similar fashion. In other words, a certain constant value is first selected, and then the robot hand explores along the axis progressively. This further defines the line of all possible entry points, i.e., by projecting to the axis, which is named the ‘probe line’, as highlighted in Fig. (b). is constrained to be negative, so that probes only fall in the areas where a 3D object may be positioned. For each exploratory probe , when the sensor reaches an object, the touched point is calculated within the ray-casting system [pfister1999volumepro] regarding . In addition, the variable is set to one when the sensor touches the object. The 3D object can be placed at any initial position with any rotations and , as illustrated in Fig. 2.
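The probing step described above can be sketched as a single ray cast. This is a minimal illustration only, not the simulator used in this study: it assumes the surface lies at z = 0, that the object sits below it, and that the object is a sphere. The function name `cast_probe` and all of its parameters are hypothetical.

```python
import numpy as np

def cast_probe(entry, direction, center, radius):
    """Cast a ray from `entry` along `direction` against a sphere.

    Returns (touched, point): touched is 1 with the first contact point
    if the ray hits the sphere, otherwise 0 with None.
    """
    entry = np.asarray(entry, dtype=float)
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)                  # unit probe direction
    oc = entry - np.asarray(center, dtype=float)
    # Solve |entry + t*d - center|^2 = radius^2 for t (quadratic with a = 1).
    b = 2.0 * np.dot(d, oc)
    c = np.dot(oc, oc) - radius ** 2
    disc = b * b - 4.0 * c
    if disc < 0.0:
        return 0, None                         # the probe misses the object
    t = (-b - np.sqrt(disc)) / 2.0             # nearest intersection
    if t < 0.0:
        return 0, None                         # no first contact ahead of the entry point
    return 1, entry + t * d

# A probe entering at the origin and pointing straight down.
touched, point = cast_probe([0.0, 0.0, 0.0], [0.0, 0.0, -1.0],
                            center=[0.0, 0.0, -2.0], radius=1.0)
```

A probe that misses returns a zero touch flag, which corresponds to the failed exploration trials that introduce noise into the collected data.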
2.2 The Reinforcement Haptic Glance Framework
The overall diagram of the proposed reinforcement learning based haptic exploration framework is illustrated in Fig. 3. Each probe, i.e., haptic glance, during the haptic exploration can be parameterized by its position and orientation . In this study, the haptic control is composed of two parts: (1) a low-level haptic explorer described by Gaussian distributions and , which are parameterized by ; and (2) a high-level reinforcement learner that learns the aforementioned exploration parameters and aims to ensure the success of 3D recognition. Haptic control and 3D object recognition are achieved via 4 main modules: (a) the Point Cloud Representation Network (PCRN) with Per Point Context Aware Representation Blocks (P2-CARB), (b) the location network, (c) the classifier, and (d) the reinforcement learning scheme. Details of each module are given below.
(a) The PCRN: the objective of the framework is to iteratively sample points via haptic exploration in order to classify a target object among possible object categories. Formally, for each step between 1 and , the framework first casts a probe request (, ), and the simulator returns a 3D point (). Each new probe request is stored in a probe request sequence . Additionally, each corresponding point sample is stored in the collected points sequence . After each probe, the two sequences are embedded into one mutual representation space by employing the P2-CARB. P2-CARB is a permutation invariant network that is based on the PointGrow Context Aware (CA) operation with the Self-Attention Context Awareness (SACA) operation [sun2020pointgrow]. Concretely, after taking feature vectors as input, the CA operation outputs feature vectors. Each output vector contains information about all the previously collected points aggregated by mean pooling, i.e., a permutation invariant operation. SACA is a self-adaptive version of CA, which outputs, for each point, a weighted aggregation of the previously collected points . The self-attention weights are learned by a Multi-Layer Perceptron (MLP), taking a concatenation of point features and context aware features as input. Thus, the SACA-A operation is adopted in our framework. The P2-CARB is detailed at the top right of Fig. 3.
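The two aggregation operations can be sketched as follows. This is a simplified illustration under stated assumptions, not the PointGrow or P2-CARB implementation: the context of point i is taken over points 0..i inclusive, and a single linear scorer `w` stands in for the learned MLP. The function names are hypothetical.

```python
import numpy as np

def context_aware(feats):
    """CA-style pooling: row i is the mean of feats[0..i]."""
    counts = np.arange(1, feats.shape[0] + 1)[:, None]
    return np.cumsum(feats, axis=0) / counts

def self_attention_context(feats, w):
    """SACA-style pooling: row i is a weighted mean of feats[0..i].

    The weights come from a linear scorer `w` (a stand-in for the MLP)
    applied to the concatenation of point and context features.
    """
    ctx = context_aware(feats)
    scores = np.concatenate([feats, ctx], axis=1) @ w    # shape (n,)
    weights = np.exp(scores - scores.max())[:, None]     # strictly positive
    return np.cumsum(weights * feats, axis=0) / np.cumsum(weights, axis=0)

feats = np.array([[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]])
ctx = context_aware(feats)
attn = self_attention_context(feats, np.zeros(4))
```

With an all-zero scorer every weight is equal, so the weighted aggregation reduces to plain mean pooling. Within each prefix, the pooled value does not depend on the order of the points it covers.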
Each MLP layer represents a set of MLPs sharing the same weights, where each MLP processes its own input point independently. Each MLP is constructed with a series of two fully connected layers. Intermediate context aware representations of the probe request sequence and the collected points sequence are extracted with 2 respective P2-CARBs. These representations are then concatenated and fed into the corresponding MLP, which allows the framework to learn the mutual context aware representation of requests and the relevant previously collected points. The whole PCRN process is summarized in Fig. 3, where the dimensions of each intermediate feature embedding are highlighted in green. The probe request sequence is of size , and is constrained to be a constant while is constrained to be 0. Then, remains the only variable to be controlled. Along with the three components of orientation, the request vector has 4 components. Taking the mutual representation as input, the classifier seeks to classify the object correctly, and the location network aims at predicting the next location to be probed so that the classification accuracy of the next iteration can be optimized.
(b) The Location Network is designed based on PointGrow [sun2020pointgrow] to iteratively predict the next position to explore by computing the conditional distribution given all the previously generated points. In particular, each new probe request is generated from a conditional distribution over the previous requests associated with their corresponding collected points. For each probe request , each component of the request is also generated as a conditional probability of the components of the previous probes using the masking mechanism illustrated at the bottom right of Fig. 3:
As mentioned previously, the low-level haptic explorer is parameterized by . For the sake of readability, we simply denote and as , and similarly, and as . For each component of the probe request, feature vectors are extracted and aggregated by mean pooling, and are then used to predict the corresponding and parameters. Afterwards, the predicted and are further activated using the tanh and sigmoid activation functions, respectively. Hence, with the and , a stochastic prediction can be made for each component of the next probe request.
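The stochastic prediction step can be sketched as below. This is an illustration only: it assumes the two Gaussian parameters are a mean and a standard deviation, and that one request has 4 components. The name `sample_probe_request` and the raw inputs standing in for the location network outputs are hypothetical.

```python
import numpy as np

def sample_probe_request(raw_mu, raw_sigma, rng):
    """Squash raw network outputs, then sample one probe request."""
    mu = np.tanh(raw_mu)                        # mean, bounded to (-1, 1)
    sigma = 1.0 / (1.0 + np.exp(-raw_sigma))    # spread, bounded to (0, 1)
    request = rng.normal(mu, sigma)             # one sample per component
    return request, mu, sigma

rng = np.random.default_rng(0)
request, mu, sigma = sample_probe_request(np.zeros(4), np.zeros(4), rng)
```

The tanh keeps the mean inside a normalized range, while the sigmoid keeps the spread strictly positive, so the Gaussian is always well defined.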
(c) The 3D object Classifier: two versions of the classifier have been developed using the PCRN representation, as represented in Fig. 4. The first version, namely PCRN-FC, aggregates the information of all the probes (all the collected 3D points) by mean pooling and feeds it to a series of two Fully Connected (FC) layers for 3D object classification. The PCRN-N-class version consists of N classifiers, where N is the number of probes/glances performed during exploration. With the designed ‘probe mask’, i.e., a 0/1 matrix that outputs only the information of the current glance, each classifier is dedicated to classifying 3D objects regarding the corresponding probe.
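The difference between the two classifier inputs can be sketched as follows. This is a minimal illustration that assumes the PCRN representation is a matrix with one row per probe and that the ‘probe mask’ is an identity matrix whose rows act as selectors. The function names are hypothetical, and the FC layers themselves are omitted.

```python
import numpy as np

def pcrn_fc_input(rep):
    """PCRN-FC style input: mean pooling over all probes."""
    return rep.mean(axis=0)

def per_probe_inputs(rep):
    """PCRN-N-class style inputs: classifier i only sees the row of glance i."""
    mask = np.eye(rep.shape[0])                  # 0/1 probe mask
    return [mask[i] @ rep for i in range(rep.shape[0])]

rep = np.arange(6, dtype=float).reshape(3, 2)    # 3 probes, 2 features each
pooled = pcrn_fc_input(rep)
per_probe = per_probe_inputs(rep)
```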
(d) The Reinforcement Learning Scheme: we adapt the algorithm from [Willia1992] to train our framework, which allows the joint learning of sequential haptic exploration and an efficient sparse 3D representation of an object by optimizing the expected cumulative reward :
where is the policy that predicts the next action to perform given the current state . In this study, the next action is considered to be the next probe request, where the policy is defined based on the Gaussian distributions that specify all the elements of the probe . In this way, the policy is parameterized directly by the and values output by the location network. Since these values are computed directly from the PCRN, the optimization of the policy facilitates the learning of the mutual representation of the probe request and collected points sequences. At each time step, the framework tries to classify the object in the ground. It is then rewarded by if the classification is correct and by otherwise. By doing so, the point cloud representation is expected to be trained to highlight inter-point relations that are discriminatory among the available object classes. With an analogous recipe, the location network is trained to identify new points leading to a sparse yet efficient 3D representation that helps in discriminating the object. The policy gradient is defined as:
where , and is the discount factor that weights the rewards w.r.t. their time distance from the current state. In this work, the log-policy gradients for each probe request component are given by:
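For reference, the standard log-density gradients of a univariate Gaussian policy are reproduced here in generic notation. The symbols are assumptions that may differ from the notation of this paper: $a$ denotes one sampled component of the probe request, $\mu$ its predicted mean, and $\sigma$ its predicted standard deviation.

```latex
\log \pi(a \mid \mu, \sigma) = -\log \sigma - \frac{(a - \mu)^2}{2\sigma^2} - \frac{1}{2}\log(2\pi),
\qquad
\frac{\partial \log \pi}{\partial \mu} = \frac{a - \mu}{\sigma^2},
\qquad
\frac{\partial \log \pi}{\partial \sigma} = \frac{(a - \mu)^2 - \sigma^2}{\sigma^3}.
```

The second gradient follows from differentiating the first two terms of the log-density with respect to $\sigma$, which gives $-1/\sigma + (a - \mu)^2/\sigma^3$.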
To take the classification task into account in the optimization process, a categorical cross entropy loss is incorporated into the policy gradient to form a Hybrid Update Rule :
where is the learning rate, and is a weight balancing the exploration and classification tasks. denotes the set of available objects. , if is the object in the space to explore, otherwise. indicates the probability that is the object presented, according to the classifier. is the reinforcement learning baseline [Willia1992] output by the framework, as shown in the bottom-left part of Fig. 3. This term reduces the variance of the policy gradients.
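The ingredients of such a hybrid objective can be sketched numerically as below. This is a hedged illustration of a generic REINFORCE-with-baseline loss combined with a cross-entropy term, not the exact update rule of this paper: the placement of the balancing weight `lam`, the function names, and the example reward values are all assumptions.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return R_t = sum over k >= t of gamma^(k - t) * r_k for every step t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def hybrid_loss(log_probs, rewards, baselines, class_probs, true_class, gamma, lam):
    """Policy-gradient surrogate plus a weighted cross-entropy term."""
    advantages = discounted_returns(rewards, gamma) - baselines
    policy_term = -np.sum(log_probs * advantages)             # exploration part
    ce_term = -np.sum(np.log(class_probs[:, true_class]))     # classification part
    return policy_term + lam * ce_term

returns = discounted_returns([1.0, 0.0, 1.0], gamma=0.5)
loss = hybrid_loss(np.array([-1.0, -1.0, -1.0]), [1.0, 0.0, 1.0], np.zeros(3),
                   np.full((3, 4), 0.25), true_class=0, gamma=0.5, lam=1.0)
```

Subtracting the baseline from the returns leaves the expected gradient unchanged while lowering its variance, which is the role the baseline plays in the update rule above.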
3.1 Experimental setup
Similar to the setup in [Fleer2020], we constructed a 3D dataset with 4 objects that appear frequently in real-life robotic haptic exploration scenarios. The four 3D objects are shown in Fig. (a). The state-of-the-art haptic shape exploration model [Fleer2020], which is based on LSTM, was adapted to our application (simulator) by adjusting its input/output. It is denoted as ‘LSTM’ in this paper and is considered the baseline model. For fair comparisons, optimized hyper-parameters were employed. Each model was trained for 8000 steps, where each step is composed of a batch of 64 objects. Every object was randomly chosen among the four available categories with equal probability. The objects were randomly placed in the simulation, with random rotations , and , within an accessible range. Subsets of the position/orientation ranges were kept apart as a held-out test set for the evaluation [Fleer2020].
The overall results in terms of averaged accuracy across all the objects are summarized in Table 1, where the classification accuracies with different numbers of probes are reported. In this work, only 10 probes/glances were considered, as done in [Fleer2020]. As observed, both the proposed PCRN-N-class and PCRN-FC outperform the baseline at the probe. PCRN-FC is superior to the other two models in terms of classification accuracy at each probe. The results show that the proposed framework with classifiers achieves higher accuracy at the final probe by selecting only the top key 3D points with the ‘probe mask’, while the version with sequential FC layers achieves progressively better accuracy at each probe by using all the collected probes, but slightly worse performance at the last glance. The two versions could therefore be employed in different situations.
To further evaluate the performance of common 3D object recognition models under a similar haptic exploration scenario, we constructed a noisy sparse 3D object set (same 4 objects) and tested STOA 3D object recognition models on it. More specifically, each object instance was obtained by randomly sampling 10 points from the original 3D objects with a certain percentage of noise. In this study, ‘noise’ was defined as a point that does not fall on the 3D object, to mimic the real-life haptic exploration scenario. As verified in [taghanaki2020robustpointset], among the 10 tested STOA 3D recognition models, solely PointNet [qi2017pointnet] and DGCNN [wang2019dynamic] were relatively robust under the setting with noise, missing object parts, significantly sparser points, etc. Therefore, these two models were tested, and the results are reported in Fig. (b). Across the tested noise rates, the accuracy of these two models (below 0.79) is lower than that of the proposed models. Moreover, their performance drops significantly as the noise rate increases.
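The construction of one noisy sparse instance can be sketched as follows. This is an assumption-laden illustration rather than the procedure used for the reported results: noise points are drawn uniformly from a bounding box, which is only one possible way to obtain points that do not fall on the object. The function name and the `box` parameter are hypothetical.

```python
import numpy as np

def sample_noisy_sparse(points, n_samples, noise_rate, rng, box=1.0):
    """Draw n_samples points, a noise_rate fraction of which are off-object."""
    n_noise = int(round(noise_rate * n_samples))
    idx = rng.choice(len(points), size=n_samples - n_noise, replace=False)
    clean = points[idx]                                   # points on the object
    noise = rng.uniform(-box, box, size=(n_noise, 3))     # assumed noise model
    return np.concatenate([clean, noise], axis=0)

rng = np.random.default_rng(0)
dense_object = rng.normal(size=(100, 3))   # placeholder for a dense 3D object
instance = sample_noisy_sparse(dense_object, n_samples=10, noise_rate=0.3, rng=rng)
```

With 10 samples and a 30% noise rate, the instance holds 7 points taken from the object followed by 3 noise points.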
In this study, we propose a novel reinforcement learning framework that endows a robot with the haptic glance capability, i.e., the ability to conduct 3D object recognition with a sparse yet efficient representation. According to the experimental results, existing 3D object recognition models fail to perform decently for haptic exploration with noisy and sparse 3D data. Conversely, our models achieve decent accuracy and surpass the state-of-the-art haptic exploration model.