Curved patch mapping and tracking for irregular terrain modeling: Application to bipedal robot foot placement

04/05/2020 ∙ by Dimitrios Kanoulas, et al. ∙ Istituto Italiano di Tecnologia NASA 0

Legged robots need to make contact with irregular surfaces, when operating in unstructured natural terrains. Representing and perceiving these areas to reason about potential contact between a robot and its surrounding environment, is still largely an open problem. This paper introduces a new framework to model and map local rough terrain surfaces, for tasks such as bipedal robot foot placement. The system operates in real-time, on data from an RGB-D and an IMU sensor. We introduce a set of parametrized patch models and an algorithm to fit them in the environment. Potential contacts are identified as bounded curved patches of approximately the same size as the robot's foot sole. This includes sparse seed point sampling, point cloud neighborhood search, and patch fitting and validation. We also present a mapping and tracking system, where patches are maintained in a local spatial map around the robot as it moves. A bio-inspired sampling algorithm is introduced for finding salient contacts. We include a dense volumetric fusion layer for spatiotemporally tracking, using multiple depth data to reconstruct a local point cloud. We present experimental results on a mini-biped robot that performs foot placements on rocks, implementing a 3D foothold perception system, that uses the developed patch mapping and tracking framework.



There are no comments yet.


page 2

page 4

page 6

page 7

page 10

page 11

page 15

page 16

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Over the past decades, bipedal robots have gained the capability to operate and locomote in well-structured environments. Still, their reliability is not guaranteed for real-world unstructured trails, where the legged robots need to contact uneven surfaces, under significant uncertainty. The profound impact that the Fukushima Daiichi nuclear disaster had on robotics in 2011, highlighted the need for robots to replace humans in hazardous tasks, such as climbing and searching rubble piles after a disaster [KS09] or operating in human-traversable rough terrain. The use of 3D visual perception can be crucial for completing tasks that require bipedal robots locomotion in uneven 3D terrain.

Fig. 1: Overview of the total curved patch mapping and tracking system, which splits in two parts: perception (top) and motion control (bottom).

Most of the current work in bipedal control, planning, and perception has focused on flat surfaces locomotion [CLCKHK05, MCKK2005, NCK12], where the environment is mostly known or well structured. Only recently, the work has been extended to outdoors rough terrain environments, where bipedal robots are gaining capability mostly without using 3D perception. The uncertainty is then tolerated either by low-level feedback control for blindly landing footholds [RBNP08, Wiedebach2016] or by using tactile or proprioceptive sensing. Thus, the problem of 3D perception for unmodeled sparse curved terrain is still largely open. In this work, we focus primarily on the perception part of bipedal locomotion for rough terrain foothold placement, using both exteroceptive and proprioceptive sensing, where non-point feet make contact with patches of the terrain. Notice, that quadrupedal/hexapedal locomotion, uses point-like contact modeling, since the feet are usually spherical. That differs from bipeds, which may need to have multiple (continuous or discrete) contact points with the environment.

This paper, introduces a novel method to detect and model sparse areas for potential contact, using curved patches that are spatially mapped in real-time around the robot. The system is divided into two parts (Fig. 1). In the perception part, the sparse surface modeling takes place. For this, we introduce: (1) a model to represent local contact surface areas of the same scale as surfaces on robot’s foot sole, (2) a fast method to find these areas, considering uncertainty in the data, and (3) a localization and mapping system of the detected contact areas, while the robot is moving. The second part, implements a simple control system, that is based on predefined motion sequences, which are matched to the detected patches around the robot. Our planning and control are basic, as we are focused mostly on perception; many other researchers have worked on motion for this type of task [MLB13, FMDWAMT15].

The main contributions of this paper are as follows. First, a new sparse contact surface representation is introduced, using a set of bounded curved patches that can model regions both in the environment and on the robot. A fast algorithm is developed, to fit these patches to 3D surface point cloud data, with quantified uncertainty both in the input points and in the output patch. Moreover, a residual, coverage, and curvature validation method takes place, to evaluate the quality and fidelity of the fitted patches. Last, a bio-inspired method is introduced, to sample salient contact patches in the environment, statistically similar to those selected by humans, when hiking in rough rocky terrain. The developed algorithms are part of a real-time mapping and tracking system, for patches around the robot, which are potentially good for contact. The framework is experimentally validated on the newly developed mini-biped robot, called Rapid Prototyped Biped (RPBP), for foot placement contacts on rocky surfaces.

Some earlier results of this work were presented in prior conference papers [VK11, KV13, KV14]. This paper, is the first presentation of the work as a whole, including also the patch tracking system (Sec. 5) and the hardware experiments with a bipedal robot (Sec. 6).

Next, we cover related work (Sec. 1.1) followed by a description of the input perceptual data (Sec. 2) and the sparse surface modeling using bounded curved contact patches (Sec. 3). We then, introduce the patch mapping (Sec. 4) and tracking (Sec. 5) system. Finally, we present experimental results (Sec. 6), including the task of foot placement on rocks, using a newly developed mini-biped robot, with a range camera and an IMU. The perception algorithms are developed and released as the Surface Patch Library (SPL) [SPL14].

1.1 Related Work

Bipedal locomotion advanced a lot during the DARPA Robotics Challenge (DRC) in 2015 [DRC-what-happened]. Prior to the DRC, only few bipedal robots were using on-line foot placement methods, based on visual/range sensing, especially for rough or uneven terrain.

Most of the early works, focused on flat indoor environments and usually were not real-time systems. RGB data were originally used in [CLCKHK05, MCKK2005] for 2D path planning, applied on the ASIMO humanoid robot. Similarly, in [GHB11, HDLB12], range sensing was used for a 2D graph-based footstep planner, implemented on the mini-bipedal robot NAO. Real-time obstacle re-planning methods, for dynamically changing flat 2D terrains, were also introduced in [KB16], tested on the NAO and HRP-4 robots. Climbing stairs/obstacles, was also one of the primary use-cases for bipedal systems. The full-size humanoid robot HRP-2, used stereo data for extracting planar surfaces to navigate on flat terrains and climb horizontal stairs [OII03, OOHI05]. The humanoid robot QRIO, achieved climbing on indoor sloping and elevated terrain, using range data [GFF08]. Similar concepts were studied in [CTSNKK09, NCK12], using 3D laser sensing. 3D perception for footstep planning was also used in [MLB13, UNOI2014], to allow a NAO robot navigate on surfaces with obstacles, either by avoiding or stepping on them. A multi-contact method was introduced in simulation for HRP-2 [BVKEK13], using point cloud data for planar contact reasoning. In [RGMSHS13, KMK13], point cloud data were used to avoid harsh foot impacts on a simulated HRP-2, using a KinectFusion-based [NIHMKDKSHF11] mapping system.

Exteroceptive-based perception (vision and range sensing) methods were developed during the DRC 2015, for footstep planning on flat uneven outdoors terrain. In [DT2014, FMDWAMT15]

, a real-time optimization-based state estimation system was introduced for the ATLAS humanoids robot, where stereo fusion was used for planning locomotion on rough planar terrain, by extracting planar contact surfaces for navigation. In a different direction, a graph-based footstep planner was used in 

[Kohlbrecher14, SKSC2016, KOB2016], using LiDAR/range environment data. Human supervision [SKCS14] was mostly necessary for adaptations in the foothold poses. Most of the aforementioned methods, localize footholds with full-surface contact between a flat surface and the robot foot sole. This is done by segmenting planes in the environment. In this work, we differ by modeling and localizing sparse curved patches for contact (potentially partial) between the legged robot and the environment. Notice, that Supraped, as introduced in [KC14], assumes only a single foot contact point with the terrain and was tested only in simulation, without the use of vision.

On-line footstep planning for quadrupeds and hexapods has also a significant history in legged robot locomotion. Even though they differ from bipedal research, since usually multi-legged platforms with spherical feet assume point-like contacts with the environment, compared to bipeds, where area-contact surfaces are required. Originally, many multi-legged robots used visual odometry and perception for obstacle avoidance and traversability analysis [WMBHRR10, SHG12], but not for planning 3D foot placement. Over the past decade, on-line perception systems for foothold detection were also introduced. For instance, dense surface models were used in [PMPKRB09, KKN09, KBPMS10] for LittleDog locomotion, while a local decision surface was used in [BPP10]

for a hexapodal robot. Moreover, perception was combined with supervised learning for footstep placement 

[KBPMS10], while visual SLAM systems, such as the Parallel Tracking and Mapping (PTAM), were used with elevation maps during locomotion [BS12]. These systems were not usually considering uncertainty in the foothold contact areas, and in some cases were not real-time. More recently, three impressive systems were introduced in [WFDHKS15, MHWCS15], and [KB2017], where the quadruped robots StarlETH and HyQ, and the mobile/legged hybrid quadruped robot Momaro, correspondingly, used real-time updated elevation maps for footstep planning.

Both exteroceptive and proprioceptive perception were used in legged robots to detect and localize footholds, either from a distance or during contact. Systems such as [CPP99] used only proprioception, while others such as [MHB12, BVKEK13], that were described above, used range sensing to drive high-level actions. Data fusion was used recently to deal with sensor noise and uncertainty, as in [WFDHKS15]. 3D range uncertainty plays an important role for foothold placement evaluation. Various methods were used in the literature to represent uncertainty, including both Gaussian [FAF86] and non-Gaussian [MS87] noise modeling for 3D stereo measurements. Recently, an uncertainty model for the Kinect sensor was introduced in [KE12], while a mixture of Gaussians was used in [DVX13]. In this work, we will consider 3D Gaussian noise for the input point cloud data, since it represents fairly good the point cloud uncertainty that is introduced from our range sensors (more details in Sec. 2).

A main contribution of this work is the introduction of a new environment model, for localizing potential contact surfaces for bipedal robots, using uncertain range sensing. Our approach is distinct from range image segmentation [HBJFBGBEFF96, PBJB98], in which the environment is partitioned into non-overlapping regions—our patches can overlap, but may enable better modeling without having to consider foot contact against more than a region at a time. In legged locomotion, usually dense approaches are used, such as grid or elevation maps [MLB13]. In these approaches, the planner evaluates the whole terrain to find areas for contact, which is often computationally expensive. There are also sparse approaches, but usually planar patches are searched for [FMDWAMT15], without considering uncertainty.

Another main result in our work, is an algorithm to fit bounded curved patches to noisy point cloud data. Even though plane fitting was studied extensively in the past, including uncertainty [WTHNY01, Kanatani05] for heteroskedastic range data [PVB09], quadrics can be a better option for natural rough terrain local contact modeling. Quadric fitting was studied in [Petitjean02, DNC07], but we are the first to combine it with quantified uncertainty, using geometrically meaningful parametrization (vs algebraic parametrization) and bounded patches.

To our knowledge, the idea we present in this paper, of mapping sparse curved patches around a biped for locomotion, was not studied before. Some early papers, used SLAM algorithms to map flat surfaces [MDR04, FMDWAMT15], where either the whole environment is segmented into planes or a local area around the robot. In our approach, we map a sparse set of patches, that have approximately the size of potential contact surfaces on the robot. For the mapping, we customized a version of the Moving Volume KinectFusion algorithm [RV12], that builds a dense 3D volumetric map of the environment around the robot, while simultaneously it is tracking the pose of the range sensor in it.

2 Input data: range and IMU sensing

Both exteroceptive and proprioceptive perception are essential for foothold placement, during legged robot locomotion in a real world environment. From the various sensors that have been introduced, e.g. in [Everett95, SNS11], we use range sensing to detect upcoming 3D contacts and inertial measurement unit (IMU) sensing to acquire the robot’s orientation relative to gravity. Next, we cover the data acquisition, including range data noise analysis and the IMU calibration process.

Fig. 2: (a) A Microsoft Kinect (top) and a Primesense Carmine 1.09 (bottom) RGB-D sensor, embedded with a CH Robotics UM6 9 -DoF IMU sensor; (b) a dense point cloud from a Microsoft Kinect RGB-D camera; (c) the stereo error modeling, visualizing the probability error ellipsoid (pointing error exaggerated for illustration); (d) the IMU calibration, with the gravity vector (yellow) and the ground plane normal (magenta) before (top) and after (bottom) calibration, which minimizes the angle between them.

2.1 Range Sensing

We focus on organized point cloud data [RC11] in the form of an image grid, acquired from a single point of view, either taken directly from a depth camera ( clouds in , either from a Microsoft Kinect or a Primesense Carmine 1.09, depending on the application’s range limits, as appears in Fig. 2-(a,b)) or indirectly (Sec. 5.1) from a volumetric map, where depth images are fused over space and time. The 3D point coordinates in the camera frame, where the -axis points to the right and the -axis down in the camera image, can be expressed as a function of the coordinates of the measurement ray direction vector through pixel and the range of the point along that vector as (Fig. 2-(c)):


The acquired point cloud data uncertainty can be either due to sensor inaccuracies or due to triangulation errors (the correspondence problem [SS02]

) and can be split into two categories: outliers and noisy inliers. Detecting and removing outliers is extensively studied and thus a set of real-time pre-processing filters, such as discontinuity-preserving bilateral filter, are applied first. Furthermore, as we will see in Sec. 

5, KinectFusion inherently ignores some outliers, when fusing data over time.

There are various ways to quantify the uncertainty of noisy inlier points, i.e. points that deviate from the underlying ground truth surface. In [VK11], we express the 3D point uncertainty using Gaussian modeling, with covariance matrices. To estimate these matrices, we experimented with constant, linear, and quadratic error model assumptions, but we preferred Murray and Little’s [ML05]

two-parameter error model for stereo disparity uncertainty, which experimentally appears to follow better the distribution of the input point cloud data. The error model is represented by two non-negative parameters: the variance of the pointing error of the measurement vectors

and the disparity matching error , both measured in pixels. The covariance matrix for a 3D point in physical units is


where is the stereo baseline (in physical units), is the disparity (in pixels), are the image pixel coordinates, and the depth camera focal length (in pixels). The error model parameters we used for the Kinect are: px, px; the former is from [KM10], the latter we determined experimentally following [ML05]. As can been also seen in Fig. 2-(c), range data exhibit heteroskedasticity (non-uniform variance)—typically, there is much more uncertainty in range than aim [PVB09, ML05] and the variance changes with range. Since the measurement rays usually have a single center of projection, the error ellipsoids for the sampled points are not co-oriented, because each is elongated in the direction of its own measurement ray.

Fig. 3: (a) Examples of eight (convex) paraboloid patch types, including the local coordinate frames and the concave variants shown inset. (b) The intrinsic and extrinsic parameters for the elliptic paraboloid and the circular plane patches.

2.2 Inertial Measurement Unit (IMU)

The use of proprioceptive Inertial Measurement Units (IMUs) for sensing the direction of gravity, is very useful for locomotion. Using a CH Robotics UM6 -DoF IMU, mounted on the top of our range sensors (Fig. 2-(a)), we receive IMU data, spatiotemporally coregistered with the RGB-D data, received from the depth sensor. Temporal registration of the RGB, depth, and IMU datastreams is based on timestamps. Spatial registration of the RGB and depth data is based on manufacturer hardware calibration and image warping, implemented in the hardware driver.

Our depth camera and IMU are rigidly attached together. Precise spatial registration of the depth and IMU data uses a calibration algorithm, that we implemented to calculate the rotation of the UM6 relative to the range sensor, from a dataset of depth images of a flat horizontal surface (a flat floor) and the corresponding UM6 orientation data. The gravity vector for the UM6 is calculated directly from its orientation data. We pair each gravity vector with the corresponding one in the depth camera coordinate frame, estimated as the downward facing normal of the dominant plane. For all these pairs of gravity vectors we solve the orthogonal Procrustes problem [ELF97], that gives the best-fit IMU-to-depth-camera rotation (Fig. 2-(d)).

3 Surface Modeling using Curved Patches

Modeling the shape and pose of contacts for locomotion on an unstructured trail, is still an open problem in robot vision. Inspired by natural systems like humans, we introduce a set of patch models, to sparsely model potential contacts. We first review the patch modeling, fitting, and validation process that we previously introduced in [VK11, KV13, KV14].

3.1 Patch Modeling

surface bound parameters DoF curvature/boundary
intrin. extrin. parameters
elliptic parab. ellipse 10
hyperbolic parab. ellipse 10
cylindric parab. aa rect 9
circular parab. circle 7 ,
plane ellipse 8
circle 6 ,
aa rect 8
c quad 11
sphere circle 7 ,
circular cylinder aa rect 9
TABLE I: The 10 bounded curved patch types.

We introduce a set of ten curved/flat patch types (Table I; eight of them are shown in Fig. 3-(a)), suitable to model rough surfaces and to balance expressiveness with compactness of representation: eight are first and second-degree polynomials—paraboloids and planes. Two are non-paraboloids – spherical and cylindrical – to better approximate smooth surfaces, particularly on man-made structures (e.g. pipes and railings). We pair each surface type with a specific boundary curve to capture useful symmetries and asymmetries. This particular system is not the only possible taxonomy, but it reflects our design choices in an attempt to balance expressiveness vs minimality.

3.1.1 Extrinsic and Intrinsic Surface Parameters

Fig. 4: (a) Automatic fits in red for all patch types, with simulated noisy range samples in blue; the “side-wall” effect reparameterization, shown inset (see [KV13]). (b) A set of 21 patches, manually chosen and automatically fit to the samples taken from a rock model, using an RGB-D Kinect sensor.

We parametrize a patch with a vector of real parameters that describe the shape (curvature and boundary intrinsic

parameters) and 3D rigid-body pose (

extrinsic parameters). The parametrization should be done in a way that it is minimal and also enables separate uncertainty modeling for the intrinsic and extrinsic parameters, considering the continuous symmetry class of the patch [Srinivasan03]. Grassia showed in [Gra98] that a pose representation that is beneficial for iterative optimization methods (as our patch fitting described in Sec. 3.2) is:


where is a translation and is an orientation vector, giving an element of via an exponential map. We use Rodrigues’ rotation formula for the exponential map (Grassia used quaternions):


Algebraically, corresponds to a homogeneous rigid body transformation of :

The pose of a patch can be thus defined as the pose of a local coordinate frame relative to a world frame , where is a basis for and is its origin. We define the familiar functions that transform a point in to/from the corresponding in :


where (6) makes use of the inverse transform


The 6 DoF patch pose parametrization is defined from Eqs. (3)–(6). For the 5 DoF pose representation in the case of rotationally symmetric patches, we fix at 0 in (3)–(6):


We now, present the details of the elliptic and hyperbolic paraboloid patch models, which are two out of the ten proposed surfaces. Due to space constraints we must refer readers to [VK11] for further details on: (1) the intrinsic and extrinsic parametrization in the case of symmetries, (2) the methodology to avoid singularities, by a fast dynamic reparameterization, and (3) the circular cylindrical and spherical patch models.

3.1.2 Paraboloids

The second degree polynomial, that locally fits a smooth surface at a given point , is called a paraboloid. It is an one-sheet quadric, with a central point of symmetry () and two independent curvatures (principal curvatures of at ) in orthogonal directions (Fig. 3-(b)).

We define and to be the unit vectors in the directions of the principal curvatures in the tangent plane to at and the surface normal to at . In a world coordinate frame , is the origin of and is its local frame at  [HS02]. A general paraboloid can be parametrized by , , and , while with the use of the log map, i.e. , points can be transformed from to .

The implicit and explicit forms for the paraboloid in standard position in are:


where is a point on the patch in and parameters of the explicit form. Composing (9), (10) with (5), (6), one can describe in world frame :


where is the projection of onto the local frame plane:



We bound elliptic and hyperbolic paraboloid patches with ellipses in the plane of the local frame , axis aligned and centered at (Fig. 3). Defining the ellipse radii , the bounded patch is the subset of the full surface (9)–(12) where and satisfies:


Our boundary definitions for the cylindric and circular paraboloids, the rectangle, ellipse, circle, and convex quad bounded planes, as well as the spherical and circular cylindrical bounded patch models can be found in [VK11].

3.2 Patch Fitting

Fig. 5: (a) Residual evaluation, with an example of the Euclidean distance between a point and its projection on the patch and a plot with 1000 residuals of patches fitted randomly on the rock model in (Fig. 4-(b)). (b) Coverage evaluation, for all patch types in simulated data.

Having defined the patch models, we now present our method to fit patches (Fig. 4), by solving the non-linear maximum likelihood problem when: (1) data points are corrupted by heteroskedastic (nonuniform variance) Gaussian noise and (2) bounded paraboloid patches need to be fit, not just unconstrained general quadrics.

In [VK11]

we give a non-linear fitting algorithm, which is based on a variation of Levenberg–Marquardt iteration, called Weighted Levenberg–Marquardt (WLM). WLM minimizes a sum-of-squares residual, by optimizing the patch implicit and explicit parameters. The residual for an individual sample point is computed by scaling the value of the implicit form by the inverse of a first-order estimate of its standard deviation, which is derived in turn from a covariance matrix, modeling the sensor uncertainty for the point. Due to space constraints we must refer readers to the original paper for the full description of the WLM fitting process. Here we restrict to the case of elliptic and hyperbolic paraboloid fitting; the full algorithm for all patch types is given in 


The inputs of the algorithm are: (1) sample points , with covariance matrices (positive semi-definite), (2) the general surface type to fit {parab, plane, sphere, cylinder}, (3) the boundary type {ellipse, circle, rectangle, quadrilateral}, if , and (4) a boundary containment probability . The outputs are: (1) the combined vector of patch intrinsic and extrinsic parameters , where is the DoF of the patch type, and (2) the parameter covariance matrix , modeling the uncertainty of the patch.

For elliptic and hyperbolic paraboloids bounded by ellipses, and . gives the ellipse boundary radii; gives the patch curvatures, gives the patch orientation in space as an exponential map vector, and gives the location of the patch apex in space. are the intrinsic and the extrinsic parameters of the patch.

The algorithm proceeds in 2 stages, which include heuristics for avoiding local minima, when solving the non-linear system. The first three steps fit an unbounded surface; the rest are largely concerned with fitting the boundaries, which can include final resolution of the patch center and orientation, where the bounding shape breaks symmetries of the underlying surface. Principal Component Analysis (PCA) is applied to the 2D projected points onto the local

-plane, for finding the enclosing boundary. A closed-form solution [RVC02]

for the corresponding eigendecomposition problem using 1D and 2D moments to fit approximate boundaries is used.

3.2.1 Paraboloid Patch Fitting Algorithm

Fit Patch Surface

Step 1: Plane fitting

Fit a plane with linear least-squares, ignoring and set (perpendicular projection of on plane).

Step 2: Surface Fitting

With , , and from step 1 as initial estimates fit an unbounded paraboloid using WLM on Eq. (11).

Step 3: Curvature Discrimination

Refine the patch based on the fitted curvatures . The patch is an elliptic paraboloid if and only if , otherwise change the type of the patch and the fitting process as described in [VK11].

Fit Patch Boundary

Step 4: Initialize Bounding Parameters

Set for boundary containment scaling [Ribeiro04]. Project the data to using (13).

Set first and second data moments: , , , .

Step 5: Ellipse Boundary Fitting

Set .

The full algorithm in [VK11], has steps to deal with all our curved bounded patch types.

3.3 Patch Validation

Evaluating if a patch faithfully represents the surface after the fitting process, is a necessary task. For example, the iterative WLM part of the fitting algorithm may diverge or get stuck in a local minimum, or the underlying data may be cluttered and not cover a major part of the patch. We review three measures that we introduced in [KV13], based on: (1) residual and curvature, to evaluate the patch surface shape and (2) coverage, to evaluate its boundary (Figs. 56). Further experimental results in the fitting and validation are presented in Sec. 6.

3.3.1 Residual Evaluation

The patch residual (Fig. 5-(a)), measures the deviation between the sample points and the unbounded patch surface. The deviation is a result of noise in the data and/or local minima during the WLM fitting process. The residual is the Euclidean root-mean-square error (RMSE [PKPSTV93]) , between the sample points and their corresponding closest points on the patch:


where is the total number of points. Calculating the for each , is a computationally expensive problem and is solved either using Lagrange multipliers [Eberly1999] or an approximation such as Taubin’s [Taubin91].

The residual threshold , such that patches with are dropped, was determined experimentally, by fitting random patches of m radius on a rock dataset [KV13] and sorting them with respect to their residuals. The value m was selected to include approximately of the patches. This is not the only way to define , which could be a task dependent threshold. For instance, could be used to check if any small surface bumps protrude more than a desired amount. Fig. 6 (patches in purple), illustrates patches that pass and fail the residual evaluation.

3.3.2 Coverage Evaluation

An evaluation that takes into account the patch boundary is also needed to detect the cases that too many sample points are outside the patch boundary or too many areas in the boundary are not supported by data. In these cases, the patch may not faithfully represent the data. We handle this, by generating an axis-aligned cell grid of fixed pitch on the plane of the patch local frame (Fig. 5-(b) illustrates 2D-projected patches and cells that pass and fail the coverage evaluation).

Let (resp. ) be the number of data points, whose projections is in cell and inside (resp. outside) the projected patch boundary and the area of the geometric intersection of the cell and the projected patch boundary. We consider a cell to be bad if and only if


for thresholds and . We fix these thresholds relative to the expected number of samples in the ideal scenario, where sample points are evenly distributed inside the patch boundary:


where is the total number of sample points and is the area of the patch, approximated as the area inside the projected boundary. A patch is bad if and only if there are more than bad cells. We determine experimentally the threshold values, after fitting several patches of m radius to m, , , and . Fig. 6 (patches in green), illustrates patches that pass and fail the coverage evaluation.

3.3.3 Curvature Evaluation

There may be cases, where residual and coverage may pass, but the bounded patch does not represent the data correctly. This may happen either when the point samples are acquired from very curved surfaces or when the LM non-linear fitting gets stuck in local minima. For this reason a patch will be marked bad if and only if its minimum curvature is smaller than a threshold or its maximum curvature is bigger than a threshold . Like previously, we set these thresholds experimentally; for details see [KV13]. Fig. 6 (patches in yellow), illustrates patches that pass and fail the curvature evaluation.

Fig. 6: Red patches that passed the validations and yellow/purple/green patches that failed curvature/residual/coverage evaluation, respectively.

4 Spatial Patch Mapping

In the previous sections, we described the framework for: (1) modeling contact patches between a robot and local surfaces in the environment and (2) fitting them to 3D point cloud data. We now, describe a five-stage algorithm to find and spatially map potentially useful patches. Each stage is described as a batch operation, but the last three stages may also work in such a way that patches are added to the map incrementally until a time or space limit is met. The patch mapping is dovetailed with the patch tracking, introduced in the next section, for maintaining a spatially and temporally coherent map of nearby patches, as the robot moves through the environment:

  1. Acquire RGB-D and IMU Data (Sec. 4.1).

  2. Preprocess the Input Point Cloud (Sec. 4.2).

  3. Select Seed Points on the Surface (Sec. 4.3).

  4. Find Seed Point Neighborhoods (Sec. 4.4).

  5. Fit Patches to the Neighborhoods (Sec. 4.5).

A homogeneous spatial patch map (Sec. 4.6) is built, as follows. First, RGB-D and IMU data are acquired, as explained in Sec. 2 (Stage i), followed by data preprocessing, such as saliency filtering or decimation, depending on the application (Stage ii). Then, a set of seed points (Stage iii) and their -size neighborhoods (Stage iv) are found for fitting and validating patches (Stage v). We fix to match the size of the intended contact surface on the robot, such as the sole of the foot.

We first introduce the local volumetric workspace (or simply the volume), which will be the moving local working space around the robot.

Local Volumetric Workspace

For a moving robot in the environment, 3D point cloud and IMU data are acquired at relatively high frame rates (typically 30–100 Hz). Information fusion is desirable not only to remove redundancies over time, but also to have a terrain representation of areas that are sometimes occluded or obstructed by the robot itself. Only the area around the robot is required for local 3D fine contact planning—longer-range perception could be used separately for coarse trajectory planning and obstacle avoidance.

Thus, it is natural here to consider only the data in a moving volume around the robot. We define such a volume, as a cube with a volume coordinate frame at a top corner (Fig. 8-(ii)). We align its -axis point with the gravity vector, derived from the IMU data, and we let the and -axes point along the cube edges, forming a right-handed frame. In this way our local volumetric workspace is the same as the Truncated Signed Distance Function (TSDF) volume in moving volume KinectFusion [RV12]. At any time , the volume is fully described by: (1) its size (a predefined constant), and (2) its pose relative to the camera, with the following rigid body transformation from the camera to the volume frame:


where is a rotation matrix and a translation vector.

The volume pose relative to the environment, roughly follows along with the robot. Our strategy is to move it periodically, whenever it deviates from a nominal relative pose to the camera more than a distance or an angle threshold (Sec. 5).

4.1 Input Data Acquisition

In the first stage of patch mapping, RGB-D and IMU data are acquired, as described in Sec. 2:

  1. Acquire RGB-D and IMU Data

    1. Receive image from the depth camera and absolute orientation quaternion from the IMU. The depth camera may either be a physical sensor, like Kinect, returning images, or a virtual camera (implemented by raycasting in the TSDF) in the context of KinectFusion (Sec. 5), which typically has a lower resolution, e.g. . In the later case, the virtual camera may also have a different pose in the volume than the physical amera (e.g. synthetic birds-eye view).

    2. Convert to an organized point cloud in camera frame and to a unit gravity vector pointing down in camera frame.

4.2 Point Cloud Preprocessing

Fig. 7: (a) (i) Dense RGB-D and IMU data, for an outdoor rock; (ii) salient points in red, after applying DoN and DoNG filtering; (iii) illustration of the DtFP measure, for a m away fixation point; (iv) illustration of the DoNG and DoN measures, in the two embedded images. (b) Human selected patches, in RGB-D recordings.

Point cloud preprocessing usually depends on the task application requirements. In this paper, we focus on bipedal locomotion on rough terrain. Notice, that we restrict our filtering to have real-time performance (i.e. 30 Hz) and when points are required to be filtered out, we replace them with NaN values to maintain their organization (used later in optimized methods, e.g. for fast neighborhood searching).

  1. Preprocess the Input Point Cloud

    1. Remove “background” points, either by using a filter to threshold the -coordinate values in the camera frame or by setting the volume size appropriately and keeping only the points in it.

    2. Apply a discontinuity-preserving bilateral filter to to reduce outliers [PF06].

    3. Optionally, in the case that comes from a physical depth camera, downsample with a median filter, to create an auxiliary point cloud . If data from step 1 were acquired from a virtual camera, then .

    4. Create a new point cloud , by applying the hiking salience filter, on point cloud (Sec. 4.2.1).

Both the and point clouds are kept, since they are used in different steps of the algorithm. is used for neighborhood searching and for seed selection. Now, we describe the hiking saliency filter, that is applied in Step 6.

4.2.1 Hiking Saliency Filter

Fig. 8: The patch mapping system overview. (i) Point cloud raycasting, from the virtual birds-eye camera (in purple) and the real camera (cyan); (ii) the volume -plane, divided in an grid; (iii) one seed per cell, selected randomly, (iv) example of 10 neighborhoods of m radius, using the k-d tree and the triangle mesh structures; (v) patches, randomly fit to salient seeds.

In [KV14], we describe a real-time bio-inspired system for automatically finding and fitting salient patches, attempting to balance patch quality with sufficient sampling of the appropriate parts of upcoming terrain. Saliency has been extensively used in computer graphics (e.g. [LVJ05]), to describe perceptually important surfaces. We introduce three new saliency measures (Fig. 7-(a)), that involve aspects of patch orientation and location: Difference of Normals (DoN), Difference of Normal-Gravity (DoNG), and Distance to Fixation Point (DtFP). They relate to patches that humans commonly select as footholds, using either bio-mechanical properties, such as that humans tend to fixate about two steps ahead when they are walking on rough terrains, or statistical similar patches that we observed humans to select when hiking (Fig. 7-(b)).

Difference of Normals (DoN): This operator [LG12], relates to the irregularity of a surface at a point, by measuring the angle between the normals of fine vs coarse scale neighborhoods. Points with low DoN can be considered more salient for footfall selection. We used the foot size , for the coarse scale and , for the fine.

Difference of Normal-Gravity (DoNG): This operator, relates to the slope of a surface, by measuring the angle between the -neighborhood normal vector of each point and the reverse of the gravity vector , acquired by the IMU. Points with low DoNG can be considered more salient for footfall selection:

Distance to Fixation Point (DtFP): This operator, relates to the biomechanical property of humans to fixate approximately two steps ahead, when locomoting in rough terrain [Marigold08]. Such a fixation point is estimated and we consider points that are closer to it, as more salient.

These measures are useful to quickly identify good seed points, before fitting, in the following way:

  1. Preprocess the Input Point Cloud

    1. (DtFP) Using the properties that: points down and points right in camera frame, estimate the fixation point in camera frame

      where , are the distances down and forward, from the camera to the estimated fixation point.

    2. (DtFP) Initialize , as all the points in within an -ball region of interest of .

    3. (DoN/DoNG) Compute surface normals corresponding to , using integral images [HRDGN12]. The normal uses window size , where is the coordinate (depth) of point in camera frame and is the focal length of the depth camera in pixels; uses the half window size.

    4. (DoN) Remove from all points , for which

      where is the DoN angle thresholds, estimated from human-selected patches.

    5. (DoNG) Remove from all points for which

      where is the DoNG angle thresholds estimated from human-selected patches.

Estimating principle curvatures in real-time, on raw point cloud data, appears to be relatively inefficient. For this reason we do not apply curvature validation (Sec. 3.3.3) as a saliency measure in the pre-processing stage. In contrary, we apply it as a post-processing filtering, on patches that were fit and their curvatures, thus, was determined (Sec. 4.5).

4.3 Seed Selection

Selection of seed points, around which patches will be fit, is an important aspect of our mapping approach. We uniformly sample the candidate seeds , relative to a coarse grid, imposed on the (horizontal) -plane in volume frame. We first split the volume -plane into grid cells (we typically use , see Fig. 8-(iii)). Then, for each cell, we randomly pick up to points, for a total of seed points. If there are real-time constraints and only a subset of the seeds is needed, we sample first from the cells whose distance from the projected camera position onto the -plane is smaller (i.e. closer to the robot), until a time limit is reached. Moreover, we ignore any new seed points for a cell that already has at least patches, fitted to seeds within it (some of these may be already-existing patches as we describe next).

Notice, that when the volume moves in the physical space, the seeds need to be remapped to new cells. Since this process may move some prior seeds or patches out of the volume, we remove them also from the map, as described in Sec. 5. If more than patches are remapped into a cell, the extra patches can be culled, if desired. Fig. 8-(iii) illustrates seeds, with the volume divided into an grid (), with one seed point per grid () requested. The seed selection proceeds as follows:

  1. Select Seed Points on the Surface

    1. Split the volume -plane into grid cells.

    2. Project each point in , onto the -plane and find the cell it falls in.

    3. Project the camera location on the -plane and order the cells in increasing distance of their center to the projected camera point.

    4. For each grid cell in the order of increasing distance from the camera, randomly select new seed points from , until at most seeds are associated to the cell, while at most seeds are selected in total.

4.4 Neighborhood Searching

After the selection of seed points, their neighborhood in the original point cloud is found, to provide the local point data to which new patches will be fit. Many methods have been introduced for finding the local 3D points within distance (e.g. foot size) from a seed, for some distance metric. We have experimentally validated three search methods using: (1) K-D trees [Bentley75], (2) triangle mesh datastructures, and (3) back-projections on the image plane [RC11]. In [KV13], we evaluate the performance between the first two in terms of time complexity and their abilities to handle depth discontinuities; two examples are shown in Fig. 8-(iv).

We decided to use the image plane back-projection method, since it is faster than the other two methods and no extra data-structure is required. Given the 3D neighborhood-sphere around the seed point and the camera parameters, we can simply backproject it, as a circle in the image plane, centered at the seed’s pixel. The bounding square of pixels that the circle covers, can be easily extracted. For each one of these pixels, the Euclidean distance of the corresponding 3D points (if any) to the seed point is checked, to see if it is contained to the -sphere. For a given seed point, this search is linear in the number of checked pixels, which is . The neighborhood finding algorithm proceeds as follows:

  1. Find Neighborhoods of Seed Points

    1. For each seed point , find the -ball neighborhood in the organized point cloud , using the image plane backprojection method.

    2. For each neighborhood, keep uniformly distributed points. The parameter is selected depending on the processor speed and the required real-time performance (e.g. target number of patches to fit per second).

4.5 Patch Modeling and Fitting

For each point cloud neighborhood found around the seed points, we proceed with patch fitting and validation, as described in Sec. 3. After the patch fitting, the principle curvature can be used in a few different ways, depending on the shape of the robot foot sole. For instance, for a flat footed robot, concave regions, with more than slightly positive , could be considered less salient, because the foot cannot fully fit there. A robot with spherical foot sole, might prefer areas that are not too convex (as the foot would only make contact at a tangent point), but also not too concave to fit. Fig. 8-(v) illustrates patches that were randomly fitted to salient seed points. The patch fitting and validation proceeds as follows:

  1. Patch Fitting & Validation

    1. Fit a patch to each neighborhood (Sec. 3.2).

    2. Discard patches with Euclidean residual greater than (Sec. 3.3.1).

    3. Discard patches with areas not sufficiently supported by data, after applying the coverage evaluation (Sec. 3.3.2).

    4. Discard patches with min principal curvature less than or max principal curvature greater than (Sec. 3.3.3). This step could be adjusted depending on the application.

We estimated the following parameter values from observations of human selected patches [KV14]: , , .

Termination Criteria

For each data frame, one can keep updating the map with new patches until a time or number-of-patches limit (or any other task-specific criteria) has been reached. Usually we specify initially a desired fraction , of the total sampled surface area that should probabilistically be covered by patches. If , we sparsely sample the surface, while if , we oversample it. For instance, for an -ball neighborhood search and ellipse-bounded paraboloid patch fitting we can estimate the expected number of patches for this criteria as


When the robot is moving, we also require real-time performance and so impose a time limit on fitting new patches at each map update.

4.6 Homogeneous Patch Map

Fig. 9: The concept of the homogeneous patch map with a sparse set of patches both on the trail surface (green) and possibly also on the robot itself (brown), spatially mapped with quantified uncertainty.

Salient patches, from the algorithm proposed in this section, could form the basis for a homogeneous patch map; a dynamically maintained local spatial map of curved surface patches, suitable for contacts, both on and around the robot. Fig. 9

illustrates the idea, including both environment surfaces and contact pads on the robot itself (potentially uncertain, due to kinematic errors). We leave the modeling of patches on the robot, using kinematics and/or proprioception, as future work. The homogeneous patch map could provide a sparse “summary” of relevant contact surfaces for higher-level reasoning. Upcoming terrain patches can be detected from a distance, using exteroceptive sensing (with high uncertainty), while the map could be further refined, using for instance proprioception for less uncertain patches on the robot itself (e.g. heel, toe, foot sole), as the robot makes contact with the patches.

We envision classic elements of SLAM [SC86], to be applied to the introduced map. For instance, different observations of the same surface patch can be associated, using propagation of spatial uncertainty through kinematic chains. Optimal data fusion, by Kalman update, is also supported by the patch covariance matrices. First-order propagation of uncertainty through a chain of transforms, with covariances , is facilitated by the chain Jacobian , described further in [VK11]:


where is the covariance of the pose of a patch at the end of the chain, relative to the base.

Time Complexity

Stages I–III are , i.e. linear in the input. The implementation of Stage IV is (it could be improved to , by switching to breadth-first search on a triangle mesh, but we found the constant factors, favor the image backprojection method for neighborhood search, in practice). The time complexity of Stage V is dominated by for Step 13. Steps 14–16 are . The worst case time complexity for the whole algorithm is thus .

5 Patch Tracking

In the previous section, we described the algorithm that creates a patch map of the environment around the robot. During locomotion, patch tracking is also required to find and add patches to the map online, track them as the robot moves while new frames are acquired, and drop them when they are left behind. For tracking, the pose of the range sensor, with respect to the volume frame, is needed at every frame . This is a well-studied problem in the context of Simultaneous Localization and Mapping (SLAM) [DB06]. For a walking robot, a main issue is a potential shaking or jerky camera motion, during locomotion with intermittent contacts. We introduce a GPU-based method, for real-time camera tracking, including shaky and jerky motions, by extending the Moving Volume KinectFusion system [RV12, IKHMNKSHFDF11]. We review this system in Sec. 5.1 and we describe our adjustments for our walking robot system.

5.1 Moving volume KinectFusion review

The KinectFusion algorithm was introduced in [IKHMNKSHFDF11], for real-time 3D depth camera tracking and dense environment mapping, using a GPU. To describe the system, we extend the notion of the volume, introduced in Sec. 4, to a Truncated Signed Distance Formula (TSDF) volume [CV96]. A TSDF volume consists of voxels that represent a portion of the physical world, by the signed distance from a physical surface and a confidence weight, representing data reliability. A point cloud can be produced using either raycasting [PSLHS98] or the marching cubes method [LC87]. As new depth images are acquired, two main processes alternate:

  1. Camera Tracking: the Generalized Iterative Closest Point algorithm (GICP) [SHT09] updates the camera-to-volume transformation

  2. Data Fusion: the distance and confidence values are updated in all TSDF voxels, given the newly observed depth image

This system is ideal for our purposes, not only because it can reliably handle shakes during locomotion, but also because hole-filling and outlier rejection are some of its properties. The original system had the TSDF volume fixed in the physical space. Our Moving Volume KinectFusion system [RV12] extended it to a moving TSDF volume, using the intermittent moving policies, we described in Sec. 4. This is implemented by remapping (translating and rotating) the volume when needed, leaving the camera and the cloud fixed, relative to the physical world, but moving the TSDF volume to coarsely follow it. For legged locomotion, we would also like to have: (1) a task-specific way to set the inputs to the moving volume policies and (2) a point cloud around and under the robot’s feet, not only where the real camera is facing. To handle these two requirements, we modified the original system as follows.

Adaptations to moving volume KinectFusion

We use the gravity vector (from the IMU) to constrain rotation of the volume, so that the volume -axis remains vertical. This adaptation assumes that the robot is locomoting in a standing-like pose. We also modify the way that raycasted point cloud is computed from the TSDF. Instead of raycasting from the real-camera viewpoint, where points near the feet of the robot may not be visible, we do it from a virtual birds-eye view, as long as the TSDF volume voxels have already captured some surface information from previous frames. We define the virtual birds-eye view camera reference frame to be axis-aligned with the TSDF volume frame, but with its -axis pointing down (along the volume frame -axis). The center of projection of the virtual camera is at a fixed offset distance above the location of the real camera, and its width and height (in pixels) are set at fixed resolution, e.g. px. An example of such a virtual camera can be seen in Fig. 8-(i), where the point cloud covers the surrounding area around and under the robot. This allows patches to be fit under and around the feet, using data that was observed in prior frames.

5.2 Patch Mapping and Tracking Algorithm

First, we initialize the volume, such that its -axis co-aligns with the gravity vector, and it -axis co-aligns with the real camera’s forward vector. We set the initial volume location so the initial camera position is in the middle of the volume. We split the volume’s -plane into grid cells, as described in Sec. 4.3. For every new frame , the full patch mapping and tracking algorithm proceeds as follows:

  1. Data Acquisition

    1. Acquire a new frame of RGB-D and IMU data.

  2. Patch Tracking

    1. Update camera pose in the volume frame, from camera tracking.

    2. Update the TSDF volume voxels with the fused data.

    3. Remap (i.e. rigid body translation and/or rotation) the volume, according to the moving policies.

    4. If the TSDF volume was remapped:

      • Update each patch’s pose in the volume frame with the same rigid transformation.

      • Discard the patches that have moved outside the volume. Optionally, also discard patches that are further than behind the camera.

      • Update the association of existing patches to grid cells in the volume frame -plane.

  3. Patch Mapping

    1. Raycast a point cloud from the birds-eye view (virtual) camera in the TSDF.

    2. Find, fit, and validate salient patches (Sec. 4), until either a clock time limit is reached or the maximum number of patches are in the map.

Camera tracking is part of the KinectFusion implementation we use. We have found it to be quite robust, even in the presence of jerky motions. However, if camera tracking does fail, then the whole patch map is reset and patch mapping and tracking are re-initialized.

6 Experimental Results

Fig. 10: Physical hardware of the RPBP mini-biped and the walking table apparatus, with four rocks in fixed positions.

We performed two types of experiments to test the patch mapping and tracking system. First, we present experiments that test the bio-inspired perceptual framework, using data that were acquired by humans, who were carrying an RGB-D+IMU sensor rig. We include a quantitative comparison between our algorithm results and patches that humans tend to selected when locomoting.

To experimentally validate the patch mapping and tracking system for robot locomotion, we also developed a 12-DoF mini-biped robotic platform, called the Rapid Prototyped Biped (RPBP), shown in Fig. 10. We briefly describe its hardware and software design and then present experimental results, where the robot uses our patch mapping and tracking system to identify rocks in a laboratory setup and a simple motion control system to step on them.

6.1 Results From Human-Acquired Data

We tested patch fitting and mapping, on data acquired with a hand-held Kinect, from an outdoor trail. Acquisition was on an overcast day, so the sun did not overpower the sensor.

6.1.1 Patch Mapping Experiment

In a first experiment, we applied patch fitting, validation, and mapping, with random seeds on rock areas, that we selected from the data. We used random seeds instead of our seed saliency filter, to study how the fitting and validation algorithms behave in isolation. We let patch mapping run for each dataset, until patch areas covered of the surface, with patch radius m, residual threshold m, coverage cell size m, coverage threshold factors , and .

Qualitatively (e.g. Fig. 4), the fitting and validation algorithm appears to give a good representation of the underline surfaces. Quantitatively, we measured the following statistics (Table II): the total and valid number of patches, the number of dropped ones due to residual or coverage, the average Euclidean residual of the valid ones, and the total surface area for each dataset. We notice, that of the total fitted patches are valid after the evaluation and that the dropped ones, due to residual (depth discontinuities) and coverage (unevenly sample distribution), are equal. We also notice, that for a 5.43m average area, the residual error averages only 4.9 mm, which is a good indicator that the fitting process produces accurate curved patches to represent uneven local surfaces.

Dataset Patches Dropped patches Avg res Total area
total valid due to
due to
1 167 142 15 10 25 4.6 4.57
2 160 107 18 36 53 5.0 3.13
3 231 183 24 24 48 5.2 5.53
4 220 164 27 31 56 5.0 5.01
5 195 157 17 24 38 5.4 4.86
6 235 185 31 20 50 5.7 5.69
7 267 213 30 25 54 4.4 6.54
8 260 223 16 22 37 4.2 7.04
9 187 159 13 16 28 4.8 4.95
10 301 223 30 50 78 5.0 6.98
avg 222 176 22 26 47 4.9 5.43
TABLE II: Quantitative statistics of randomly seeded patch fitting

6.1.2 Human Comparison Experiment

As discussed in Sec. 4.2.1, we took video recordings of the footsteps for five humans, walking on trails while carrying an RGB-D+IMU sensor rig. We visually matched all these footholds to corresponding (pixel, frame) pairs in the RGB-D+IMU data (total 867), where we fit patches (Fig. 7). We then took statistics (min, max, median, average, and standard deviation) of these patches (labeled “man” in Table III) to set the parameters used in the DoN and DoNG measures. We ran the patch mapping algorithm to automatically fit salient patches in the same dataset and collected the same statistics (labeled “auto” in Table III). Qualitatively, we confirmed that: (1) there are no patches fitted further than 0.7 m from the fixation point, (2) there are no patches in areas with large slope, and (3) there are no patches with very large curvature. Quantitatively, the statistics for “auto” fit patches correspond to the “man”’s average (human-selected), plus (minus for ) . Across all 82,052 patches that fitted in data frames, about 1.4% of patches were dropped due to the curvature, residual, and coverage checks. This relatively low number, indicates that the saliency checks performed prior to fitting, improved the time efficiency in fitting good patches, compared to the results of Table II, where patches were seeded randomly.

measure min max med avg std
DoN () 0.00 26.94 4.17 4.95 3.45 man
0.00 15.31 2.83 3.65 2.85 auto
DoNG () 0.24 44.85 11.37 12.34 7.54 man
0.00 34.96 11.89 13.40 8.08 auto
(m) -19.07 13.04 -1.70 -1.91 3.89 man
-11.97 7.64 -0.87 -1.14 1.75 auto
(m) -7.87 28.94 3.78 4.62 5.02 man
-7.81 16.97 1.01 1.32 1.76 auto
TABLE III: Quantitative statistics of human selected vs. automatically fitted patches.

Computational Cost

On commodity hardware (one 2.50 GHz core, 8GB RAM) and a RGB-D image, preprocessing (bilateral filter and downsampling) run in 20 ms total. Normal computation, DtFP, DoN, and DoNG saliency take 35 ms. Neighborhood finding takes 0.03 ms per seed, and patch fitting and validation are 0.8 ms total per neighborhood, with . The total time elapsed per frame is , where is the number of patches actually added. can range from , in the case that the map is already full (or there are no new seed points), up to . In practice we additionally limit the total time spent per frame to e.g. , allowing up to around patches to be added per frame, in this configuration.

6.2 RPBP Robot and Software Interface

RPBP is a 3D printed mini-biped robot, which weighs 1.7 kg and it is cm tall. Its 6-DoF legs are kinematically similar to the DARwIn-OP humanoid [HTA11]. The actuators are the Robotics Dynamixel MX-28, with high resolution magnetic rotation sensors and PID control. We used the PrimeSense Carmine 1.09 depth camera, which is a sensor similar to the Microsoft Kinect, but with better close-range capabilities. We mounted a CH Robotics UM6 IMU sensor on top of it. The UM6 has a 3-DoF accelerometer, a 3-DoF gyroscope, and a 3-DoF magnetometer (9-DoF in total). The robot has a lightweight tether for off-board power and 3 communication channels to an external computer. We used RS-485 Dynamixel (DXL) communication for controlling the motors and USB 2.0 for the UM6 IMU and the Carmine 1.09 camera.

The software interface is implemented in C++ and is divided into the perception and control subsystems (Fig. 1). The perception subsystem builds on PCL [RC11] and includes: (1) a library we wrote that implements an RGB-D+IMU frame grabber, providing 30fps depth and 100fps IMU data, (2) our Moving Volume KinectFusion library, and (3) our Surface Patch Library (SPL) library [SPL14], that implements the patch mapping system, where salient patches fit in the environment. The perception system provides a set of patches to the control system, which includes our Robotis Dynamixel-based communication library and the RPBP walk controller. The walk controller maps from a small library of known patch positions relative to the robot to corresponding predefined step motions. At runtime, the walk control system finds matches between the patches from the perception system and those in the library, then executes the corresponding motion sequence, when an approximate match is found.

6.3 Application to Bipedal Foot Placement

Fig. 11: Top: Software interface for the RPBP robot. Bottom: RPBP detects a patch on rock 1 and places its foot, using the approximately matching predefined motion.

Bipedal locomotion in rough terrain is one of the most challenging tasks in robotics. In this paper, we study the case of non-point feet, where the robot uses the foot sole contact area to support torques for balance. In this section, we test our perception hypothesis, by integrating the proposed patch mapping and tracking framework into a real-time footfall selection system on the RPBP biped, for foot placement on some rocks. The focus of the experiments is perception and we thus use a very simple control system, where the robot executes a set of predefined leg motion primitives [MLB13], driven from the type of patch that is selected for contact. We thus manually train a library of patches and a motion sequence for each one of them.

Our apparatus (Fig. 10), includes a table with 4 solid rocks in fixed positions. The robot is always attached to a safety belay, which is not intended to affect its motion significantly, i.e. it is slack. We manually posed the robot to place its foot on four different types of patches on rocks, recording motion keyframes into a library of stored stepping motions. We then, placed it in front of one of the same rocks and let it create a patch map and find a match between the trained patches and one of those in the map. If a match is found, we let it run the corresponding trained motion sequence and place its foot on the rock. To our knowledge this is the first perception system that lets a biped place its foot on very rough terrain.

Foot Placement Training

During the training, we first let the robot be in its starting lookdown pose, as appears in Fig. 11. We place the robot in a pre-defined position in front of each of the four rocks and we let our KinectFusion system provide us with a point cloud from the virtual camera. For each rock, we manually select a neighborhood where we would like the robot to place its foot and we fit a patch. We then, train the robot to place its foot on the patch, with a corresponding motion sequence. This forms a library, where patches are stored relative to the robot and corresponding motion sequences. Similarly, we could train the robot to place its feet on various other positions, but this goes beyond the intention of this experiment, that focuses on perception, not control or trajectory planning.

Foot Placement Tests

We place RPBP in front of each rock in the lookdown pose. Then, we let the perception system create a patch map using seeds in the area around the robot’s feet, i.e. cm in front of the robot. A patch matching is then performed, by comparing every patch in the map with every trained one. The similarity comparison between two patches proceeds as follows:

  1. check if the patches are of the same surface type (flat or elliptic/hyperbolic/cylindric paraboloid)

  2. check if the absolute difference between their boundary parameters are smaller than a threshold m

  3. check if the absolute difference between the curvatures are smaller than a threshold m

  4. check if the angle between their normal vectors ( axis) is smaller than a threshold (we also consider discrete rotations of the patches here, to check all symmetry cases)

  5. check if the distance between their position relative to the robot is smaller than a threshold m

To compare normal vector angles, we consider every possible symmetry. If any patch in the map matches with a trained one, we execute the corresponding motion sequence.

We ran the aforementioned experiment twenty times for each rock, in which the robot never failed to match the correct trained patch and successfully run the motion sequence of placing its foot on the rock. Success was defined as maintaining balance and ending with the foot on the rock. We also placed the robot in front of a few other rocks, that were not in the training set, and only when the geometry of the rock was matching to the trained one (in terms of patches), the motion was executed. An example is visualized in Fig. 11, where the robot detects a match with the corresponding trained patch and executes the motion sequence. During the motion execution, we also noticed that the robot tracks very accurately a matched patch. More experimental details can be found in [SPL14, Kanoulas14].

Our basic controller is not a full solution to rough-terrain bipedal locomotion, so this experiment did not fully test the ability of the robot to walk continuously, while mapping and tracking patches. Thus, we performed a separate experiment, where RPBP walked in a circle on a flat surface, while mapping and tracking patches on nearby rocks. This verified that the KinectFusion camera tracking was sufficient to maintain camera tracking on a walking robot and that our patch mapping and tracking algorithms were able to maintain a patch map around the robot as it moves through the environment. A video with a set of patch mapping and tracking results can be reached at the following link:

The patch mapping framework has been also successfully used on the WALK-MAN full-size humanoid robot during the DARPA Robotics Challenge (DRC) 2015 finals. In particular, we developed an interface, where the pilot could select manually a sequence of areas to fit foothold patches in the environment and use the roughly flat ones for dynamic stepping.

7 Discussion and Limitations

7.1 Discussion

There are various ways to represent environment surfaces in the computational stack from perception to planning and control. A common approach is to allow the perception system to report arbitrarily complex dense surface geometries – e.g. point clouds, triangle meshes, NURBS patches – to the planning and control systems. Our set of curved patch types is a proposal in the opposite direction where the perception system identifies a sparse set of (possibly overlapping) foot-scale patches of nearby terrain, each of which is described by only a few parameters. This is an architectural choice intended to make building the planning and control algorithms more tractable. The planning system could select a specific patch for the next footfall and communicate just that patch to the control system. The dynamics of the support phase could then be handled by considering contact with only that patch. In this paper we focus mostly only on perception, but in [KZNKCT2017, KWRZTSCT2018] we implemented fast footstep planning using the curved patches introduced here.

We include curved vs just flat patches for several reasons. First, in very rough terrain, it may not always be feasible to find flat surfaces. Humans do sometimes step on rocks with major curvatures, such that either tangential/rolling (positive curvature) or fixed (negative curvature) contacts take place. Bipedal robots which eventually approach the level of capability to walk on very rough terrain that humans do may also need to utilize such non-flat contacts.

Second, though it is true that most current humanoid and bipedal robots have flat feet, in the future this may not always be the case. Using curved patches also enables extension to quadrupeds which often have spherical feet. Curved patches can also be used for other tasks, such as grasp affordances for manipulation [KLCT2016].

Third, curved patches could also be used to represent surfaces on the robot itself, as shown in Fig. 9, so that robot-environment interactions could be reasoned about as homogeneous patch–patch contacts [KZNKCT2017].

One could imagine a different taxonomy of patch types than the one we proposed. For instance, the four paraboloid types with different curvature signs could be reduced to a single generic one. Our reasoning to keep them separate is that contact kinematics would likely be significantly different for these types, so identifying them in the perception system could offer a clearer choice of options to the planning and control systems. Similarly, the four planar types with different boundary shapes could be reduced to one with a generic boundary, or the planar type could just be a special case of a generic paraboloid patch which happens to have zero curvatures. We feel that these choices would mostly just push the task of differentiating these types of surfaces from the perception to the planning and control systems. (We mainly include spherical and cylindrical patches in the taxonomy to more accurately represent man-made surfaces, but they could be cleanly omitted from implementations if this is not needed.)

7.2 Limitations

The patch mapping and tracking framework we propose shows some positive and promising results on the difficult problem of bipedal foot placement on non-flat surfaces. The proposed system has some limitation though that we discuss briefly in this section.

We use RGB-D data from a range sensor, e.g. a Kinect or PrimeSense Carmine 1.09. These sensors are often very accurate for data acquired close to the device. Thus, the data uncertainty, especially for a mini-bipedal robot, is smaller than the uncertainty that is introduced, for instance, due to the kinematics of the legs. We have done an uncertainty analysis in [KTV2016], but integrating uncertainty from the robot kinematics into the patch map would be an interesting extension we have not yet developed.

Using the patch uncertainty based only on RGB-D sensor uncertainty was thus not strictly necessary for a mini-biped. This is not true when more noisy sensors on bigger humanoid robots are used, such as the Multisense-SL sensor that is installed on our WALK-MAN humanoid. In this case, the sensor uncertainty can play an important role.

Moreover, we use only the depth information of the range sensors. The RGB data are ignored, but could play an important role, for instance when semantics about the environment need to be extracted, e.g. for finding rigid terrains. In addition, from the IMU data we considered only the gravity information. Other data could be extracted and used in patch selection, such as the direction/speed of movement.

Our curved patch models use second degree polynomials which may be concave, convex, flat, or saddle-shaped. A limitation comes with environment surfaces that have more complex structure, but we believe that this can be handled with impedance or torque control stepping methods, in a low-level control fashion. This is also related to the patch-size. In our current work, we pick patches slightly bigger than the foot size. In nature, though, there are cases a foot may need to make partial contact with smaller surfaces. Also related is the patch validation that we introduced, which drops patches that are partially not appropriate for contact; it may be interesting to attempt to use these patches if possible.

A limitation in the patch mapping and tracking part is the use of the GPU for the Moving Volume KinectFusion system. Even though most of humanoids at the moment are able to carry GPUs, it would be interesting to make the approach more energy efficient by using the CPU instead.

Given that graph-based path planners may need a bigger set of patches to expand on their states, it would be interesting to make the patch fitting faster and/or design a real-time path planner-oriented seed selection algorithm. In terms of the human-based saliency measures, an interesting direction could be the analysis of human gaze patterns.

Last but not least, the experimental part tests a simple controller to prove the concept of the patch-based contact stepping. A better quasi-static or dynamic controller integrated with a footstep path planner is required for more meaningful and realistic locomotion experiments. In [KZNKCT2017], we used a dynamic locomotion module. In addition, more rough/unstructured rocks should be used in the future experiments.

8 Conclusion

This paper introduced a novel real-time mapping and tracking system for bipedal locomotion based on bounded curved patches. We ran experiments for mapping and tracking patches and for foot placement on some rocky terrains using a mini-biped with on-board RGB-D and IMU sensors. In separate work, our patch mapping and tracking algorithms were also used for fitting manually picked footholds, during the 2015 DARPA Robotics Challenge, for the WALK-MAN humanoid robot. We envision this perception framework to be part of a path planning and humanoid walking system, using the relatively sparse map of curved patches, instead of a dense environment representation. We also plan to use the patch uncertainty in a state estimator system data fusion and foothold selection.