Over the last few years, the automotive industry has focused on developing autonomous vehicles to reduce accidents and increase independence. As an intermediate step toward fully autonomous vehicles, active safety technologies such as adaptive cruise control, blind-spot warning, and automatic parking have grown in importance. These features rely mostly on sensor-based technologies that try to understand the host vehicle’s surroundings, i.e. to detect dynamic and static obstacles within a certain range. For example, moving objects such as pedestrians and vehicles can be detected to warn drivers to be cautious, and automatic detection of road signs can be used to control or adjust the vehicle’s speed.
Curbs and sidewalks (in addition to road surface markings) are cues exploited by positioning systems, and they often indicate the boundary of parking areas. Technologies that can accurately detect a curb and estimate its location and height are embedded in assisted and autonomous parking systems: they make it possible to predict the vehicle-to-curb distance and hence to avoid a collision between the curb and the vehicle, which would damage tires and bumpers. The constraints put on such systems are very demanding: a near-zero false negative rate, and distance and height estimation with centimeter accuracy.
Building such systems poses at least two challenges. First, curbs are small objects. This calls for sensors of very high resolution, so that the object of interest covers a sufficiently large region of the captured data. Second, curb shape and appearance can vary drastically (depending on weather conditions, pavement material, paint, etc.). This requires advanced recognition techniques capable of robustly classifying an object as a curb or not.
Our system estimates the relative position, orientation and size of a curb w.r.t. a host vehicle by means of a monocular forward-viewing fisheye camera, advanced geometrical reasoning, temporal analysis and machine learning.
The most common techniques in the literature mainly tackle the first issue, i.e. 3D detection, using active sensors (LIDAR, laser range finders, etc.) or stereo-vision camera systems. These approaches assume that the curb’s shape is in itself a discriminative and robust feature. They are likely to fail in poor SNR conditions (e.g. bad weather) or when facing damaged curbs. To our knowledge, few works have attempted to couple 3D geometry with vision-based appearance models so as to improve robustness and accuracy.
In this paper, we describe a system intended to assist drivers and reduce the risk of running over obstacles, such as curbs, in everyday parking maneuvers, thereby preventing unwanted damage and accidents. A single monocular wide-angle camera is used as the input device, which is one of the most cost-efficient solutions available today. The cost of such sensors can be several orders of magnitude lower than that of more complex devices such as LIDARs and stereo cameras. Our system works in real time while maintaining a high detection rate and accurate estimation of the curb parameters. It relies on geometric reasoning, simple hand-crafted features (image edges and Histograms of Oriented Gradients, HOG), model-based machine learning techniques (Support Vector Machines, SVM) and temporal filtering. Our Inverse Perspective-compressing Mapping (IPcM) technique approaches curb edge detection in a scale-invariant manner, without the need to maintain a large multi-scale image space in order to reliably handle the broad variations of the curb size in the image. All of this allowed us to arrive at a CPU-based implementation, which gives our system a high level of flexibility. For example, the current setup consists of a single front-mounted fisheye camera, which is applicable to perpendicular and diagonal forward parking scenarios. Attaching a few more cameras that cover the lateral and rear perimeters around the vehicle would extend the system’s applicability to perpendicular reverse and in-line parking. These advantages would also let our system operate on a single computing platform in conjunction with algorithms based on deep models that solve more challenging tasks, such as motion planning and control. In that case, each subsystem would occupy a separate processing unit: our system the CPU, and the deep models the GPU.
II Related Work
Research on curb detection algorithms can be organized according to the types of sensors employed. LIDARs are among the most popular active sensors. They provide 3D point cloud data based on laser scanning and range measurements. In [zhao_curb_2012], for example, the authors voxelize the LIDAR data for computational efficiency and detect the voxels containing ground points, based on elevation information and plane fitting. The candidate curb points are selected using three spatial cues. False positives are removed by employing a short-term memory technique along with a parabolic curb model and RANSAC. A particle filter is used for temporal curb tracking. In [yao_road_2012] the curb is modeled as a parabola and Integral Laser Points features are used for speed-up. Instead of temporal filters and spline fitting methods, [hata_robust_2014] uses a robust regression method that copes with occlusions, called Least Trimmed Squares (LTS), in combination with Monte Carlo localization. In [chen_velodyne-based_2015] the LIDAR scan lines are processed directly instead of extracting features. Initial curb point candidates are determined by the Hough transform, and then iterative Gaussian Process Regression is used to represent the curb models. In [zhang_real-time_2015] the parabola model is employed as well, but the tracking technique is based on a Kalman filter in combination with GPS/IMU data.
Another active sensor which provides 3D point cloud data is the Time-of-Flight (ToF) camera. It extracts the depth information from a 3D scene based on the phase shifts of light signals, caused by the different times they need to travel through space, bounce off the objects and return to the camera. In [scheunert_free_2007] the authors take advantage of the ToF camera’s high frame rate to improve the results by space-time data accumulation using a grid-based approach. For estimating the ego-motion parameters they employ a Kalman filter. In [gallo_robust_2008] the CC-RANSAC method is used for improved plane fitting to the raw point cloud data.
Laser range finders (LRF) are active sensors from the LIDAR family, but instead of providing 3D point cloud data they usually scan just a single line and estimate the distance to each measurement point along it. The curb detection algorithm in [byun_autonomous_2011] works in two steps. First, the authors detect the potential curb positions in the LRF data; then they refine the results with a particle filter. In [liu_curb_2013], sequentially captured LRF data is used to build local Digital Elevation Maps (DEM), and Gaussian process regression is used at the final curb detection stage. A set of three LRF sensors is used in [pollard_step_2013]. Peak detection is performed on the output of the derivative-based distance method described there, and the data from the individual sensors is then merged.
A popular passive sensor is the stereo camera. Like LIDARs, it provides 3D point cloud data, with higher resolution but usually more noise. DEMs are often used to represent the 3D data efficiently in curb detection. In [oniga_curb_2008], edge detection is applied to the DEM data to highlight the height variations. The noise in the stereo data is reduced significantly by creating a multi-frame persistent map. A Hough accumulator for lines is built from the persistent edge points. Each curb segment is refined using RANSAC to optimally fit the 3D data of the curb. In [siegemund_temporal_2011], a 3D environment model is used. It consists of various primal entities, such as road, sidewalk, etc. The 3D data points are assigned to the different parts of the model using temporally integrated Conditional Random Fields (CRF). In [oniga_curb_2011], temporal integration of the DEM data is also used, but in combination with least squares cubic spline fitting. The algorithm described in [kellner_multi-cue_2015] combines the 3D point cloud data with the intensity information from the stereo camera. First, model-based curb features are extracted from the 3D point cloud and then validated using the intensity image data. The curbs are represented as 3D polynomial chains. The approach described in [fernandez_curvature-based_2015] is based on the curvature of the 3D point cloud data, which is estimated using a nearest-neighbor approach. The method’s performance is evaluated on both stereo camera and LIDAR data.
All the sensor types used in the algorithms above directly provide some kind of 3D information, either as a point cloud or line-wise. Monocular camera data, however, completely lacks depth information. Extracting curbs from it is therefore a challenging task that usually rests on preliminary constraints and assumptions. In [haltakov_scene_2012] the image is divided into a regular grid of cells. Then 3D reconstruction is performed through pixel-wise image labeling based on CRFs. In [seibert_camera_2013] the vehicle’s CAN-bus data is employed in addition to the camera. Two complementary methods are then applied: localizing borders using texture-based area classification with local binary pattern (LBP) features, and tracking Harris features with the Lucas-Kanade tracker to extract 3D information. The curb detection system described in [prinet_3d_2016] is closely related to our approach. It also uses a fisheye camera and incorporates Histogram of Oriented Gradients (HOG) features. Unlike our approach, it preserves the original camera image; hence, its curb model is polynomial. The temporal filtering is based on a Kalman filter.
III System Description
III-A Algorithm summary
The objective of our system is to detect the presence of a curb close to the vehicle, along its forward motion path, and to help identify whether it is an immediate threat to the vehicle’s integrity by estimating its position, orientation and size relative to the vehicle. These parameters constitute the system state vector, described later in Section III-B. The system tracks just one curb at a time, as only the one closest to the vehicle is significant.
Fig. 2 graphically illustrates the paradigm our system is founded on. Its shape resembles an inverted pyramid situated in the Time/Entropy plane. The pyramid’s tip points to the moment of time which corresponds to the last captured camera frame (image). The width of its layers (and their coloring) depicts the overall entropy rate in terms of the state vector estimation. The paradigm is inspired by the idea of the attentional cascade presented in [ViolaJones], but instead of boosted classifiers we use various filtering techniques. The processing flows from top to bottom, and each stage is meant to reduce the entropy of the system’s state until the uncertainty is low enough that a reasonable inference about the values of the curb parameters can be made.
The cascade consists of two major layers, which handle the information from the space and time perspectives. Immediately after a new image is delivered by the camera at time , it is fed to the “Spatial domain” pyramid layer, which consists of three sub-layers. The first one searches the image for the curb’s individual primitive structural elements. For this purpose we use the curb’s edges, since they are straight lines in 3D. Projecting them onto the camera’s image plane preserves their straightness, because perspective projection maps straight lines to straight lines. Therefore, the curb’s edges can be detected just by performing line detection in the image. This part of our algorithm is described in Section III-E1.
Next, we raise the level of generalization by using the curb’s geometry itself, i.e. the configuration of the primitive structural elements (points, edges, faces) that constitute its 3D structure. Here, our algorithm relies on prior knowledge of the curb’s geometry and on the 3D-to-2D correspondence in order to estimate which compositions of image lines could represent the projection of a 3D curb-like body. All the successful guesses form the initial hypothesis set of curb candidates. Its outlying members are meant to be rejected by the next layers of our paradigm pyramid. A detailed description is presented in Section III-E2.
The last operation in the spatial domain shifts the focus from the curb’s geometry to its appearance in the image. Here we perform object detection to validate every curb candidate by means of machine learning techniques. More information can be found in Section III-E3.
The purpose of our system is to estimate the curb parameters while the vehicle is in motion. Fortunately, the vehicle is a considerably massive object with predictable kinematics. Hence, the evolution of the curb’s parameters (the system state) in the “Time domain” follows smooth trajectories with no abrupt discontinuities. In other words, our system deals with an environment that obeys temporal continuity. Thus, from all the curb candidates we can select only those which comply with it, and also make reasonable predictions of the system’s future states. Our curb tracking technique is described in Section III-F.
The bottommost layer of the pyramid represents the residual uncertainty which our system, being non-ideal, cannot resolve; that is, it cannot bring the entropy down to the theoretical value of 0, which would mean 100% confidence in the curb parameter estimates. Our goal is to minimize it, and in Section IV we present results which demonstrate the promising performance of our system.
III-B Geometrical considerations and definitions
Here we describe the fundamental assumptions and constraints our algorithm is based on.
We consider the road to be a perfectly flat structure. Although real roadways are not ideally planar, due to designed slopes or deformations caused by use, their surface curvature is smooth enough to allow such an assumption, considering the size of our system’s working area (the Curb Detection Domain, defined below in III-B4) and the amount of deviation from the perfect plane within it. The road plane is denoted as and its projection in camera image as (Fig. 1).
The camera is mounted in the middle of the vehicle’s front and points in the vehicle’s forward movement direction. The camera’s projection center (focal point) is elevated above the road at a fixed distance and this is where camera’s coordinate frame originates (Fig. 1). During parking the vehicle’s speed is relatively low, therefore we can neglect the motion induced by the vehicle’s suspension system. Furthermore, for the sake of simplicity and computational efficiency we define that camera’s plane is parallel to the road plane (Fig. 1).
We define the curbs as rectangular prismoidal rigid structures, which determine road plane borderlines. They are usually significantly elevated above the road surface and often specify the boundaries between the road and sidewalk, for example. We also assume that the curbs have negligibly small fillets (roundings) of the corners, which results in clear and abrupt brightness transitions (edges) in the camera images.
Our algorithm uses curb’s edges as primal features. They are straight, easily detectable and can help us to drastically reduce curb detection time. We would like also to note that if a curb appears in the image, three of its horizontal edges and two faces defined by them will always be presented in it (Fig. 3a and b). Let , and be the three lines depicting curb’s lower front (base), upper front and rear edges, respectively (Fig. 1). Accordingly, , and are their projections in the image (Fig. 3b).
Curb detection and localization aim to the estimation of four basic curb parameters: – the distance between the curb and camera, – the rotation angle of the curb about vertical axis, and – curb’s height and depth, respectively. To precise the definition of we introduce the notion of curb’s reference point (Fig. 1), which is the intersection between the plane and – curb’s base edge. Then can be described as the distance between and the orthogonal projection of the origin on the road plane . As a consequence of our system’s simplified geometrical configuration (see above), is equal to the -coordinate of in camera’s frame . can be defined as the angle between camera’s axis and the edges , is the distance between and , respectively, is the distance between and (Fig. 1). Semantically, we split curb’s parameters in two groups – essential and secondary. , and are considered as essential, because they provide enough information to determine the safe-clearance between the curb and vehicle. is considered secondary and it is used for an additional information cue to improve system’s reliability during operation.
We define system’s state vector as follows
and it holds all the curb parameters. Their estimation is accomplished by the means of a 3D parametric template (shown in Fig. 3c). Similar to the curb, it consists of two orthogonal rectangular faces and three edges , and which correspond to the curb’s ones. Consequently, their projections in the image plane are , and , respectively. The template has the same set of parameters as the curb and they are organized in the template’s state vector
Detailed description of the fitting procedure is presented in Section III-E2.
A curb detection in the image frame at time is considered as successful, if at least its essential parameters are estimated correctly. Therefore, we need to determine at least the position and orientation of and w.r.t. the camera from their projections and , respectively.
III-B4 Curb detection domain and image’s curb searching region
From a practical point of view, we define the rectangular area of the road plane directly in front of the vehicle as the Curb Detection Domain (CDD) (Fig. 4). In essence, the CDD determines the system’s domain of definition (or operation) w.r.t. . is chosen to be 500 cm, whereas is calculated from camera’s vertical Field of View (FoV) bottom boundary. In our setup its value is cm. The CDD’s side limit is derived from FoV as well and is rounded to 130 cm. The total area covered by our system’s CDD is . The four sides of CDD are defined by the lines , , and . Their projections in the image are respectively , , and .
In order to reduce the amount of data being processed for each frame and eliminate the influence of outliers, we introduce the notion of Curb Searching Region (CSR) in the image (Fig. 5b). It defines the image area used for extracting curb features. Its size and position are variable and determined by the expected curb location at the time of current camera frame. Essentially, it represents the projection in the image of a CDD subregion which has the same width, but significantly shorter length. Its initial position is set at the far end of CDD and in Curb tracking mode (see Section III-F) the system updates its position accordingly.
III-C System calibration
All images from the camera are rectified before any further processing to eliminate the radial distortions introduced by the fisheye lens (Fig. 5). Otherwise, curb detection would be much more complex, involving the detection and fitting of second- or higher-order curves. We employ the fisheye camera model described in [fisheyeCalibration] to estimate the camera’s intrinsic parameters in an offline calibration procedure using a planar target. In the rest of this paper, the term “image” refers to the rectified version of the original image, unless explicitly stated otherwise.
The next step is to ensure that the camera’s extrinsic parameters follow the geometric definitions above (Section III-B). We do not estimate those parameters through a calibration procedure. Instead, we set some of them manually. For instance, the camera tilt angle should be set to zero. To achieve that, we employ a simple four-step calibration procedure, illustrated in Fig. 6-top. A point target (marker) mounted on a height-adjustable stand (tripod) is used. Repeating these steps 3-4 times ensures that the camera orientation converges to the desired state. This process not only sets the correct orientation of the camera, but also yields an accurate estimate of the distance , which is the distance from target center to the ground and can be measured with a ruler.
Fig. 6-bottom illustrates our technique to estimate the horizontal offset between camera and LIDAR origins – , which is needed when evaluating system distance measurement capabilities. The peculiarity here is that the position of camera coordinate frame origin is hard to be measured. Therefore, we came up with a technique to estimate its position implicitly. More precisely, we do not measure its actual position, but the position of its projection on the road plane . Thanks to our geometric setup this is sufficient to estimate . We use a 3D model of a planar large-scale ruler with a regular grid (10 cm per division). First, we create a virtual 3D model of it, which is coplanar with the road plane , its origin coincides with and its measurement axis is parallel to . Then we project this virtual model of the ruler to camera plane and overlay its projection in the image. Next, we take a printed version of the same ruler model with the same scale, lay it in front of the camera and fit it to the overlaid projection of its virtual analog. When finished, we know that the origin of the printed ruler corresponds to the origin of the virtual one and, consequently, to . Afterwards, we use a test obstacle to initiate measurements from the LIDAR and the camera, through the ruler (see Section III-D). By averaging the differences of those pairs of measurements we can calculate .
III-D Measuring distances with a monocular camera
The plane of camera’s coordinate frame is illustrated in Fig. 7. Let be a point from the road plane with coordinates in camera’s coordinate frame. Its projection onto the image plane is the point w.r.t. the frame originating at the principal point . The relation between their coordinates is
where , and are camera’s focal lengths along its and axes, respectively. From (3) we get , which means that depends only on , since and are practically constants. Furthermore, if we invert the equation, we can calculate the distance of every point from the road plane just by using the vertical coordinate of its projection from the image. Hence, for curb’s reference point (Fig. 1) can be rewritten
where is the vertical coordinate of ’s projection in the image.
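The relation described above can be sketched in a few lines of code. This is only a minimal illustration under the section's stated geometry (flat road, image plane perpendicular to the road, camera at a fixed height): a road-plane point at distance Z projects to a vertical image coordinate v = f_y * h / Z measured downwards from the principal point, so Z = f_y * h / v. The function names, parameter names and numeric values are hypothetical and are not taken from the paper.

```python
def image_row_px(distance_cm: float, f_y_px: float, cam_height_cm: float) -> float:
    """Vertical image coordinate (pixels below the principal point) of a
    road-plane point located `distance_cm` ahead of the camera.

    Assumes an ideal pinhole camera whose optical axis is parallel to the road.
    """
    if distance_cm <= 0:
        raise ValueError("the point must lie in front of the camera")
    return f_y_px * cam_height_cm / distance_cm


def ground_distance_cm(v_px: float, f_y_px: float, cam_height_cm: float) -> float:
    """Inverse relation: distance to a road-plane point from its image row.

    `v_px` is measured downwards from the principal point, so it must be
    positive for any point that really lies on the road plane.
    """
    if v_px <= 0:
        raise ValueError("row is on or above the horizon; not a road-plane point")
    return f_y_px * cam_height_cm / v_px


if __name__ == "__main__":
    # Hypothetical values: focal length 400 px, camera 60 cm above the road.
    row = image_row_px(500.0, f_y_px=400.0, cam_height_cm=60.0)
    print(row, ground_distance_cm(row, f_y_px=400.0, cam_height_cm=60.0))
```

Because the focal length and camera height are fixed after calibration, the distance depends on the image row alone, which is the property the text relies on for the curb's reference point.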
III-E Detect curb candidates in an image
Here we describe our approach for detecting a curb in the image. The algorithm is inspired by the boosted attentional cascade paradigm presented in [ViolaJones], but instead of using a cascade of boosted classifiers with gradually increasing complexity, our cascade consists of various filtering techniques. The amount of data being processed is thus greatly reduced by rejecting image regions which do not contain curb features. As a result, only positively classified curb candidates are left for further temporal analysis.
Fig. 8 illustrates our curb detection pipeline, which consists of three consecutive operations:
III-E1 Curb scale-invariant edges extraction
As mentioned earlier, the primary features which we exploit are the curb’s edges (Fig. 3). The straight lines in the image are extracted by the well-known combination of the Canny edge detector [Canny] and the Hough transform (HT) [HoughTransform] applied to the camera images. Also, as described in Section III-B3, the minimal condition for a successful curb detection in the image at time is that both curb’s edges projections and are correctly detected. They define the projection of the curb’s frontal face in the image, on which we focus here.
Let be the vertical spatial sampling rate of the curb’s frontal face in the image
where is the curb’s frontal face projection height in the image. Essentially, provides information about how many sampling locations (pixels) are used by the camera to represent every centimeter of a vertical line from curb’s frontal face, which is located at the distance from the camera. As a consequence from (5), is a non-constant function with respect to , which is demonstrated graphically in Fig. 9a. The figure shows three sample images of the same curb taken at 3 different distances . It is obvious that (marked in the figure) varies significantly. In particular, based on our system’s setup the ratio
i.e. the projection of a curb taken at the distance will contain about 17.5 times more pixel information than the projection of the same curb taken at a distance .
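As a rough illustration of why the sampling rate changes with distance, consider an idealized pinhole model in which the curb's frontal face is parallel to the image plane. The symbols below ($f_y$ for the vertical focal length in pixels, $h_c$ for the curb height, $d$ for the distance, $\Delta v$ for the projected face height and $s$ for the sampling rate) are names assumed for this sketch and may differ from the paper's own notation.

```latex
\Delta v(d) = \frac{f_y \, h_c}{d},
\qquad
s(d) = \frac{\Delta v(d)}{h_c} = \frac{f_y}{d},
\qquad
\frac{s(d_1)}{s(d_2)} = \frac{d_2}{d_1}.
```

Under these assumptions the sampling rate is inversely proportional to the distance, so the ratio between two positions depends only on the quotient of their distances. For example, a hypothetical curb observed at 50 cm and again at 500 cm would be sampled ten times more densely at the closer position.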
Our system relies on detecting the curb’s edges. In principle, edge detection aims at finding the points of discontinuity in the image brightness using either its first- or second-order derivatives. The Canny edge detector, for example, uses first-order operators, such as Sobel [Sobel1968, duda1973pattern], Scharr [jähne1999handbook], Prewitt [lipkin1970picture] or Roberts cross [Roberts63] to estimate brightness gradient magnitude and direction. Such a significant variation of , though, results in a volatile gradient magnitude over the projections of the curb’s edges, and thus in inconsistent curb detection. To overcome this problem, we have derived an image remapping technique, Inverse Perspective-compressing Mapping (IPcM), which aims to equalize the spatial sampling rate of the curb in the image (Fig. 9b). After warping the image, the detection of the curb’s edges tends to be much more stable and robust throughout the entire CDD. Also, the original trapezoidal shapes of the CDD and CSR in the image are transformed into rectangles. A detailed explanation of the derivation of our technique can be found in Appendix V.
The classical Inverse perspective mapping (IPM) normalizes the spatial sampling rate of the road and any other parallel to it plane (Fig. 9c), but not for the orthogonal frontal curb’s face, as can be seen on the figure. Notably, is still dependent on
and varies considerably. The difference is that after the transformation their relation is proportional. Moreover, the IPM transform introduces an additional issue, which can easily be seen in the figure. The output image suffers from gradually increasing interpolation smoothing, mainly along its vertical axis. It is caused by the irregular density of the camera pixels’ sampling locations in the 3D scene (the density of the points in Fig. 21 is variable).
Let be the set of straight lines in the image at time , detected by applying edge detection to the IPcM remapped image and then transforming the lines back to the original image space. is the size of the set and are vectors representing the individual lines from the set by their parametric form: – slope and – intercept. This procedure is expected to produce a significant number of outliers, which is intentional: the purpose of the following processing blocks of the flow diagram (Fig. 8) is to reject everything that is not associated with the curb. As the curb edges are expected to be among the longest ones in the CSR (extending through its full width), we can reduce the processing time by using just the six lines with the highest voting scores from HT (). Other edges that may be present in the image (from sidewalk pavement, tiles, etc.) are usually much shorter and hence have lower voting scores.
Finally, we examine the detected lines as a set of points in their parametric space and try to find clusters. We consider every cluster as a noisy representation of a single edge. Thus, all the members in a cluster are replaced by its mean.
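The selection and clustering steps described above could look roughly as follows. The text does not state which clustering method or thresholds are used, so the greedy grouping and the tolerances `slope_tol` and `intercept_tol` below are assumptions made purely for illustration; lines are represented as hypothetical `(slope, intercept, votes)` tuples.

```python
from statistics import mean


def top_k_by_votes(lines, k=6):
    """Keep the k lines with the highest Hough voting scores.

    `lines` is a list of (slope, intercept, votes) tuples.
    """
    return sorted(lines, key=lambda line: line[2], reverse=True)[:k]


def cluster_lines(lines, slope_tol=0.05, intercept_tol=5.0):
    """Greedily group lines that are close in (slope, intercept) space and
    replace every group by its mean, treating each group as a noisy
    observation of a single edge.

    The tolerances are illustrative assumptions, not values from the paper.
    """
    clusters = []  # each cluster is a list of (slope, intercept) pairs
    for slope, intercept, _votes in lines:
        for cluster in clusters:
            c_slope = mean(p[0] for p in cluster)
            c_intercept = mean(p[1] for p in cluster)
            if (abs(slope - c_slope) <= slope_tol
                    and abs(intercept - c_intercept) <= intercept_tol):
                cluster.append((slope, intercept))
                break
        else:
            # No existing cluster is close enough: start a new one.
            clusters.append([(slope, intercept)])
    return [(mean(p[0] for p in c), mean(p[1] for p in c)) for c in clusters]


if __name__ == "__main__":
    detected = [(0.0, 100.0, 50), (0.01, 102.0, 48), (0.0, 140.0, 45), (0.5, 10.0, 5)]
    strongest = top_k_by_votes(detected, k=3)
    print(cluster_lines(strongest))
```

A greedy pass is order dependent, which is acceptable for a handful of lines; a production system might prefer a proper clustering algorithm.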
III-E2 Forming curb candidates set
We construct two new sets – and , which contain all possible combinations of 2- and 3-tuples, respectively, of non-intersecting lines from , i.e.
where are the homogeneous representations of the lines , is the camera’s image at time and are the sizes of the two sets. The lines in every tuple are sorted in ascending order of their vertical placement in the image, i.e. in descending order of the lines’ intercept parameter (note that the vertical image axis points downwards).
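One possible way to enumerate such tuples is sketched below. It assumes that "non-intersecting" means the lines do not cross inside the image rectangle, and it represents each line as a hypothetical `(slope, intercept)` pair in pixel coordinates; both choices are assumptions of this example rather than details confirmed by the text.

```python
from itertools import combinations


def intersect_inside(line_a, line_b, width, height):
    """True if two lines y = a*x + b cross at a point inside the image."""
    (a1, b1), (a2, b2) = line_a, line_b
    if a1 == a2:
        return False  # parallel lines never cross
    x = (b2 - b1) / (a1 - a2)
    y = a1 * x + b1
    return 0 <= x < width and 0 <= y < height


def line_tuples(lines, size, width, height):
    """All `size`-tuples of lines that are pairwise non-intersecting inside
    the image, each sorted by descending intercept (bottom of the image
    first, since the vertical image axis points downwards)."""
    result = []
    for combo in combinations(lines, size):
        if any(intersect_inside(p, q, width, height)
               for p, q in combinations(combo, 2)):
            continue
        result.append(tuple(sorted(combo, key=lambda l: l[1], reverse=True)))
    return result


if __name__ == "__main__":
    lines = [(0.0, 300.0), (0.0, 280.0), (0.02, 250.0), (-1.0, 600.0)]
    print(line_tuples(lines, 2, 640, 480))
    print(line_tuples(lines, 3, 640, 480))
```

With at most six input lines there are no more than 15 pairs and 20 triplets, so exhaustive enumeration is cheap.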
A tuple of three lines is sufficient to estimate all of the four curb template parameters (2), rather than the 2-lines case, where we omit in favor of estimating the other three more important parameters: , and . In order to shorten our presentation we are going to describe the triplet case only, because it is more general, complex and could be considered as a special case subset of .
The next step is fitting the curb template’s projection in the image to each individual triplet in . Every line from -th triplet is explicitly related to a specific curb’s edge, because they will always appear in the image in the same vertical order. Therefore, represents , and . The objective of the fitting is to estimate the optimal values of the parameters , such that curb’s template edges projections in the image are aligned to their corresponding lines from the triplet in the best possible way.
We evaluate the similarity of two lines by calculating the Euclidean distance between two pairs of corresponding points laying on each of them, which we call control points. Hence, to fit the template to a triplet of lines we need to use six pairs of control points. In order to define their position, we need to introduce the planes and and their projections in the image and (Fig. 10a). They intersect in and and . Now curb template’s control points are defined by intersecting its edge lines with and , which results in two point triplets: and – Fig. 11. The calculation of their coordinates in camera’s coordinate frame is significantly simplified due to systems geometrical setup, namely
Afterwards, we can estimate their projections in the image as follows
where and is the camera matrix, and are camera’s principal point coordinates along the horizontal and vertical axes, respectively. It should be noted that since , and are constants, is function only of the curb parameters vector .
The estimation of the positions of the -th triplet control points (we will call them target control points) in the image is demonstrated in Fig. 10a. It follows the assumptions about the location of the template’s control points in the image and uses the properties of perspective projection. First, we intersect with and thus estimate the control points and (depicted with cyan crosses in the figure).
We know that curb’s front face is vertical, i.e. parallel to camera’s image plane. Therefore, we find and by intersecting with the two vertical lines and from the figure which pass through and , i.e.
In Fig. 10 these control points are depicted by magenta crosses.
Camera’s principal point is the vanishing point (center of perspective), where the projections in the image of all lines parallel to converge. Hence, the lines and in Fig. 10a that constitute the projections of the intersections of curb’s top face and the planes will pass through it and we can estimate the positions of the last two target control points and as follows
They are depicted by the green crosses in the figure.
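The three constructions above (intersection with a side boundary, a vertical line through the base point, and a line towards the principal point) can all be expressed with homogeneous coordinates, where both the line through two points and the intersection of two lines are cross products. The sketch below shows one side of the construction only; the function names, argument names and sample numbers are hypothetical, and the exact procedure used by the authors may differ.

```python
def cross(u, v):
    """Cross product of two 3-vectors given as tuples."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def line_through(p, q):
    """Homogeneous line through two image points (x, y)."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def intersect(l1, l2):
    """Intersection point (x, y) of two homogeneous lines."""
    x, y, w = cross(l1, l2)
    if abs(w) < 1e-12:
        raise ValueError("lines are parallel")
    return (x / w, y / w)


def target_control_points(l_base, l_front, l_rear, side_line, principal_point):
    """Control points on the base, upper-front and rear edge projections
    for one side of the detection domain.

    l_base, l_front, l_rear and side_line are homogeneous lines (a, b, c)
    with a*x + b*y + c = 0.
    """
    p_base = intersect(l_base, side_line)
    # The front face is assumed vertical, so the upper-front point lies
    # directly above the base point in the image.
    vertical = (1.0, 0.0, -p_base[0])
    p_front = intersect(l_front, vertical)
    # The top face edge runs parallel to the optical axis, so its
    # projection heads towards the principal point (the vanishing point).
    p_rear = intersect(l_rear, line_through(p_front, principal_point))
    return p_base, p_front, p_rear


if __name__ == "__main__":
    pp = (320.0, 240.0)
    side = line_through(pp, (480.0, 400.0))
    horizontal = lambda y: (0.0, -1.0, y)  # the line y = const
    print(target_control_points(horizontal(400.0), horizontal(360.0),
                                horizontal(340.0), side, pp))
```

Working in homogeneous form avoids special cases for vertical lines, which a slope-intercept representation cannot express.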
After we have derived the equations of all control points in the image, we can define the objective function, which we are going to optimize in order to fit the curb template’s projection in the image to the lines of -th triplet from . First, let the distance between two corresponding control points in the image be the -norm for their difference, i.e.
where . Then, we construct the error vector
and the curb template’s parameters that produce the best fit to the -th triplet are determined as follows
where is a normalization term, which regularizes the dependence between the template’s re-projection error in the image and the distance . As minimizing the L2-norm of a vector is equivalent to minimizing the sum of its squared elements, and we have a closed-form differentiable expression for , we can employ the Levenberg-Marquardt optimization algorithm [LMA1, LMA2]. Fig. 10 illustrates an example of a successfully optimized (fitted) curb template.
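To make the fitting step concrete, the sketch below runs a small, self-contained Levenberg-Marquardt loop on a toy version of the problem. Everything in it is an assumption made for illustration: the template is reduced to two parameters (distance `d` and height `h`), the projection is a plain pinhole model, the Jacobian is obtained by finite differences rather than from the closed-form expression, and the normalization term is omitted.

```python
def solve_linear(a, b):
    """Solve a*x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def levenberg_marquardt(residuals, theta0, iters=100, lam=1e-3, h=1e-6):
    """Minimize the sum of squared residuals; forward-difference Jacobian."""
    theta = list(theta0)
    r = residuals(theta)
    cost = sum(v * v for v in r)
    n = len(theta)
    for _ in range(iters):
        jac = []  # jac[j][i] = d r_i / d theta_j
        for j in range(n):
            t = theta[:]
            t[j] += h
            jac.append([(a - b) / h for a, b in zip(residuals(t), r)])
        jtj = [[sum(x * y for x, y in zip(jac[i], jac[j])) for j in range(n)]
               for i in range(n)]
        jtr = [sum(x * y for x, y in zip(jac[i], r)) for i in range(n)]
        # Marquardt damping: scale the diagonal of J^T J by (1 + lambda).
        damped = [[jtj[i][j] + (lam * jtj[i][i] + 1e-12 if i == j else 0.0)
                   for j in range(n)] for i in range(n)]
        step = solve_linear(damped, [-g for g in jtr])
        if max(abs(s) for s in step) < 1e-12:
            break
        trial = [t + s for t, s in zip(theta, step)]
        r_trial = residuals(trial)
        cost_trial = sum(v * v for v in r_trial)
        if cost_trial < cost:   # accept the step and trust the model more
            theta, r, cost = trial, r_trial, cost_trial
            lam *= 0.1
        else:                   # reject the step and increase the damping
            lam *= 10.0
    return theta


F, CX, CY, CAM_HEIGHT = 400.0, 320.0, 240.0, 1.0  # assumed constants


def template_points(theta):
    """Image positions of four toy control points for parameters (d, h)."""
    d, h = theta
    pts = []
    for x in (-0.5, 0.5):
        pts.append((F * x / d + CX, F * CAM_HEIGHT / d + CY))        # bottom edge
        pts.append((F * x / d + CX, F * (CAM_HEIGHT - h) / d + CY))  # top edge
    return pts


targets = template_points((2.0, 0.15))  # synthetic, noise-free targets


def error_vector(theta):
    out = []
    for (u, v), (ut, vt) in zip(template_points(theta), targets):
        out.extend((u - ut, v - vt))
    return out


fitted = levenberg_marquardt(error_vector, (3.0, 0.1))
```

On this noise-free toy problem the loop recovers the parameters used to generate the targets; with real line detections the residual at the optimum would not vanish.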
After fitting the curb template to all the line tuples in and , we build the curb candidates set
The next task is rejecting the outliers from . The first level of filtering is described in the next section, where curb candidates are rejected based on their appearance in the image at time . Afterwards, among the “survivors”, only the ones whose parameters follow the prediction based on temporal analysis of the previous frames are selected. This procedure is described in Section III-F.
III-E3 Appearance-based curb candidates filtering
It is very unlikely that a trustworthy curb detection could be accomplished by employing only the straight curb edges from the image, since they are not informative enough. That is, relying only on the curb’s geometry does not bring the entropy down to a level at which the algorithm can reliably discriminate between curb and non-curb shapes. At this stage we try to reject the outliers in by exploiting the curb’s appearance in the image. Thus, we have built an object detector based on a Support Vector Machine (SVM) and Histograms of Oriented Gradients (HOG) features [HOG].
Curbs occupy areas in the image which have the shape of thin, mostly horizontal, stripes. They are characterized by a couple of distinctive transitions of pixel brightness along the vertical axis, caused by the differences in the reflecting/scattering properties of the individual curb surfaces. For curb detection in images HOG is a suitable descriptor, because it accounts for the direction of those transitions, and a machine learning algorithm can be trained to discriminate between curb and non-curb image patterns. Moreover, HOG is also known for its computational efficiency.
We extract the HOG features from the IPcM image, because the vertical size of the curb is invariant to . Fig. 12 illustrates our sampling approach. Seven uniformly distributed square windows are sampled along the curb template’s frontal face. Only the two frontal curb edges are needed to accomplish that operation. The size of the patches is determined by the distance between the two edge lines in the vertical direction (depicted by in the figure). Each window is scaled down to a fixed size of px – Fig. 13. In order to eliminate the influence of the high-frequency components, we apply a smoothing Gaussian filter to the windows before calculating their HOG features. In the end, the pixel information of each window is converted to a 288-dimensional HOG feature vector, which is fed to the pre-trained linear SVM classifier [liblinear]. Fig. 13 illustrates examples of a positive (a) and a negative (b) window. Even an inexperienced human eye can easily notice the considerable difference between the gradient histograms of the two samples.
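The window sampling can be sketched as follows. This is only one plausible reading of "uniformly distributed": it assumes that the two frontal edges are horizontal in the IPcM image, that the first and last windows touch the ends of the visible face, and that the function name and arguments are hypothetical.

```python
def sample_windows(x_start, x_end, y_top, y_bottom, count=7):
    """Return `count` square windows as (left, top, size) tuples.

    The window size equals the vertical distance between the two frontal
    edge lines; the windows are spaced evenly between x_start and x_end.
    """
    size = y_bottom - y_top
    if size <= 0:
        raise ValueError("the bottom edge must lie below the top edge")
    span = (x_end - x_start) - size
    if span < 0:
        raise ValueError("the face is narrower than a single window")
    step = span / (count - 1) if count > 1 else 0.0
    return [(x_start + i * step, y_top, size) for i in range(count)]


# Hypothetical frontal face: 130 px wide and 10 px tall in the IPcM image.
windows = sample_windows(0.0, 130.0, 50.0, 60.0)
```

Each returned window would then be rescaled to the fixed patch size, smoothed, and passed to the HOG descriptor.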
We define our classification problem in a Multiple Instance Learning manner. The set of the seven feature vectors sampled from the same curb template from forms a bag of instances . We define two types of bags – positive () and negative (). If the majority of the instances in a bag are positive, then that bag is considered positive. Respectively, if the majority of the instances are negative, then the bag is also negative. Thus, our classifier will be more robust against outliers in the bags. In other words, instances that represent a curb, but are sampled from regions that contain vertical cracks or joints, are expected to be placed far from the other instances of the same class in the feature space. The curb candidates from that are approved by this classification procedure form the final in-frame curb candidates set
where is its size. Note that if , the curb detection is unsuccessful and thus, we accept that such frames do not contain curbs.
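The majority-vote rule over a bag of instances can be sketched as below. The sketch assumes that each instance has already been scored by the linear SVM (a signed decision value, positive meaning "curb"); the function names, the zero threshold, and the example scores are illustrative assumptions.

```python
def classify_bag(instance_scores, threshold=0.0):
    """A bag is positive if more than half of its instances score above threshold."""
    positives = sum(1 for s in instance_scores if s > threshold)
    return positives > len(instance_scores) / 2


def filter_candidates(candidates, bag_scores):
    """Keep the curb candidates whose bag of windows is classified as positive."""
    return [c for c, scores in zip(candidates, bag_scores) if classify_bag(scores)]


# Two hypothetical candidates with seven window scores each. The first has two
# outlier windows (for example, sampled over a crack) but still passes the vote.
candidates = ["candidate_a", "candidate_b"]
bag_scores = [
    [1.2, 0.8, -0.4, 0.9, 1.1, -0.2, 0.7],
    [-0.9, 0.3, -1.1, -0.5, 0.6, -0.8, -1.3],
]
accepted = filter_candidates(candidates, bag_scores)
```

If `accepted` is empty, the frame is treated as containing no curb.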
III-F Tracking the curb through time
Here we present our scheme for curb tracking in the time domain on a frame-to-frame basis. The reasoning here is based on the assumption that false curb candidates are the result of faulty line detections, which occur because of the noisy output of the Canny operator, and of false positive classifications by the HOG+SVM. The significant local contrast in the small details produced by the HDR camera we use contributes to the first cause. Therefore, we can assume that those faulty detections do not follow a predictable and smooth pattern in the time domain.
We have prior knowledge regarding the nature of the evolution of the curb parameters (1). Namely, we know that and are constants and and evolve smoothly over time, since the vehicle is a physical object whose motion is a continuous function of time. Thus, we model them as autoregressive processes.
The temporal filtering we apply has two alternating modes: “Collecting initial prediction set” and “Curb tracking”.
During the first one, the successful curb candidates set of the current frame at time is appended to a finite length buffer . When the critical minimum for tracking is reached (5 consecutive successful frames), curb parameters prediction lines are estimated and then the mode is switched to “Curb tracking”. I.e. , where .
The prediction lines are used to predict the future system state and to smooth the measurements of the current state, by taking into account the previous system states. We assume that within a short span of time (for example 7 frames) the evolution curves of the curb parameters have mainly linear character. Each curb parameter has its individual prediction line –
. Their parameters are estimated by linear regression applied to the corresponding elements of the curb candidate vectors of every possible combination of 5 candidates , sampled from . The combination which has the minimal fitting error is chosen to be the prediction samples set at time .
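The exhaustive search over candidate combinations can be sketched as follows. The sketch makes several assumptions that are not stated in the text: the buffer is a flat list of `(t, params)` pairs, each parameter gets an ordinary least-squares line over time, and the fitting error of a combination is the unweighted sum of squared residuals over all parameters.

```python
from itertools import combinations


def fit_line(ts, ys):
    """Ordinary least-squares line y = slope * t + intercept and its squared error."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    var_t = sum((t - mean_t) ** 2 for t in ts)
    slope = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys)) / var_t
    intercept = mean_y - slope * mean_t
    err = sum((slope * t + intercept - y) ** 2 for t, y in zip(ts, ys))
    return slope, intercept, err


def best_prediction_set(buffer, k=5):
    """Return (error, samples, lines) for the k-combination with minimal error."""
    best = None
    for combo in combinations(buffer, k):
        ts = [t for t, _ in combo]
        if len(set(ts)) < 2:      # a line needs at least two distinct times
            continue
        lines, total = [], 0.0
        for j in range(len(combo[0][1])):
            slope, intercept, err = fit_line(ts, [p[j] for _, p in combo])
            lines.append((slope, intercept))
            total += err
        if best is None or total < best[0]:
            best = (total, combo, lines)
    return best


# Hypothetical buffer of (time, (distance, height)) candidates: five inliers
# approaching the curb at 0.1 m per frame and one outlier at t = 2.
buffer = [(t, (3.0 - 0.1 * t, 0.15)) for t in range(5)] + [(2, (5.0, 0.4))]
error, samples, lines = best_prediction_set(buffer)
```

The number of combinations grows quickly with the buffer length, so a search of this kind is only practical for a short, finite buffer.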
Now, let us assume that the system processing mode at the current time is “Curb tracking”. The flow chart in Fig. 14 presents it graphically. The first operation is predicting the system state at time
based only on the information from the previous frames – . This gives us information about the approximate position of the curb in the current frame, so the length and position of the CSR can be updated accordingly before detecting the curb candidates in the second operation.
Then the current frame’s prediction set is obtained by selecting from the curb candidate closest to the prediction and appending it to . The last step is estimating the prediction lines set for the current frame and the smoothed (filtered) version of the curb state , which supposedly contains much less noise than the individual measurements.
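One iteration of the tracking mode, prediction followed by selection of the closest candidate, can be sketched as below. The prediction lines are assumed to be `(slope, intercept)` pairs, one per curb parameter, and closeness is measured with an unweighted squared Euclidean distance in parameter space. Both are illustrative assumptions, as are the function names.

```python
def predict_state(lines, t):
    """Evaluate every prediction line (slope, intercept) at time t."""
    return tuple(slope * t + intercept for slope, intercept in lines)


def select_closest(candidates, predicted):
    """Return the candidate nearest to the predicted state, or None if there is none."""
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(c, predicted)),
    )


# Hypothetical prediction lines for (distance, height) and two candidates
# detected in the current frame at t = 5.
lines = [(-0.1, 3.0), (0.0, 0.15)]
predicted = predict_state(lines, 5)
chosen = select_closest([(2.52, 0.16), (4.0, 0.3)], predicted)
```

The chosen candidate would then be appended to the prediction set before the prediction lines are re-estimated for the current frame.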
IV Experimental Results
IV-A Video dataset
We have collected a dataset which consists of 11 videos captured with a monocular forward-view fisheye HDR camera in typical forward perpendicular parking situations during the bright part of the day in a natural lighting environment – Table I. All but one of the videos contain a single sidewalk curb. Only video sequence 8 has two perpendicular curbs present in the CCD. Two distinct weather conditions were present at the time of data collection – clear/sunny, which is characterized by sharp deep shadows and bright highlights creating unreal edges in the image, and shadow (overcast), which is characterized by soft shadows that smoothly grade to highlights. In sequences 2, 3, 8 and 11 the front bottom curb edge is fully or partially obstructed by tree leaves and different kinds of debris. The camera frame rate is approximately 21 fps and the original resolution is pixels.
The distance ground truth (GT) data is collected by means of a point-wise LIDAR sensor. The height and the depth of the curb are measured manually with a ruler. No direct ground truth measurements are made of the curb’s rotation angle . It has been implicitly estimated through a manually fitted template to the curb appearances in the video frames. The maximum labeling distance is 5 m.
|Vid. seq. #||Curb height (cm)||Curb depth (cm)||Weather conditions||Curb/road physical properties||Frames count|
Total number of frames in the sequence / number of frames with the curb present in the CCD