Obstacle detection is of central importance for lower-end small unmanned surface vehicles (USVs) used for patrolling coastal waters (see Figure 1). Such vehicles are typically used in perimeter surveillance, in which the USV travels along a pre-planned path. To respond quickly and efficiently to the challenges of a highly dynamic environment, the USV requires onboard logic to observe its surroundings, detect potentially dangerous situations, and apply proper route modifications. An important feature of such a vessel is the ability to detect an obstacle at a sufficient distance and react by replanning its path to avoid collision. The primary type of obstacle in this case is the shoreline itself, which can be avoided to some extent (although not fully) by the use of detailed maps and satellite navigation. Indeed, Heidarsson and Sukhatme proposed an approach that utilizes an overhead image of the area obtained from Google Maps to construct a map of static obstacles. But such an approach cannot handle the more difficult class of dynamic obstacles that do not appear in the map (e.g., boats, buoys and swimmers).
A small USV requires the ability to detect nearby as well as distant obstacles. The detection should not be constrained to objects that stand out from the water, but should also cover flat objects, such as debris or emerging scuba divers. Operation in shallow waters and marinas constrains the size of the USV and prevents the use of additional stabilizers. This puts further constraints on the weight, power consumption, types of sensors and their placement. Cameras are therefore becoming attractive sensors for use in low-end USVs due to their cost, weight and power efficiency and their large field-of-view coverage. This presents a challenge for the development of highly efficient computer vision algorithms tailored for obstacle detection in the challenging environments that small USVs face. In this paper we address this challenge by proposing a segmentation-based algorithm for obstacle-map estimation that is derived from optimizing a new well-defined graphical model and runs at over 70 fps in Matlab on a single-core machine.
1.1 Related work
The problem of obstacle detection has been explicitly or implicitly addressed previously in the field of unmanned ground vehicles (UGV). In a trail-following application, Rasmussen et al. use an omnidirectional camera to detect the trail as the region that is most contrasted to its surroundings; dynamic obstacles, however, are not addressed. Several works, e.g., Montemerlo et al. and Dahlkamp et al., address the problem of low-proximity road detection with laser scanners by bootstrapping color segmentation with the laser output. The proximal road points are detected by the laser, projected to the camera and used to learn a Gaussian mixture model, which is in turn used to segment the rest of the image captured by the camera. Combined with the horizon detection of Ettinger et al., this approach significantly increases the distance at which obstacles on the road can be detected. Alternatively, Lu and Rasmussen cast obstacle detection as a labelling task in which they apply a bank of pre-trained classifiers to 3D point clouds and use a Markov random field to account for the spatial smoothness of the labelling.
Most UGV approaches for obstacle detection explicitly or implicitly rely on ground plane estimation from range sensors and are not directly applicable to the aquatic environments encountered by USVs. Rankin et al. propose a specific body-of-water detector for wide open areas from a UGV using a monocular color camera. Their detector assumes that, in an undisturbed water surface, a change in the saturation-to-brightness ratio across a water body from the leading to the trailing edge is uniform and distinct from other terrain types. They apply several ad-hoc processing steps to gradually grow the water regions from the initial candidates and apply a sequence of pre-set thresholds to remove spurious false detections of water pools. However, their method is based on the undisturbed water surface assumption, which is violated in coastal and open water applications. Scherer et al. propose a water detection algorithm using a Bumblebee stereo camera, IMU/GPS and a rotating laser scanner for navigation on a river. Their system extracts color and texture features over blocks of pixels and eliminates the sky region using a pre-trained classifier. A horizon line, obtained from the onboard IMU, is then projected into the image to obtain samples for learning a color distribution of the regions below and above the horizon, respectively. Using these distributions, the image is segmented and the results of the segmentation are used in turn, after additional postprocessing steps, to train a classifier. The trained classifier is fused with a classifier from the previous frames and applied to the blocks of pixels to detect the water region. This system relies heavily on the quality of the hardware-based horizon estimation, the accuracy of the pre-trained sky detector and the postprocessing steps. The authors report that the vision-based segmentation is not processed onboard, but requires special computing hardware, which rules out real-time segmentation at the constrained processing power typical for small USVs.
Some of the standard range sensor modalities for autonomous navigation in maritime environments include radar, sonar and ladar. Range scanners are known to discriminate poorly between water and land in the far field, suffer from angular resolution and scanning rate limitations, and perform poorly when the beam's incidence angle is not oblique with respect to the water surface [12, 13]. Several researchers have thus resorted to cameras, e.g., [14, 15, 7, 16, 17, 13], for obstacle and moving object detection instead. To detect dynamic objects in a harbor, Socek et al. assume a static camera and apply background subtraction combined with motion cues. However, background subtraction cannot be applied to the highly dynamic scenes encountered on a moving USV. Huntsberger et al. attempt to address this issue using stereo systems, but require large-baseline rigs that are less appropriate for small vessels due to increased instability and that limit the processing of near-field regions. Santana et al. apply a fusion of Lucas-Kanade local trackers with color oversegmentation and a sequence of k-means clusterings on texture features to detect water regions in videos. Alternatively, Fefilatyev and Goldgof and Wang et al. apply a low-power solution using a monocular camera for obstacle detection. They first detect the horizon line and then search for a potential obstacle in the region below the horizon. A fundamental drawback of these approaches is that they approximate the edge of water by a horizon line and cannot handle situations in coastal waters, close to the shoreline or in a marina. In such situations, the edge of water no longer corresponds to the horizon and can no longer be modeled as a straight line. Such cases call for more general segmentation approaches.
Many unsupervised segmentation approaches have been proposed in the literature. Khan and Shah use optical flow, color and spatial coordinates to construct features which are used in single Gaussians to segment a moving object in video. Nguyen and Wu propose Student-t mixture models for robustifying segmentation. Improved segmentation can be achieved by applying a Bayesian regularization scheme in Gaussian mixture models; however, care has to be taken at initialization. Felzenszwalb and Huttenlocher have proposed graph-theoretic clustering to segment color images into visually coherent regions. The assumption that neighboring pixels likely belong to the same class is formally addressed in the context of Markov random fields (MRF) [22, 23]. By constraining the solutions of the segmentations to mimic the high-level semantics of urban scenes, Felzenszwalb and Veksler proposed a three-strip segmentation algorithm that can be implemented by a dynamic program. Wojek and Schiele have extended conditional random fields with dynamic models and perform the inference for object detection and labeling jointly in videos. The random field frameworks have proven quite successful for addressing semantic labeling tasks, and recently Kontschieder et al. have shown that structural priors between classes further improve the labeling. Alternative schemes that avoid applying an MRF to enforce spatial consistency have been proposed, e.g., by Chen et al. and Nguyen et al. Approaches like that of Wojek et al. use high-dimensional features composed of color and texture at multiple scales and object-class specific detectors to segment the images and detect the objects of interest. In our scenarios, the possible types of dynamic obstacles are unknown and vary significantly in appearance. Object-class specific detectors are thus not suitable. Several bottom-up graph-theoretic approaches have been proposed for unsupervised segmentation, e.g., [30, 31, 32, 33].
Recently, Alpert et al. have proposed an approach that starts from the pixel level and gradually constructs visually homogeneous regions by agglomerative clustering. They achieved impressive results on a segmentation dataset in which an object occupied a significant portion of the image. Unfortunately, since their algorithm incrementally merges regions, it is too slow for online application even at moderate image sizes. An alternative to starting the segmentation from the pixel level is to start from an oversegmented image in which pixels are grouped into superpixels. Lu et al. apply spectral clustering to an affinity graph induced over a superpixelated image. Li et al. have proposed a segmentation algorithm that uses multiple superpixel oversegmentations and merges their results by bipartite graph partitioning to achieve state-of-the-art results on a standard segmentation dataset. However, no prior information is provided to favor certain types of segmentations in specific scenes.
1.2 Our approach
We pursue a solution for obstacle detection that is based on concepts of image segmentation with weak semantic priors on the expected scene composition. Figure 2 shows typical images captured from a USV. While the images vary significantly in appearance, we observe that each image can be split into three semantic regions roughly stacked one above the other, implying a structural relation between the regions. The bottom region represents the water, while the top region represents the sky. The middle component can represent land, parked boats, a haze above the horizon, or a mixture of these.
Our main contribution is a graphical model for structurally-constrained semantic segmentation with application to USV obstacle-map estimation. The generative model assumes a mixture model with three Gaussian components for the three dominant image regions and a uniform component for explaining the outliers, which may constitute an obstacle in the water. Weak priors are assumed on the mixture parameters and an MRF is placed over the prior as well as the posterior pixel-class distributions to favor smooth segmentations. We derive an EM algorithm for the proposed model and show that the resulting optimization achieves fast convergence at a low computational cost, without resorting to specialized hardware. A similar graphical model was proposed by Diplaros et al., but their model requires a manually set variable, does not apply priors and is not derived from a single density function. Our model is applied to obstacle image-map estimation in USVs. The proposed model acts directly on the color image and does not require expensive extraction of texture-based features. Combined with efficient optimization, this results in real-time segmentation and obstacle-map estimation (several-fold faster than the camera frame rate). Our approach is outlined in Figure 1. The semantic model is fitted to the input image, after which each pixel is classified into one of the four classes. All the pixels that do not correspond to the water component are deemed to be part of an obstacle. Figure 1 shows a detection of a dynamic obstacle (buoy) and of a static obstacle (shoreline). Our second contribution is a marine dataset for semantic segmentation and obstacle detection, together with the performance evaluation methodology. To our knowledge this will be the largest annotated publicly available marine dataset of its kind to date.
A preliminary version of our algorithm was presented in Kristan et al. and is extended in this paper on several levels. Additional discussion and related work are provided. An improved initialization of the segmentation model by soft resets of the parameters is proposed, and additional details of the algorithm and the dataset are provided. In particular, the dataset capturing procedure and annotation are discussed and additional statistics of the obstacles in the dataset are provided. The experiments are extended by a performance analysis with respect to the color space, the obstacle size and the time-of-day driving conditions. The learning of the priors used in our model is discussed in detail and the dataset is extended with training images used for estimating the priors.
Our approach is most closely related to the works in urban-scene parsing by Felzenszwalb and Veksler and in maritime scene understanding by Fefilatyev and Goldgof, Wang et al., and Scherer et al. There are notable differences between these approaches and ours. The first difference to the approach of Felzenszwalb and Veksler is that they only address the labeling part of the segmentation problem and require precomputed per-pixel label confidences. The second difference is that their approach produces segmentations with a homogeneous bottom region, which prevents detection of obstacles without further postprocessing. In contrast, our approach jointly learns the component appearance, estimates the per-pixel class probabilities, and optimizes the segmentation within a single online framework. Furthermore, learning the parameters of their model is not as straightforward. Compared to the related water segmentation algorithms for maritime applications (i.e., [15, 16, 8]), our approach completely avoids the need for a good horizon estimate. Nevertheless, the proposed probabilistic model is general enough to directly incorporate this information if available.
The remainder of the paper is structured as follows. In Section 2 we derive our semantic generative model, in Section 3 we present the obstacle detection algorithm, in Section 4 we detail the implementation and learning of the priors, in Section 5 we present the new dataset and the accompanying evaluation protocol, in Section 6 we experimentally analyze the algorithm and draw conclusions in Section 7.
2 The semantic generative model
We consider the image as an array of measured values $\mathbf{Y} = \{\mathbf{y}_i\}_{i=1:M}$, in which $\mathbf{y}_i \in \mathbb{R}^d$ is a $d$-dimensional measurement, a feature vector, at the $i$-th pixel in an image with $M$ pixels. As we detail in the subsequent sections, the feature vector is composed of the pixel's color and image coordinates. The probability of the $i$-th pixel feature vector is modelled as a mixture model with four components, three Gaussians and a single uniform component:
$$p(\mathbf{y}_i | \Theta, \boldsymbol{\pi}_i) = \sum_{k=1}^{3} \phi(\mathbf{y}_i | \mu_k, \Sigma_k)\, \pi_{ik} + \mathcal{U}(\mathbf{y}_i)\, \pi_{i4}, \tag{1}$$
where $\Theta = \{\mu_k, \Sigma_k\}_{k=1:3}$ are the means and covariances of the Gaussian kernels $\phi(\cdot | \mu_k, \Sigma_k)$ and $\mathcal{U}(\cdot)$ is a uniform distribution. The $i$-th pixel label $x_i$ is an unobserved random variable governed by the class prior distribution $\boldsymbol{\pi}_i = [\pi_{i1}, \ldots, \pi_{i4}]$ with $\pi_{ik} = p(x_i = k)$. The three Gaussian components represent the three dominant semantic regions in the image, while the uniform component represents the outliers, i.e., pixels that do not likely correspond to any of the three structures. To encourage segmentations into three approximately vertically aligned semantic structures, we define a set of priors $\varphi_0 = \{\mu_{\mu_k}, \Sigma_{\mu_k}\}_{k=1:3}$ for the mean values of the Gaussians, i.e., $p(\Theta | \varphi_0) \propto \prod_{k=1}^{3} \phi(\mu_k | \mu_{\mu_k}, \Sigma_{\mu_k})$. To encourage smooth segmentations, the priors $\boldsymbol{\pi}_i$ as well as the posteriors over the pixel class labels are treated as random variables, which form a Markov random field. Imposing the MRF on the priors and posteriors rather than on the pixel labels allows effectively integrating out the labels, which leads to a well-behaved class of MRFs that avoid image reconstruction during parameter learning. The resulting graphical model with priors is shown in Figure 3.
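As an illustration of how the four-component mixture assigns a pixel to a class, the following minimal sketch evaluates three Gaussians plus a uniform outlier component and applies Bayes' rule. It is not the authors' implementation: diagonal covariances, the two-dimensional toy features and all numeric values are assumptions made only to keep the example short.

```python
import math

def gaussian_pdf(y, mean, var):
    """Gaussian density with a diagonal covariance (an assumption of this sketch)."""
    log_p = 0.0
    for y_d, m_d, v_d in zip(y, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * v_d) + (y_d - m_d) ** 2 / v_d)
    return math.exp(log_p)

def mixture_terms(y, means, variances, prior, uniform_density):
    """Weighted terms of the mixture: three Gaussians and one uniform component."""
    terms = [gaussian_pdf(y, m, v) * p
             for m, v, p in zip(means, variances, prior[:3])]
    terms.append(uniform_density * prior[3])
    return terms

def posterior(terms):
    """Per-pixel class posterior obtained by normalising the weighted terms."""
    total = sum(terms)
    return [t / total for t in terms]

# Hypothetical toy features: (vertical position, one intensity channel).
means = [(0.1, 0.9), (0.5, 0.5), (0.9, 0.2)]
variances = [(0.02, 0.02)] * 3
prior = [0.33, 0.33, 0.33, 0.01]

# A pixel close to the third (bottom) component.
inlier_post = posterior(mixture_terms((0.85, 0.25), means, variances, prior, 1.0))
# A pixel far from every Gaussian is explained by the uniform component.
outlier_post = posterior(mixture_terms((0.1, 5.0), means, variances, prior, 1.0))
```

In this toy setting the first pixel is assigned to the bottom component, while the second one, which no Gaussian explains, falls to the uniform outlier component.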
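Let $\boldsymbol{\pi} = \{\boldsymbol{\pi}_i\}_{i=1:M}$ denote the set of priors for all pixels. Following prior work, we approximate the joint distribution over the priors as $p(\boldsymbol{\pi}) \approx \prod_{i} p(\boldsymbol{\pi}_i | \boldsymbol{\pi}_{N_i})$, where $\boldsymbol{\pi}_{N_i}$ is a mixture distribution over the priors of the $i$-th pixel's neighbors, i.e., $\boldsymbol{\pi}_{N_i} = \sum_{j \in N_i, j \neq i} \lambda_{ij} \boldsymbol{\pi}_j$, where $\lambda_{ij}$ are fixed positive weights such that for each $i$-th pixel $\sum_{j} \lambda_{ij} = 1$. The potentials in the MRF are defined as
$$p(\boldsymbol{\pi}_i | \boldsymbol{\pi}_{N_i}) \propto \exp\left(-\tfrac{1}{2} E(\boldsymbol{\pi}_i, \boldsymbol{\pi}_{N_i})\right), \tag{2}$$
with the exponent defined as
$$E(\boldsymbol{\pi}_i, \boldsymbol{\pi}_{N_i}) = D(\boldsymbol{\pi}_i \| \boldsymbol{\pi}_{N_i}) + H(\boldsymbol{\pi}_i). \tag{3}$$
The term $D(\boldsymbol{\pi}_i \| \boldsymbol{\pi}_{N_i})$ is the Kullback-Leibler divergence, which penalizes the differences between the prior distributions over the neighboring pixels ($\boldsymbol{\pi}_i$ and $\boldsymbol{\pi}_{N_i}$), while the term $H(\boldsymbol{\pi}_i)$ is the entropy defined as
$$H(\boldsymbol{\pi}_i) = -\sum_{k} \pi_{ik} \log \pi_{ik}, \tag{4}$$
which penalizes uninformative priors $\boldsymbol{\pi}_i$. The joint distribution for the graphical model in Figure 3 can be written as
$$p(\mathbf{Y}, \Theta, \boldsymbol{\pi} | \varphi_0) = p(\Theta | \varphi_0) \prod_{i=1}^{M} p(\mathbf{y}_i | \Theta, \boldsymbol{\pi}_i)\, p(\boldsymbol{\pi}_i | \boldsymbol{\pi}_{N_i}). \tag{5}$$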
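To make the behaviour of the MRF energy concrete, the sketch below computes the neighborhood mixture, the Kullback-Leibler divergence and the entropy for a single pixel. The neighbor distributions and weights are hypothetical values chosen for illustration only.

```python
import math

def neighborhood_mixture(neighbor_priors, weights):
    """Weighted mixture of the neighbouring priors; the weights must sum to one."""
    n_classes = len(neighbor_priors[0])
    return [sum(w * p[k] for w, p in zip(weights, neighbor_priors))
            for k in range(n_classes)]

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for discrete distributions."""
    return sum(p_k * math.log(p_k / q_k) for p_k, q_k in zip(p, q) if p_k > 0.0)

def entropy(p):
    """Shannon entropy H(p)."""
    return -sum(p_k * math.log(p_k) for p_k in p if p_k > 0.0)

def energy(prior, neighborhood_prior):
    """Energy of one pixel: divergence from the neighbourhood plus entropy."""
    return kl_divergence(prior, neighborhood_prior) + entropy(prior)

# Two hypothetical neighbours that both favour the first class.
neighbors = [[0.7, 0.1, 0.1, 0.1], [0.8, 0.1, 0.05, 0.05]]
pi_n = neighborhood_mixture(neighbors, [0.5, 0.5])

agreeing = energy([0.9, 0.05, 0.03, 0.02], pi_n)     # confident, agrees with neighbours
uninformative = energy([0.25, 0.25, 0.25, 0.25], pi_n)
conflicting = energy([0.05, 0.9, 0.03, 0.02], pi_n)  # confident, disagrees with neighbours
```

For these values the energy is lowest for the prior that is confident and agrees with its neighbors, higher for the uninformative prior, and highest for the confident prior that contradicts its neighbors, which is the ordering the potentials are meant to induce.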
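Diplaros et al. argue that improved segmentations can be achieved by also considering an MRF directly on the pixel posterior distributions by treating the posteriors as random variables $\mathbf{P} = \{\mathbf{p}_i\}_{i=1:M}$, where the components of $\mathbf{p}_i$ are defined as $p_{ik} = p(x_i = k | \Theta, \boldsymbol{\pi}_i, \mathbf{y}_i)$, computed by Bayes rule from $p(\mathbf{y}_i | x_i = k, \Theta)$ and $p(x_i = k)$. We can write the posterior over $\mathbf{P}$ as $p(\mathbf{P}) \propto \prod_{i=1}^{M} \exp\left(-\tfrac{1}{2} E(\mathbf{p}_i, \mathbf{p}_{N_i})\right)$, where $\mathbf{p}_{N_i}$ is a mixture defined in the same spirit as $\boldsymbol{\pi}_{N_i}$. The joint distribution can now be written as
$$p(\mathbf{P}, \mathbf{Y}, \Theta, \boldsymbol{\pi} | \varphi_0) \propto \exp\left[\log p(\mathbf{Y}, \Theta, \boldsymbol{\pi} | \varphi_0) - \tfrac{1}{2} \sum_{i=1}^{M} E(\mathbf{p}_i, \mathbf{p}_{N_i})\right]. \tag{6}$$
Due to the coupling between $\boldsymbol{\pi}_i$/$\boldsymbol{\pi}_{N_i}$ and $\mathbf{p}_i$/$\mathbf{p}_{N_i}$, the optimization of (6) is not straightforward. We therefore introduce auxiliary variables $\mathbf{q}_i$ and $\mathbf{s}_i$ and take the logarithm, which results in the following cost function
$$F = \log p(\Theta | \varphi_0) + \sum_{i=1}^{M} \left[\log p(\mathbf{y}_i | \Theta, \boldsymbol{\pi}_i) - \tfrac{1}{2}\left(D(\mathbf{s}_i \| \boldsymbol{\pi}_i \circ \boldsymbol{\pi}_{N_i}) + D(\mathbf{q}_i \| \mathbf{p}_i \circ \mathbf{p}_{N_i})\right)\right], \tag{7}$$
where $\circ$ is the Hadamard (component-wise) product. Note that when $\boldsymbol{\pi}_i = \mathbf{s}_i$ and $\mathbf{p}_i = \mathbf{q}_i$, (7) reduces to the logarithm of (6) (ignoring the constant terms). Maximization of $F$ can now be achieved in an EM-like fashion. In the E-step we maximize $F$ w.r.t. $\mathbf{q}_i$, $\mathbf{s}_i$, while the M-step maximizes over the parameters $\Theta$ and $\boldsymbol{\pi}$. We can see from (7) that $F$ is maximized w.r.t. $\mathbf{q}_i$ and $\mathbf{s}_i$ when the divergence terms vanish, therefore, $\hat{\mathbf{s}}_i = \xi_{s_i}\, \boldsymbol{\pi}_i \circ \boldsymbol{\pi}_{N_i}$, $\hat{\mathbf{q}}_i = \xi_{q_i}\, \mathbf{p}_i \circ \mathbf{p}_{N_i}$, where $\xi_{s_i}$ and $\xi_{q_i}$ are the normalization constants.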
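The statement that the cost function reduces to the joint distribution when the auxiliary variables coincide with the priors and posteriors rests on a short identity. The derivation below is a sketch under two assumptions: the energy is the sum of a Kullback-Leibler divergence and an entropy, $E(\mathbf{a}, \mathbf{b}) = D(\mathbf{a} \| \mathbf{b}) + H(\mathbf{a})$, and the divergence terms of the cost function take the form $D(\mathbf{s}_i \| \boldsymbol{\pi}_i \circ \boldsymbol{\pi}_{N_i})$ with an unnormalized second argument.

```latex
\begin{align*}
D(\boldsymbol{\pi}_i \,\|\, \boldsymbol{\pi}_i \circ \boldsymbol{\pi}_{N_i})
  &= \sum_k \pi_{ik} \log \frac{\pi_{ik}}{\pi_{ik}\, \pi_{N_i k}}
   = -\sum_k \pi_{ik} \log \pi_{N_i k}, \\
E(\boldsymbol{\pi}_i, \boldsymbol{\pi}_{N_i})
  &= \sum_k \pi_{ik} \log \frac{\pi_{ik}}{\pi_{N_i k}} - \sum_k \pi_{ik} \log \pi_{ik}
   = -\sum_k \pi_{ik} \log \pi_{N_i k}.
\end{align*}
```

Both expressions equal the cross-entropy between a pixel's distribution and its neighborhood mixture, so setting $\mathbf{s}_i = \boldsymbol{\pi}_i$ turns each divergence term into the corresponding MRF energy; the same argument applies to $\mathbf{q}_i$ and $\mathbf{p}_i$.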
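The M-step is not as straightforward, since direct optimization over $\Theta$ and $\boldsymbol{\pi}$ is intractable and we resort to maximizing its lower bound. We define $\tilde{\mathbf{s}}_i = \hat{\mathbf{s}}_i + \hat{\mathbf{s}}_{N_i}$ and $\tilde{\mathbf{q}}_i = \hat{\mathbf{q}}_i + \hat{\mathbf{q}}_{N_i}$ and by Jensen's inequality lower-bound the divergence terms as
$$-\sum_{i} D(\hat{\mathbf{s}}_i \| \boldsymbol{\pi}_i \circ \boldsymbol{\pi}_{N_i}) \geq \sum_{i} \tilde{\mathbf{s}}_i^{T} \log \boldsymbol{\pi}_i + c_s, \qquad -\sum_{i} D(\hat{\mathbf{q}}_i \| \mathbf{p}_i \circ \mathbf{p}_{N_i}) \geq \sum_{i} \tilde{\mathbf{q}}_i^{T} \log \mathbf{p}_i + c_q,$$
where we have defined $\hat{\mathbf{s}}_{N_i} = \sum_{j \in N_i, j \neq i} \lambda_{ij} \hat{\mathbf{s}}_j$ and $\hat{\mathbf{q}}_{N_i} = \sum_{j \in N_i, j \neq i} \lambda_{ij} \hat{\mathbf{q}}_j$, the weights are taken to be symmetric ($\lambda_{ij} = \lambda_{ji}$), and $c_s$, $c_q$ collect the terms that do not depend on $\Theta$ and $\boldsymbol{\pi}$. An appealing property of the model (7) is that its E-step can be efficiently implemented through convolutions and Hadamard products. Recall that the calculation of the $i$-th pixel's neighborhood prior distribution $\boldsymbol{\pi}_{N_i}$ entails a weighted combination of the neighboring pixel priors $\boldsymbol{\pi}_j$. Let $\boldsymbol{\pi}_{\cdot k}$ be the $k$-th component priors arranged in a matrix of image size. Then the neighborhood priors can be computed by the following convolution $\boldsymbol{\pi}_{N \cdot k} = \boldsymbol{\pi}_{\cdot k} * \lambda$, where $\lambda$ is a discrete kernel with its central element set to zero and its elements summing to one. Let $\mathbf{S}_{\cdot k}$, $\mathbf{Q}_{\cdot k}$ and $\mathbf{P}_{\cdot k}$ be the image-sized counterparts corresponding to the sets of distributions $\{\mathbf{s}_i\}$, $\{\mathbf{q}_i\}$ and $\{\mathbf{p}_i\}$, respectively, and let $\lambda_1$ denote the kernel $\lambda$ in which the central element is set to one. Then the calculation of the $k$-th component priors for all pixels in the E-step can be written as
$$\hat{\mathbf{S}}_{\cdot k} = \boldsymbol{\xi}_s \circ \boldsymbol{\pi}_{\cdot k} \circ (\boldsymbol{\pi}_{\cdot k} * \lambda), \qquad \hat{\mathbf{Q}}_{\cdot k} = \boldsymbol{\xi}_q \circ \mathbf{P}_{\cdot k} \circ (\mathbf{P}_{\cdot k} * \lambda), \qquad \boldsymbol{\pi}_{\cdot k} = \tfrac{1}{4}\left(\hat{\mathbf{S}}_{\cdot k} + \hat{\mathbf{Q}}_{\cdot k}\right) * \lambda_1,$$
where $\boldsymbol{\xi}_s$ and $\boldsymbol{\xi}_q$ are image-sized matrices of the per-pixel normalization constants.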
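The convolution-based smoothing described above can be sketched in a few lines. The example multiplies each component plane with its neighborhood average and normalizes per pixel. It is an illustrative sketch rather than the reference Matlab code: zero padding at the image border, the uniform 3 x 3 kernel and the use of two components instead of four are assumptions made for brevity.

```python
def convolve_same(image, kernel):
    """2-D 'same'-size correlation with zero padding (border handling is an
    assumption); identical to convolution for the symmetric kernel used here."""
    rows, cols = len(image), len(image[0])
    half_r, half_c = len(kernel) // 2, len(kernel[0]) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for dr in range(-half_r, half_r + 1):
                for dc in range(-half_c, half_c + 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        acc += kernel[dr + half_r][dc + half_c] * image[rr][cc]
            out[r][c] = acc
    return out

def smooth_distributions(planes, kernel):
    """For every component plane X_k compute X_k * (X_k conv kernel) element-wise,
    then normalise across components at each pixel."""
    products = []
    for plane in planes:
        neigh = convolve_same(plane, kernel)
        products.append([[a * b for a, b in zip(row_a, row_b)]
                         for row_a, row_b in zip(plane, neigh)])
    for r in range(len(planes[0])):
        for c in range(len(planes[0][0])):
            norm = sum(p[r][c] for p in products)
            for p in products:
                p[r][c] /= norm
    return products

# Kernel with a zero centre whose elements sum to one.
kernel = [[0.125, 0.125, 0.125], [0.125, 0.0, 0.125], [0.125, 0.125, 0.125]]

# Two-component toy prior: the centre pixel disagrees with its neighbours.
plane_a = [[0.8, 0.8, 0.8], [0.8, 0.4, 0.8], [0.8, 0.8, 0.8]]
plane_b = [[1.0 - v for v in row] for row in plane_a]
smoothed = smooth_distributions([plane_a, plane_b], kernel)
```

After smoothing, the center pixel is pulled towards the class favored by its neighbors, while the result remains a valid distribution at every pixel.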
The EM procedure for fitting our generative model to the input image is summarized in Algorithm 1.
3 Obstacle detection
We formulate obstacle detection as the problem of estimating an image obstacle map, i.e., determining the pixels in the image that correspond to the sea, while all the remaining pixels represent potential obstacles. We therefore first fit our semantic model from Section 2 to the input image and estimate the smoothed a posteriori probability distribution $\hat{\mathbf{q}}_i$ across the four semantic components for each pixel. A pixel is classified as water if the corresponding posterior reaches its maximum for the water component among all four components. In our setting the component indexed by $k = 3$ corresponds to the water region, which results in the labeled image $\mathbf{B}$ with the $i$-th pixel label $b_i$ defined as
$$b_i = \begin{cases} 1, & \arg\max_{k} \hat{q}_{ik} = 3, \\ 0, & \text{otherwise}. \end{cases}$$
Retaining only the largest connected water region in the image results in the current obstacle image map. All blobs of non-water pixels within the connected water region are proclaimed potential obstacles in the water. This is followed by a non-maxima suppression stage, which merges detections that are located in close proximity (e.g., due to object fragmentation) to reduce multiple detections of the same obstacle. The water edge is extracted as the longest connected outer edge of the connected region corresponding to the water. The obstacle detection is summarized in Algorithm 2 and visualized in Figure 1.
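The map-extraction steps described above can be sketched as follows. This is a simplified illustration, not Algorithm 2 itself: the zero-based water index, 4-connectivity and the rule that an obstacle blob must not touch the image border are assumptions of this sketch, and the non-maxima suppression and water-edge extraction are omitted.

```python
from collections import deque

WATER = 2  # zero-based index of the water component (an assumption of this sketch)

def water_mask(posteriors):
    """posteriors[r][c] holds the four class probabilities of one pixel."""
    return [[1 if max(range(len(q)), key=q.__getitem__) == WATER else 0
             for q in row] for row in posteriors]

def connected_regions(mask, value):
    """4-connected regions of pixels equal to `value`, via breadth-first search."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != value or seen[r][c]:
                continue
            region, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and mask[ny][nx] == value):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(region)
    return regions

def obstacles_in_water(mask):
    """Keep the largest water region; return it with the enclosed non-water blobs."""
    rows, cols = len(mask), len(mask[0])
    water = set(max(connected_regions(mask, 1), key=len))
    kept = [[1 if (r, c) in water else 0 for c in range(cols)] for r in range(rows)]
    blobs = [blob for blob in connected_regions(kept, 0)
             if not any(r in (0, rows - 1) or c in (0, cols - 1) for r, c in blob)]
    return kept, blobs

# Toy mask: the top row is land/sky, one non-water pixel lies inside the water.
mask = [[0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1],
        [1, 1, 0, 1, 1],
        [1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1]]
obstacle_map, blobs = obstacles_in_water(mask)
```

In the toy mask the single enclosed pixel is reported as a potential obstacle, whereas the non-water strip along the top border is treated as part of the scene above the water edge.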
Algorithm 1 requires initial values for the parameters $\Theta$ and $\boldsymbol{\pi}$. At the first frame, when no other prior knowledge exists, we construct the initial distribution by vertically splitting the image into three regions, specified in proportions of the image height (see Figure 4). A Gaussian is computed from each region, thus forming the observed components. The prior over all pixels is initialized to equal probabilities for the three components, while the prior on the uniform component is set to a low constant value (see Section 4). These parameters are used to initialize the EM in Algorithm 1.
The shape of the vertical splits in Figure 4 should ideally follow the position (and inclination) of the true horizon for optimal initialization of the parameters. An estimate of the true horizon depends on the camera placement and can ideally be obtained externally from an IMU sensor, but per-frame IMU measurements are not available in the dataset that is used in our evaluation (Section 5). Therefore, we assume that the horizon, as well as the edge of water, is usually located within a certain band of the image height, which is why this band is excluded from the computation of the initial parameter values. Making no further assumptions regarding the proportion between the components in the final segmentation, equal regions (2 and 3) are used to initialize the parameters of components 2 and 3. The assumption on region splitting is often violated in our dataset from Section 5 due to boat inclination during turning maneuvers, due to the boat tilting forward and backward, and because the camera might not have been mounted at exactly the same spot when the boat was reassembled after transportation to the test site during the several months over which the dataset was taken. Nevertheless, the segmentation algorithm is robust enough to handle non-ideal initializations as long as there are no extreme deviations, like the boat toppling or riding on extremely high waves (small coastal USVs are not designed to physically endure such extreme weather conditions anyway).
During the USV's operation, we can exploit the continuity of sequential images in the video stream by using the parameter values of the converged model from the previous time-step to initialize the EM algorithm in the current time-step. To reduce possible propagation of errors stemming from false segmentations in the previous time-steps, a zero-order soft reset is applied in the initialization of the EM in each time-step. In particular, the EM is initialized by merging the observed components, computed from the current image as described above, with the components of the converged model from the previous time-step. The parameters of the $k$-th component are initialized by forming a weighted two-component mixture model from the $k$-th components of these two sets and approximating the mixture by a single component by matching the first two moments of the distributions (see, e.g., Kristan et al. [38, 39, 40]). The two mixture weights can be used to balance the contribution of each component. The priors over the pixels are initialized by the smoothed posterior from the previous time-step. The initialization is sketched in Figure 4.
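The moment-matching step used in the soft reset can be illustrated on one-dimensional Gaussians. The function name and the example weight are hypothetical; the multivariate case replaces the squared mean offsets by outer products.

```python
def merge_gaussians(mean_a, var_a, mean_b, var_b, weight_a):
    """Approximate the mixture w*N(a) + (1-w)*N(b) by a single 1-D Gaussian
    that has the same mean and variance (first two moments) as the mixture."""
    weight_b = 1.0 - weight_a
    mean = weight_a * mean_a + weight_b * mean_b
    var = (weight_a * (var_a + (mean_a - mean) ** 2)
           + weight_b * (var_b + (mean_b - mean) ** 2))
    return mean, var

# Previous-frame component N(0, 1) merged with the freshly observed N(2, 1).
merged_mean, merged_var = merge_gaussians(0.0, 1.0, 2.0, 1.0, weight_a=0.5)
```

With equal weights the merged mean lies halfway between the two components, and the merged variance grows by the spread between the two means, so a large disagreement between the previous model and the current observation yields a broader initial component.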
4 Implementation details
In our application, the measurement at each pixel is encoded by a five-dimensional feature vector composed of the $i$-th pixel's two image coordinates and its three color channels. We have determined in a preliminary study that sufficiently good obstacle detection is achieved by first performing detection on a reduced-size image and then rescaling the results to the original image size. The rescaling was set to match the lower scale of the objects of interest, as smaller objects do not present a danger to the USV. This approach drastically speeds up the algorithm, to approximately 10 ms per frame in our experiments. The uniform distribution component in (1) is defined over the image pixel domain and returns an equal probability for each pixel; its value follows from the size of the rescaled image and the interval to which the color channels are constrained. The EM optimization requires specification of the convolution kernel. Note that the only constraint on the convolution kernel is that its central element is set to zero and all elements sum to one. We use a Gaussian kernel with its central element set to zero and set the size of the kernel in proportion to the image size. Recall from Section 3.1 that the weight parameter influences the soft reset of the parameters used to initialize the EM. In our implementation, a slightly larger weight is given to the parameters estimated at the previous time-step.
4.1 Learning the weak priors
The spatial components in the feature vector play a dual role. First, they enforce, to some extent, the spatial smoothness of the segmentation on their own. Second, they provide a means of weakly constraining the Gaussian components such that they reflect the three dominant semantic image parts. This is achieved by the weak priors on the Gaussian means. Since the locations and shapes of the semantic components vary significantly with the views, we select weak priors, which are estimated using the training set from our database (see Section 5). Given a set of training images, the prior of the $k$-th component is estimated by extracting the features, i.e., the sets of feature vectors $\mathbf{y}_i$ corresponding to the $k$-th component, from all images and fitting a single Gaussian to them. Note that, in general, there is a chance that the training images might bias the horizontal location of the estimated Gaussian to the left or right part of the image. In this case, we could constrain the horizontal position of the Gaussians to be in the middle of the image; however, we have observed that the components of the prior estimated from our dataset are sufficiently centered and we do not apply any such constraints.
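The prior-learning step can be sketched for the spatial part of the features only. The sketch pools the normalized coordinates of all pixels annotated with a given component over the training masks and fits a single Gaussian. The mask encoding (integer labels), the coordinate normalization and the function names are assumptions for illustration; the color part of the prior would be handled in the same way.

```python
def fit_gaussian_2d(points):
    """Sample mean and maximum-likelihood covariance of a list of 2-D points."""
    n = len(points)
    mean = [sum(p[d] for p in points) / n for d in range(2)]
    cov = [[sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in points) / n
            for b in range(2)] for a in range(2)]
    return mean, cov

def spatial_prior(annotation_masks, label):
    """Pool normalised (x, y) pixel centres carrying `label` over all masks
    and fit a single Gaussian to them."""
    points = []
    for mask in annotation_masks:
        rows, cols = len(mask), len(mask[0])
        for r, row in enumerate(mask):
            for c, value in enumerate(row):
                if value == label:
                    points.append(((c + 0.5) / cols, (r + 0.5) / rows))
    return fit_gaussian_2d(points)

# Two tiny hypothetical training masks: label 1 on top, label 3 at the bottom.
masks = [[[1, 1], [3, 3]], [[1, 1], [3, 3]]]
prior_mean, prior_cov = spatial_prior(masks, label=3)
```

For the toy masks the bottom component receives a prior centered horizontally and located in the lower half of the image, with variance only along the horizontal axis.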
Examples of the spatial parts of the priors estimated from the training set of the dataset presented in Section 5 are shown in Figure 5. Our algorithm, as well as the learning routine, was implemented in Matlab; a reference code is publicly available at http://www.vicos.si/Research/UnmannedSurfaceVehicles.
5 Marine obstacle detection dataset
Lacking a sufficiently large, publicly available annotated dataset to test our method, we have constructed our own dataset, which we call the Marine obstacle detection dataset (MODD). The MODD consists of 12 video sequences, providing in total 4454 fully annotated frames with a resolution of 640 x 480 pixels. The dataset is made publicly available, along with the annotations and Matlab evaluation routines, from the MODD homepage (http://www.vicos.si/Downloads/MODD).
The video sequences have been recorded from multiple platforms, most of them from a small 2.2 meter USV (see Figure 1; a video of our USV is available online from the MODD homepage). The USV was developed by Harpha Sea, d.o.o. Koper, is based on a catamaran hull design and is powered by a steerable electric thrust propeller running on a LiPo battery. It can reach a maximum speed of 2.5 m/s and has an extremely small turn radius. Steering and route planning are handled by an ARM-powered MCU with a redundant power supply. For navigation, the MCU relies on a microelectromechanical inertial navigation unit (MEMS IMU), a solid-state digital compass and differential GPS. The USV has two different communication channels to the shore (high- and low-bandwidth) and its mission can be programmed remotely. An Axis 207W camera was placed on the USV approximately 0.7 m above the water surface, looking in front of the vessel, with an approximately 55° field of view. The camera has been set up to automatically adjust to variations in lighting conditions. Since the boat was reassembled between the runs over several months, the placement of the camera varies slightly across the dataset.
The video sequences have been acquired in the Gulf of Trieste, specifically in the Port of Koper, Slovenia (Figure 8), over a period of months at different times of day and under different weather conditions. The USV was manually operated by a human pilot and effort was made to simulate realistic navigation, including threats of collision. The pilot was instructed to deliberately drive in a way that simulates situations in which an obstacle might present a danger to the USV. This includes obstacles being very close to the boat as well as situations in which the boat was heading straight towards an obstacle for a number of frames.
The first ten videos in the dataset are meant for evaluation of the obstacle-map estimation algorithms under normal conditions. These videos still vary quite significantly from one another and simulate the conditions under which the USV is expected to operate. We thus term these ten videos normal conditions and show some examples in the first ten images of Figure 6. The last two videos are meant for analysis of situations in which the boat is directly facing the sun. This causes extreme changes in the automatic shutter and camera settings, resulting in significant changes of contrast and color of all three semantic components. Facing the sun also generates a significant amount of fragmented glitter, which sometimes shows up as a larger, fully connected region of the reflected sun. We thus denote these last two videos as extreme conditions. Some examples are shown in the last two images of Figure 6.
Each frame is annotated manually by a polygon denoting the edge of water and bounding boxes are placed on large obstacles (those that straddle the water edge) and small obstacles (those that are fully surrounded by water). See Figure 8 for illustration. The annotation was made by a human annotator and all annotations on all images of the dataset were later verified by an expert. To allow a fast overview of the annotations by the potential users of the dataset, the dataset provides a rendered video with annotations overlay, for each test sequence in the dataset – these videos are included as part of the dataset and available from the dataset homepage as well.
In the following, some general statistics of the dataset are provided. The boat was driving within 200 meters of the shore, and most of the depicted obstacles are in this zone. Out of the 12 image sequences in the dataset, nine contain either large or small obstacles, one contains only an annotated sea edge, and two contain glitter annotations, sea edge annotations, and no objects. The number of objects per frame is exponentially distributed with an average of 1.1 and a variance of 1.23. The distribution of the annotated size of small and large obstacles is shown in Figure 7. For the algorithms that require training or validation of their parameters, we have compiled a collection of twenty images in which we manually annotated the pixels corresponding to the three semantic components. Figure 10 shows some examples of images and the corresponding annotations.
5.1 The evaluation protocol
The evaluation protocol is designed to reflect the two distinct challenges that USVs face: water edge (shoreline/horizon) detection and obstacle detection. The former is measured as the root mean square error (RMSE) of the water edge position, and the latter is measured via the efficiency of small object detection, expressed as precision, recall, F-score and the average number of false positives per frame.
To evaluate the RMSE of the water edge position, the ground truth annotations were used in the following way. A polygon denoting the water surface was generated from the water edge annotations. Areas where large obstacles intersect the polygon were removed. Note that, given the scene representation shown in Figure 8, one cannot distinguish between large obstacles (e.g., large ships) and stationary elements of the shore (e.g., small piers). This way, a refined water edge was generated. For each pixel column in the full-sized image, the distance between the water edge given by the ground truth and the water edge determined by the algorithm was calculated. These values are summarized into a single value by averaging across all frames and videos.
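The column-wise water-edge error can be sketched as follows. How exactly the per-column distances are pooled across frames is not spelled out above, so this sketch assumes one plausible reading: an RMSE is computed per frame and the per-frame values are then averaged. The function names and the edge representation (one row coordinate per pixel column) are hypothetical.

```python
import math

def water_edge_rmse(gt_edge, est_edge):
    """RMSE between two water edges given as one row position per pixel column."""
    if len(gt_edge) != len(est_edge):
        raise ValueError("both edges need one entry per image column")
    squared = [(g - e) ** 2 for g, e in zip(gt_edge, est_edge)]
    return math.sqrt(sum(squared) / len(squared))

def mean_rmse(frames):
    """Average the per-frame RMSE over (ground truth, estimate) pairs."""
    return sum(water_edge_rmse(g, e) for g, e in frames) / len(frames)

# Two toy frames of a two-column image.
frame_errors = [([0, 0], [3, 3]), ([0, 0], [1, 1])]
single = water_edge_rmse([10, 10], [13, 14])
overall = mean_rmse(frame_errors)
```

The errors are expressed in pixels of the full-sized image, so an estimate produced on a reduced-size image would first have to be rescaled to the original resolution.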
The evaluation of object detection follows the recommendations from the PASCAL VOC challenges by Everingham et al., with a small, application-specific modification: all small obstacles (provided as ground truth or detected) that are closer to the annotated water line than 5% of the image height are discarded prior to evaluation on each frame. This addresses situations in which a detection may oscillate between a fully water-enclosed obstacle and a “dent” in the shoreline, resulting in false negatives. Figure 9 shows an example with two images of a scuba diver emerging from the water. Note that in both images, the segmentation successfully labeled the scuba diver as an obstacle. But in the left-hand image we obtain an explicit detection, since the estimated water edge runs above the scuba diver. In the right-hand image the edge runs below the scuba diver and we do not get an explicit detection, even though the algorithm successfully labeled the scuba diver’s region as being an obstacle. Note that the proposed treatment of near-edge detections/ground-truths is also consistent with the problem of obstacle avoidance: the USV is concerned primarily with the avoidance of the obstacles in its immediate vicinity. In counting false positives (FP), true positives (TP) and false negatives (FN), we follow the methodology of PASCAL VOC, with the minimum overlap set to 0.3. FP, TP and FN were used to calculate precision, recall, F-score and average false positives per frame.
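The counting of TP, FP and FN with a minimum overlap of 0.3 can be illustrated with the following Python sketch. It is not the evaluation code released with the dataset: the function names are hypothetical, boxes are assumed to be axis-aligned `(x1, y1, x2, y2)` tuples, the overlap is assumed to be intersection-over-union, and detections are matched greedily in the order given because no confidence scores are assumed. The near-edge filtering step is omitted.

```python
def overlap(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_matches(ground_truth, detections, min_overlap=0.3):
    """Greedily match detections to ground truth; return (TP, FP, FN)."""
    unmatched = list(ground_truth)
    tp = fp = 0
    for det in detections:
        # Pick the still-unmatched ground-truth box with the largest overlap.
        best = max(unmatched, key=lambda gt: overlap(gt, det), default=None)
        if best is not None and overlap(best, det) >= min_overlap:
            unmatched.remove(best)  # each ground-truth box is matched once
            tp += 1
        else:
            fp += 1
    return tp, fp, len(unmatched)


def detection_scores(tp, fp, fn, n_frames):
    """Return precision, recall, F-score and average FP per frame."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score, fp / n_frames
```

In practice the counts would be accumulated over all frames of all videos before calling `detection_scores`.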
In the following we will denote our obstacle image-map estimation method (Algorithm 2) as the semantic-segmentation model (SSM). The experimental analysis was split into three parts. In the first part we evaluate the influence of the different color spaces on the SSM’s performance. In the second and third part we analyze how various elements of the SSM affect its performance and compare it to the alternative methods. All experiments were performed on a desktop PC with 3.06 GHz Intel Xeon E5-1620 CPU in a single thread in Matlab.
6.1 Influence of the color space
The aim of the first experiment was to evaluate how the different colorspaces affect the segmentation performance. We have therefore performed experiments in which the feature vector (Section 4) was calculated from RGB, HSV, Lab and YCrCb colorspace. For each of the selected colorspaces, the weak priors were learned from the training images on the Modd dataset (Section 5). All experiments were performed on all twelve testing videos from the Modd dataset. The results are shown in Table I.
The results show that best performance is achieved with the YCrCb and Lab colorspace, which is not surprising, since these colorspaces are known to better cluster visually-similar colors. Similar is true for the HSV space, but that space suffers from the circular property of the Hue component (i.e., red color is on the left-most and right-most part of the Hue spectrum). With respect to the edge of the water estimation, the lowest error is achieved when using the Lab colorspace, while only a slightly worse performance is obtained with the YCrCb colorspace. On all other measures, the YCrCb colorspace yields best results, although comparable to the Lab colorspace. While the results are worse when using the RGB or the HSV colorspace, we note that these results do not exhibit drastically poorer performance, which speaks of a level of robustness of the SSM to the choice of the colorspace. Nevertheless, given the results in Table I, we select the YCrCb and use this colorspace in the subsequent experiments.
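The circular property of the Hue component mentioned above can be illustrated with Python's standard `colorsys` module. The two sample colors and the helper names are hypothetical, chosen only to show that two nearly identical reds land at opposite ends of the hue axis when hue is treated as a linear feature.

```python
import colorsys


def hue(r, g, b):
    """Hue in [0, 1) of an RGB color whose channels are in [0, 1]."""
    return colorsys.rgb_to_hsv(r, g, b)[0]


def circular_hue_distance(h1, h2):
    """Distance between two hues measured around the hue circle."""
    d = abs(h1 - h2) % 1.0
    return min(d, 1.0 - d)


# Two visually similar reds, one leaning to orange and one to pink.
orange_red = hue(1.0, 0.05, 0.0)
pink_red = hue(1.0, 0.0, 0.05)

# Treated as a linear feature, the two hues are almost a full range apart.
linear_gap = abs(orange_red - pink_red)
# Measured around the circle, they are close neighbors.
circular_gap = circular_hue_distance(orange_red, pink_red)
```

A Gaussian fitted to hue as a linear coordinate sees `linear_gap`, not `circular_gap`, which is one plausible reading of why HSV performs worse here.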
6.2 Comparison to alternative approaches
Given a fixed colorspace, we are left with evaluating how much each part of our model contributes to the final performance. We have therefore also implemented two variants of our approach. In contrast to the SSM, these two variants do not use the MRF constraints and are in this respect only mixtures of three Gaussians with priors on their means and with a uniform component. The two variants differ in that one of them ignores the spatial information in visual features and relies only on color.
Note that the SSM is conceptually similar to the Grab-cut algorithm from Rother et al. for binary segmentation, but with distinct differences. In Grab-cut, the user provides a bounding box roughly containing the object, thus initializing the segmentation mask. Two visual models using a GMM are constructed from this segmentation mask, one for the object and one for the background. An MRF is then constructed over the pixel grid and the graph cut from Boykov et al. is used to infer an improved segmentation mask. This procedure is then iterated until convergence. There are significant differences between the proposed SSM and Grab-cut. In contrast to the user-provided bounding box in Grab-cut, the SSM’s weak supervision comes from the initialization of the parameters from the previous time-step and from the weak priors. The second distinction is that our approach does not require explicit estimation of the segmentation mask to refine the mixture model. This allows for a better propagation of uncertainty during the iteration of the algorithm, leading to improved segmentation.
To further evaluate the contribution of the particular MRF optimization of our SSM, we have implemented a variant of the Grab-cut algorithm, which uses our semantic mixture model, but applies graph-cuts for optimization over the MRF. The resulting obstacle-map estimation tightly follows Algorithm 1 and Algorithm 2 with a slight modification of Algorithm 1: after each epoch of the EM, we apply the graph-cut from Bagon to segment the image into a water/non-water mask. This mask is then used as in the original Grab-cut to refine the mixture model. For fair comparison, we use exactly the same weakly-constrained mixture model as in the SSM and the YCrCb colorspace, and call this approach the Grab-cut model.
We have also compared our approach to general segmentation approaches, namely the superpixel-based approach from Li et al. and the graph-based segmentation algorithm from Felzenszwalb and Huttenlocher.
For fair comparison, all the algorithms were executed on the images. We have experimented with the parameters of and and have set them to optimal performance for our dataset. Since was designed to run on larger images, we have also performed the experiments for on full-sized images – we denote this variant by . We have performed the comparative analysis separately for the normal and extreme conditions.
6.2.1 Performance under normal conditions
The results of the experiments on the normal conditions part of the Modd are summarized in Table II, while Figure 11 shows an example of typical segmentation masks from the compared algorithms. The segmentation results in these images are color coded as follows. The original image is represented only by the blue channel, manual water annotations are shown in the green channel, and algorithm-generated water segmentation is shown in the red channel. Therefore, the cyan region shows the area, which has been annotated as water, but has not been segmented as such by the algorithm (bad). The magenta region shows the area, which has not been annotated as water, but has been segmented as such by the algorithm (bad). The yellow area shows the area which has been annotated as water and has been segmented as such by the algorithm (good), and blue region shows the area which has not been annotated as water and has not been segmented as such (good). Finally, the darker band under the annotated edge of the water in all colors shows the ignore region, where evaluation of small obstacle detection does not take place.
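The color coding described above amounts to stacking three images into the channels of one RGB image. The following Python sketch shows the composition for a single pixel. The function names are hypothetical and the 0..255 value range is an assumption; the dark ignore band is not modeled.

```python
def overlay_pixel(gray, annotated_water, segmented_water):
    """Compose one RGB pixel of the evaluation overlay.

    gray            -- image intensity in 0..255 (goes to the blue channel)
    annotated_water -- True if the pixel is annotated as water (green)
    segmented_water -- True if the algorithm labeled it as water (red)
    """
    red = 255 if segmented_water else 0
    green = 255 if annotated_water else 0
    return (red, green, gray)


def region_kind(annotated_water, segmented_water):
    """Name the outcome encoded by the red/green channel combination."""
    if annotated_water and segmented_water:
        return "water found (good)"
    if annotated_water:
        return "water missed (bad)"
    if segmented_water:
        return "non-water labeled as water (bad)"
    return "non-water kept (good)"
```

With a bright underlying pixel, annotated-only water comes out cyan and segmented-only water comes out magenta, matching the description of the two error regions.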
Recall that, in contrast to the SSM, the two mixture-only variants do not impose a Markov random field constraint. In Figure 11 this clearly leads to a poorer segmentation, resulting in false positive detections of obstacles as well as significant over-segmentations. Quantitatively, the poorer performance is reflected in Table II as a lower F-measure, higher average number of false positives and larger edge of the water estimation error. Compared to the SSM, we observe a significant drop in detection quality of the mixture-only variant that retains the spatial features, especially in precision. This speaks of the importance of the local labeling constraints imposed by the MRF in the SSM. The performance further drops with the color-only variant, which implies that spatial components in the feature vectors bear important information for proper segmentation as well. The Grab-cut model, on the other hand, does impose an MRF, but its segmentation is still poorer than that of the SSM. We believe that the main reason is that the Grab-cut model applies graph-cuts to perform hard segmentation during EM updates, whereas the SSM optimizes the cost function within a single EM framework, thus avoiding the need for hard segmentations during the EM steps, which leads to a better final result. By far the worst segmentation results are obtained by the superpixel-based and graph-based segmentation methods. Note that while these segmentation methods do assume some local consistency of segmentation, they still perform poorer than the . The improved performance of can be attributed exclusively to our formulation of the segmentation model within the graphical model from Figure 3.
Figure 12 shows further examples of SSM segmentation maps (the first fourteen images), the spatial part of the Gaussian mixture and the detected objects in water. The appearance and texture of the water varies significantly between the various scenes, and the same is true for the other two semantic components. The images also vary in the scene composition in that the vertical position as well as the attitude of the water edge (see second row in Figure 12) vary significantly. Nevertheless, note that the model is able to adapt well to these compositions and successfully decomposes the scene into water regions, in-water obstacles and fairly well delineates the water edge.
Our algorithm performed (segmentation+detection) at a rate higher than 70 frames per second. Most of the processing was spent on fitting our semantic model and obstacle-map estimation ( ms), while ms was spent on the obstacle detection. For fair comparison of segmentation algorithms, we report in the Table II only the times required for the obstacle-map estimation. Although note that the obstacle detection part did require more processing time for the methods that delivered poor segmentation masks with more false positives. On average, our EM algorithm in converged in approximately three iterations. Note that the graph cut routine in , part of and the were implemented in C and interfaced to Matlab, while all the other variants were entirely implemented in Matlab. Therefore, the computational time results for segmentations are not directly comparable among the methods, but still offer a level of insight. In terms of processing time, the ’s segmentation was the fastest, running at 100 frames per second. The and performed approximately as fast as , followed by , , and . We conclude that the came out on top as the fastest method that also achieved the best detection performance as well as accuracy.
6.2.2 Performance under extreme conditions
We were interested in measuring two properties of the algorithms under conditions when the boat is facing the sun. In particular, we were interested in measuring how the sun affects the edge-of-water estimation and how the glitter affects the detection. We have therefore repeated two variants of the experiments on the videos in Modd denoted by the extreme conditions (videos 11 and 12 in Figure 6). In the first variant, we ignored any detections in the regions that were denoted as glitter regions in ground truth. In the second variant, all detections were accounted for. Note that the videos denoted as extreme conditions do not contain any objects, therefore there were no true positives and any detected object was a false positive. Because of this, we present in the results (Table III) only the edge of water estimation error and the average number of false positives (by definition, both accuracy and precision would be zero in such a case).
In terms of edge of water estimation, the slightly outperforms the . The ignores the spatial information and generally oversegments the regions close to the shoreline (as seen in Figure 11), which in this case actually reduces the error compared to . The reason is that the attributes the upper part of the sun reflection at the shoreline in video 11 (Figure 6) to an obstacle instead of the water. When ignoring the glitter region, the outperforms the competing methods by not detecting any false positives (zero ), while the competing methods exhibit larger values of the false positives. When considering also the glitter region, the number of false positives only slightly increases for the , while this increase is considerable for the other methods. Note that in this case the again significantly outperforms the other methods, except for . The reason is that the actually fails by grossly oversegmenting the water region, thus assigning almost all glitter to that region. However, looking at the results of the edge estimation, we can also see that this oversegmentation actually consumes also a part of the shoreline, thus leading to poor overall segmentation. Among the remaining methods, the again achieves the lowest average false positive rate. Given these results we conclude that the is much more robust to extreme conditions than the competing methods, while still offering good segmentation results. Some examples of segmentation with are shown in the last four images of Figure 12. Even in these harsh conditions the model is able to interpret the scene well enough with few false obstacle detections. For more illustrative examples of our method and segmentations, please consult the additional online material at http://box.vicos.si/matejk/smc/index.htm.
6.2.3 Failure cases
An example of conditions in which the segmentation is expected to fail is shown in the bottom-most right image of Figure 12. In this image, the boat is facing a low-lying sun directly, which results in a large saturated glitter on the water surface. Since the glitter occupies a large region, and is significantly different from the water, it is detected as an obstacle. Such cases could be handled by image postprocessing, but at a risk of missing true detections. Nevertheless, additional sensors like compass, IMU and sun-position model can be used to identify a detected region as a potential glitter. To offer further insights into the constraints of the proposed segmentation, we show additional failure cases in Figure 13. Figure 13a shows failure due to a strong reflection of the landmass in the sea, while Figure 13b shows an example of failure due to blurred transition from the sea to sky. Note that in both cases, the edge of the sea is conservatively estimated, meaning that true obstacles were not mislabelled as water, but rather portions of water were labelled as obstacle. Figure 13c shows an example in which several obstacles are close-by and are not detected as separate obstacles, but rather as part of the edge of water. An example of potentially dangerous mislabelling is shown in Figure 13d, where a part of the boat on the left is deemed visually-similar to water and is labelled as such. Note, however, that this mislabelling is corrected in the subsequent images in that video.
6.3 Effects of the target size
Note that not all obstacles pose an equal threat to the vessel. In fact, smaller objects are likely not people and likely pose little threat, since they can be run over without damaging the vessel. To inspect our results in such a context, we have compiled the results over all videos with respect to the minimum object size. Any object, whether in the ground truth or among the detections, was ignored if its size was smaller than a predefined value. We also ignored any detected object whose overlap with a removed ground truth object was at least 0.3. This last condition addresses the fact that some objects in the ground truth are slightly smaller than their detected size, which would generate an incorrect false positive if the ground truth object was removed. Figure 14 visualizes the applied thresholds, while the results are given in Table IV.
The results show that the detection remains high over a range of small thresholds, which speaks of a level of robustness of our approach. By increasing the threshold further, the precision as well as the recall increase, and the probability of detecting a false positive in a given frame is drastically reduced. This means that, as the objects approach the USV and get bigger, they are increasingly reliably detected. This is also true for the sufficiently big objects that are far away from the USV. The following rule-of-thumb calculation for the big or approaching objects can be performed. Let us assume that a successful detection means any detection of a true obstacle if we detect it at least once in $N$ consecutive frames. Assuming independent detections with per-frame detection probability (recall) $p$, the probability of a successful detection is therefore
$$P_{\mathrm{succ}} = 1 - (1 - p)^{N}. \qquad (14)$$
If we do not apply any thresholding, we can detect any object, regardless of its size with probability . The probability of a false positive occurring in any image is 0.055. By applying a small threshold, the detection remains unchanged, but the probability of a false positive occurring in a particular frame goes down to . If we choose to focus only on the objects that are at least thirty by thirty pixels large, then the probability of detection goes up to , and the probability of detecting a false positive in any frame goes down to . It should be noted that the model in (14) assumes independence of detections over the sequence of images. While such assumptions may indeed be restrictive for temporal sequences, we still believe that the model gives a good rule-of-thumb on expected real-life obstacle detection performance of the segmentation algorithm.
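The rule-of-thumb can be evaluated numerically with a short Python sketch. It assumes, as the text does, that detections in different frames are independent, so the chance of at least one detection in a window of frames is one minus the chance of missing in every frame. The function name `prob_at_least_once` and the example values are hypothetical; they are not the values reported in Table IV.

```python
def prob_at_least_once(p_frame, n_frames):
    """Probability that an event with per-frame probability p_frame
    occurs at least once in n_frames independent frames."""
    if not 0.0 <= p_frame <= 1.0:
        raise ValueError("p_frame must lie in [0, 1]")
    if n_frames < 0:
        raise ValueError("n_frames must be non-negative")
    # All n_frames frames miss with probability (1 - p_frame) ** n_frames.
    return 1.0 - (1.0 - p_frame) ** n_frames


# Hypothetical per-frame recall of 0.8 observed over 3 consecutive frames.
example = prob_at_least_once(0.8, 3)
```

Under the independence assumption, even a moderate per-frame recall gives a high chance of at least one detection within a few frames. Correlated misses in real video would lower this figure.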
7 Discussion and conclusion
A graphical model for semantic segmentation of marine scenes was presented and applied to USV obstacle-map estimation. The model exploits the fact that scenes a USV encounters may be decomposed into three dominant visually- and semantically-distinctive components, one of which is the water. The appearance is modelled by a mixture of Gaussians and accounts for the outliers by a uniform component. The geometric structure is enforced by placing weak priors over the component means. An MRF model is applied on the prior and posterior pixel-label distributions to account for the interactions across neighboring pixels. An EM algorithm is derived for fitting the model to an image, which affords fast convergence and efficient implementation. The proposed model directly applies straightforward features, i.e., color channels and pixel positions, and avoids potentially slow extraction of more complex features. Nevertheless, the model is general enough to be directly applied without modifications to any other features. A straightforward approach for estimation of the weak prior model was proposed, which allows learning from a small number of training images and does not require accurate annotations. The results show excellent performance compared to related segmentation approaches, with improvements in segmentation accuracy as well as speed.
To evaluate the performance and analyze our algorithm, we have compiled and annotated a new real-life coastal line segmentation dataset captured from an onboard marine vehicle camera. This is the largest dataset of its kind to date and is as such another contribution to the field of robotic vision. We have studied the effects of the colorspace selection on the algorithm’s performance. We conclude that the algorithm is fairly robust to this choice, but obtains best results with the YCrCb and Lab colorspaces. The experimental results also show that the proposed algorithm significantly outperforms the related solutions. While the algorithm provides high detection rates at low false positives, it does so with minimal processing time. The speed comes from the fact that the algorithm can be implemented through convolutions and from the fact that it performs robustly on small images. The results have also shown that the proposed method outperforms the related methods by a large margin in terms of robustness in the extreme conditions, when the vehicle is facing the sun. To make the present paper reproducible and to help other researchers compare their work to ours, the Modd dataset is made publicly available, along with all the Matlab evaluation routines, a reference Matlab implementation of the presented approach and the routines for learning the weak priors.
Note that the fast performance is of crucial importance for real-life implementations on USVs, as it allows the use in onboard embedded controllers and low-cost embedded, low-resolution cameras. Our future work will focus on two extensions of our algorithm. We will explore possibilities of porting our algorithm to such an embedded sensor. Since many modern embedded devices contain GPUs, we will also explore parallelization of our algorithm by exploiting the fact that it is based on convolution operations, which can be efficiently parallelized. Our model is fully probabilistic and as such affords a principled way for information fusion to improve performance. We will explore combinations with additional external sensors such as inertial sensors, cameras of other modalities and stereo systems. In particular, an IMU can be used to modify the priors and soft reset parameters on-the-fly as well as to estimate the position of the horizon in the images. The segmentation model can then be constrained by hard-assigning pixels above the horizon to the non-water class. Temporal constraints on segmentation can be further imposed by image-based ego-motion estimation using techniques from structure-from-motion.
-  H. Heidarsson and G. Sukhatme, “Obstacle detection from overhead imagery using self-supervised learning for autonomous surface vehicles,” in Int. Conf. Intell. Robots and Systems, 2011, pp. 3160–3165.
-  C. Rasmussen, Y. Lu, and M. K. Kocamaz, “Trail following with omnidirectional vision,” in Int. Conf. Intell. Robots and Systems, 2010, pp. 829 – 836.
-  M. Montemerlo, S. Thrun, H. Dahlkamp, and D. Stavens, “Winning the darpa grand challenge with an ai robot,” in AAAI Nat. Conf. Art. Intelligence, 2006, pp. 17–20.
-  H. Dahlkamp, A. Kaehler, D. Stavens, S. Thrun, and G. Bradski, “Self-supervised monocular road detection in desert terrain,” in RSS, Philadelphia, USA, August 2006.
-  S. M. Ettinger, M. C. Nechyba, P. G. Ifju, and M. Waszak, “Vision-guided flight stability and control for micro air vehicles,” Advanced Robotics, vol. 17, no. 7, pp. 617–640, 2003.
-  Y. Lu and C. Rasmussen, “Simplified markov random fields for efficient semantic labeling of 3D point clouds,” in IROS, 2012.
-  A. Rankin and L. Matthies, “Daytime water detection based on color variation,” in Int. Conf. Intell. Robots and Systems, 2010, pp. 215–221.
-  S. Scherer, J. Rehder, S. Achar, H. Cover, A. Chambers, S. Nuske, and S. Singh, “River mapping from a flying robot: state estimation, river detection, and obstacle mapping,” Auton. Robots, vol. 33, no. 1-2, pp. 189–214, 2012.
-  C. Onunka and G. Bright, “Autonomous marine craft navigation: On the study of radar obstacle detection,” in ICARCV, 2010, pp. 567–572.
-  H. Heidarsson and G. Sukhatme, “Obstacle detection and avoidance for an autonomous surface vehicle using a profiling sonar,” in ICRA, 2011, pp. 731–736.
-  L. Elkins, D. Sellers, and W. M. Reynolds, “The autonomous maritime navigation (amn) project: Field tests, autonomous and cooperative behaviors, data fusion, sensors, and vehicles,” Journal of Field Robotics, vol. 27, no. 6, pp. 790–818, 2010.
-  T. H. Hong, C. Rasmussen, T. Chang, and M. Shneier, “Fusing ladar and color image information for mobile robot feature detection and tracking,” in IAS, 2002.
-  P. Santana, R. Mendica, and J. Barata, “Water detection with segmentation guided dynamic texture recognition,” in IEEE Robotics and Biomimetics (ROBIO), 2012.
-  D. Socek, D. Culibrk, O. Marques, H. Kalva, and B. Furht, “A hybrid color-based foreground object detection method for automated marine surveillance,” in Advanced Concepts for Intelligent Vision Systems. Springer, 2005, pp. 340–347.
-  S. Fefilatyev and D. Goldgof, “Detection and tracking of marine vehicles in video,” in Proc. Int. Conf. Pattern Recognition, 2008, pp. 1–4.
-  H. Wang, Z. Wei, S. Wang, C. Ow, K. Ho, and B. Feng, “A vision-based obstacle detection system for unmanned surface vehicle,” in Int. Conf. Robotics, Aut. Mechatronics, 2011, pp. 364–369.
-  T. Huntsberger, H. Aghazarian, A. Howard, and D. C. Trotz, “Stereo vision based navigation for autonomous surface vessels,” JFR, vol. 28, no. 1, pp. 3–18, 2011.
-  S. Khan and M. Shah, “Object based segmentation of video using color, motion and spatial information,” in Comp. Vis. Patt. Recognition, vol. 2, 2001, pp. 746–751.
-  T. M. Nguyen and Q. M. J. Wu, “A nonsymmetric mixture model for unsupervised image segmentation,” IEEE Trans. Cybernetics, vol. 43, no. 2, pp. 751 – 765, 2013.
-  N. Nasios and A. Bors, “Variational learning for gaussian mixture models,” IEEE Trans. Systems, Man and Cybernetics, B, vol. 36, no. 4, pp. 849–862, 2006.
-  P. Felzenszwalb and D. Huttenlocher, “Efficient graph-based image segmentation,” Int. J. Comput. Vision, vol. 59, no. 2, pp. 167–181, 2004.
-  J. Besag, “On the statistical analysis of dirty pictures,” Journal of the Royal Statistical Society, vol. 48, pp. 259–302, 1986.
-  Y. Boykov and G. Funka-Lea, “Graph cuts and efficient nd image segmentation,” International Journal of Computer Vision, vol. 70, no. 2, pp. 109–131, 2006.
-  P. F. Felzenszwalb and O. Veksler, “Tiered scene labeling with dynamic programming,” in CVPR, 2010.
-  C. Wojek and B. Schiele, “A dynamic conditional random field model for joint labeling of object and scene classes,” in ECCV, 2008, pp. 733 – 747.
-  J. Lafferty, A. McCallum, and F. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in Proc. Int. Conf. Mach. Learning, 2001, pp. 282 – 289.
-  P. Kontschieder, S. Bulo, H. Bischof, and M. Pelillo, “Structured class-labels in random forests for semantic image labelling,” in ICCV, 2011, pp. 2190–2197.
-  S. Chen and D. Zhang, “Robust image segmentation using fcm with spatial constraints based on new kernel-induced distance measure,” IEEE Trans. Systems, Man and Cybernetics, B, vol. 34, no. 4, pp. 1907 – 1916, 2004.
-  T. M. Nguyen and Q. Wu, “Gaussian-mixture-model-based spatial neighborhood relationships for pixel labeling problem,” IEEE Trans. Systems, Man and Cybernetics, B, vol. 42, no. 1, pp. 193 – 202, 2012.
-  S. Makrogiannis, G. Economou, S. Fotopoulos, and N. Bourbakis, “Segmentation of color images using multiscale clustering and graph theoretic region synthesis,” IEEE Trans. Systems, Man and Cybernetics, B, vol. 35, no. 2, pp. 224 – 238, 2005.
-  W. Tao, H. Jin, and Y. Zhang, “Color image segmentation based on mean shift and normalized cuts,” IEEE Trans. Systems, Man and Cybernetics, B, vol. 37, no. 5, pp. 1382 – 1389, 2007.
-  M. Alpert, S. Galun, R. Basri, and A. Brandt, “Image segmentation by probabilistic bottom-up aggregation and cue integration,” in CVPR, 2012, pp. 1–8.
-  Z. Li, X. M. Wu, and S. F. Chang, “Segmentation using superpixels: A bipartite graph partitioning approach,” in CVPR, 2012.
-  X. Ren and J. Malik, “Learning a classification model for segmentation,” in ICCV, 2003, pp. 10 – 17.
-  H. Lu, R. Zhang, S. Li, and X. Li, “Spectral segmentation via midlevel cues integrating geodesic and intensity,” IEEE Trans. Cybernetics, vol. 43, no. 6, pp. 2170 – 2178, 2013.
-  A. Diplaros, N. Vlassis, and T. Gevers, “A spatially constrained generative model and an em algorithm for image segmentation,” IEEE Trans. Neural Networks, vol. 18, no. 3, pp. 798 – 808, 2007.
-  M. Kristan, J. Perš, V. Sulić, and S. Kovačič, “A graphical model for rapid obstacle image-map estimation from unmanned surface vehicles,” in Proc. Asian Conf. Computer Vision, 2014.
-  M. Kristan, A. Leonardis, and D. Skočaj, “Multivariate Online Kernel Density Estimation with Gaussian Kernels,” Patt. Recogn., vol. 44, no. 10–11, pp. 2630–2642, 2011.
-  M. Kristan and A. Leonardis, “Online discriminative kernel density estimator with gaussian kernels,” IEEE Trans. Cybernetics, vol. 44, no. 3, pp. 355 – 365, 2014.
-  S. Wang, J. Wang, and F. Chung, “Kernel density estimation, kernel methods, and fast learning in large data sets,” IEEE Trans. Cybernetics, vol. 44, no. 1, pp. 1 – 20, 2014.
-  M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” IJCV, vol. 88, no. 2, pp. 303–338, June 2010.
-  C. Rother, V. Kolmogorov, and A. Blake, “GrabCut: interactive foreground extraction using iterated graph cuts,” in SIGGRAPH, vol. 23, no. 3, 2004, pp. 309–314.
-  Y. Boykov, O. Veksler, and R. Zabih, “Fast approximate energy minimization via graph cuts,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 11, pp. 1222–1239, 2001.
-  S. Bagon, “Matlab wrapper for graph cut,” December 2006. [Online]. Available: http://www.wisdom.weizmann.ac.il/~bagon
-  D. Tick, A. Satici, J. Shen, and N. Gans, “Tracking control of mobile robots localized via chained fusion of discrete and continuous epipolar geometry, imu and odometry,” IEEE Trans. Cybernetics, vol. 43, no. 4, pp. 1237 – 1250, 2013.