Are object detection assessment criteria ready for maritime computer vision?

09/12/2018 ∙ by Dilip K. Prasad, et al.

Maritime vessels equipped with visible and infrared cameras can complement other conventional sensors for object detection. However, the application of computer vision techniques in the maritime domain has received attention only recently. The maritime environment poses its own unique requirements and challenges. Assessing the quality of detections is a fundamental need in computer vision. However, the conventional assessment metrics suitable for usual object detection are deficient in the maritime setting. Thus, at first sight, a large body of related work in computer vision appears inapplicable to the maritime setting. We discuss the problem of defining assessment metrics suitable for maritime computer vision. We consider new bottom edge proximity metrics as assessment metrics for maritime computer vision. These metrics indicate that existing computer vision approaches are indeed promising for maritime computer vision and can play a foundational role in the emerging field of maritime computer vision.




I Introduction

Maritime vessels (MV) are equipped with sensors such as radar, sonar and LIDAR for situational awareness. The automatic identification system (AIS) supports traffic data exchange over maritime communication channels, through which each MV with on-board AIS declares its position, speed, and intended path. The International Convention for the Safety of Life at Sea (SOLAS) requires AIS on ships of 300 gross tonnage and upwards engaged on international voyages and on all passenger ships. There is no such requirement for smaller MVs, including fishing boats and small to medium-sized cargo MVs. Such MVs are invisible in traffic data. Moreover, the AIS channel may be inaccessible for periods ranging from several minutes to a few hours [1].

Cameras in the visible and infrared (IR) range now play a complementary role by overcoming disadvantages of traditional sensors, such as the minimum range associated with radars and sonars [2, 3]. Thus, computer vision (CV) techniques can play an important role in detecting objects in the maritime environment, especially small and medium-sized MVs that have weak radar or sonar signatures and lack on-board AIS.

Maritime CV for object detection faces several challenges. Maritime video streams are characterized by scene flatness, i.e. a lack of landmarks and of marked lanes such as those on roads. The maritime scene presents a dynamic background that is difficult to model, with challenges such as a semi-stochastic wave background, sharp contrasts of wakes, possible occlusion of MVs, and weather and illumination conditions such as rain, haze and glint [4]. Further, planning the manoeuvre and deceleration for collision avoidance (CA) is challenging, since the distance and span of the MVs in the scene are related non-linearly to pixel positions along the direction perpendicular to the horizon [5, 6]; see Fig. 1(a).

Fig. 1: What is an acceptable detection of a maritime vessel? (a) Physical distances in a maritime scene vary non-linearly in image space [5, 6]. Collision avoidance requires an accurate estimate of the distance, which is related to the bottom edge of the vessel, and of the minimum span of a maritime object. (b) In the 10 examples above, green boxes denote ground truth, blue boxes denote acceptable detections, and red boxes denote unacceptable detections.

An appropriate maritime CV solution has to satisfy the following requirements:

detect and track MVs in the scene

determine accurate spans, positions and tracks of MVs

provide real-time results

perform in all weather and illumination conditions

Detection and tracking of MVs falls under the ensemble problem set of ‘detection and tracking in dynamic background’, which has been extensively studied in computer vision. The existing CV solutions in this ensemble can provide a firm foundation for developing dedicated CV solutions for maritime object detection. However, adopting these solutions for maritime CV faces a setback. As we show, traditional performance measures for object detection fail in the maritime environment, which raises the following question: How do we assess the quality of detection for maritime computer vision?

We show that assessment metrics such as intersection over union (IOU) and intersection over ground truth (IOG), which are most often used in object detection, are unsuitable for maritime CV. They are deficient in assessing the accuracy of the span and distance of detected MVs. Either the detection method must provide a very high IOU, say 90%, or a customized assessment metric is needed to meet the requirements of maritime CV. The aim of this paper is to design custom assessment metrics that assess the quality of detected objects well without placing severe demands on detection algorithms.

We discuss two new assessment metrics customized for maritime computer vision. We also study the performance of existing background subtraction (BGS) algorithms and of regions with convolutional neural network (R-CNN) features using conventional and proposed assessment metrics. We show that the conventional metrics indicate a general unsuitability of BGS algorithms for maritime CV, whereas the new metrics offer hope of using them in maritime CV. We expect that this exercise will provide useful pointers for developing maritime CV solutions.

The assessment requirements of maritime CV are discussed in section II. The deficiency of conventional metrics for maritime CV is discussed in section III. The proposed bottom edge proximity metrics are presented and compared with conventional metrics in section IV. Experimental results of existing BGS algorithms and R-CNN on a maritime dataset are presented in section V. Section VI concludes this paper with a discussion on the future outlook for maritime CV.

Fig. 2: The notations relevant to the conventional metrics and the proposed bottom edge proximity (BEP) metrics are shown here.

II Requirements for maritime CV

Before discussing the suitability of conventional metrics, or lack thereof, we consider the fundamental question: ‘What is an acceptable detection of a maritime vessel?’. It is important to accurately estimate the location of the MV in a scene (given by the bottom edges of the MV) and its minimum span (determined by the width of the MV in pixels and its position in the image frame). See Fig. 1(a) for illustration. Consider the example cases 1-10 shown in Fig. 1(b). Example 1 is close to ideal, where the bounding box (BB) of the detected object (DO) is almost the same as the BB of the ground truth (GT).

Although there is a large variety of MVs, in general, an MV is characterized by a hull and an optional super-structure, i.e. all parts above the hull, including masts. The existing CV solutions may detect the hull and the super-structure separately for two reasons. First, the super-structure is not an essential component, and supervised learning approaches may be undertrained for vessels with super-structures. Second, stark differences in the geometry, color, and other image features of the hull and the super-structure imply that the super-structure may appear as an independent object. The hull or the super-structure may even be left undetected, as in the case of sailboats, due to a lack of contrast between the background and the super-structure. Consequently, DOs may appear as shown in examples 2-4.

For collision avoidance, accurate detection of the hull is important, irrespective of whether the super-structure is included in the DO with the hull (example 1), detected independently (example 4), or not detected at all (examples 2 and 3). Furthermore, the physical distance between the MV and the sensor is mapped non-linearly in an image along the direction perpendicular to the horizon (see Fig. 1(a)). This means that the image line corresponding to the horizon is at infinity, while the bottommost pixel row is only a few meters away from the sensor. Thus, an incorrect estimate of the bottom of the hull may result in a hugely incorrect estimate of the physical distance. That said, for collision avoidance it is preferable to slightly underestimate the distance between the sensor and an MV rather than to overestimate it. In this sense, the DOs in examples 2 and 3 of Fig. 1(b) are acceptable.
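To make the non-linearity concrete, the sketch below uses a simplified camera model that is our assumption, not the model of [5, 6]: a pinhole camera with a horizontal optical axis mounted at height h above a flat sea (Earth curvature, camera tilt and lens distortion are ignored). Under these assumptions, a sea-surface point at range d is imaged f·h/d rows below the horizon row y_h, so d = f·h/(y − y_h), where f is the focal length in pixels. The function name and the camera parameters in the example are hypothetical.

```python
import math


def range_from_bottom_row(y_bottom: float, y_horizon: float,
                          focal_px: float, cam_height_m: float) -> float:
    """Range in metres to a sea-surface point imaged at row y_bottom.

    Hypothetical flat-sea pinhole model: horizontal optical axis,
    camera cam_height_m above the water, image rows increasing downward.
    """
    rows_below_horizon = y_bottom - y_horizon
    if rows_below_horizon <= 0:
        return math.inf  # on or above the horizon line: no finite range
    return focal_px * cam_height_m / rows_below_horizon


if __name__ == "__main__":
    f_px, h_m, y_h = 2000.0, 10.0, 500.0  # illustrative camera parameters
    for rows in (20, 200):
        d = range_from_bottom_row(y_h + rows, y_h, f_px, h_m)
        d_err = range_from_bottom_row(y_h + rows + 1, y_h, f_px, h_m)
        print(f"{rows} rows below horizon: {d:.1f} m; "
              f"one row lower reads {d - d_err:.1f} m closer")
```

With these illustrative numbers, a one-row error in the bottom edge about 20 rows below the horizon shifts the range by roughly 48 m, whereas the same error 200 rows below the horizon shifts it by about 0.5 m. Placing the bottom edge too low (larger row index) always yields a smaller range, which is the safer underestimate discussed above.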

Current BGS solutions for object detection struggle with the presence of the wakes of maritime vessels [4]. Wakes are often detected as part of the MVs, as shown in examples 5-7. Following the same logic as underestimating the distance between the sensor and the detected MV, it is safer if the estimated width is not less than the actual span. Thus, horizontal wakes becoming part of the DO is acceptable, though not preferable. However, a large extension of the DO in the vertical direction below the hull may result in a grossly incorrect estimate of distance, and is not preferred (see example 7).

Occlusion has significant implications for collision avoidance. The extension of a DO due to occlusion in any direction may mean that the MV with the smaller pixel footprint is not detected (see examples 8-10). Though the DOs in all these examples are not preferred, the implications are much more severe for examples 9-10, which involve a small MV (kayak) with no on-board communication channel and poor detectability in radar and sonar. These situations call for a close to perfect overlap between the DO and the GT. Even between examples 9 and 10, example 10 is the less preferred detection. In example 10, the DO leads to a gross underestimation of the location of the large MV and a missed detection of a kayak that is much closer to the sensor, highly agile, and invisible in other communication and sensor streams.

Fig. 3: The current metrics are unsuitable for assessing detected objects in maritime CV. For the same values of a, b, and c, one DO may be preferred over others (a,b). Increasing IOU, Dice Index, or IOG metrics need not indicate better detections (c).

Fig. 4: BEP is sensitive to the bottom edges of the DO and GT (a). is more strict than (b). is more strict than (c). Thus is more strict than .

III Conventional criteria vs. maritime CV needs

Assessment of the quality of detection is usually performed through similarity metrics, such as the Jaccard index (also called IOU) or the Dice index. Their generalized form is given by the Tversky index [7], defined as follows:

$$T(\alpha,\beta)=\frac{|\mathrm{GT}\cap\mathrm{DO}|}{|\mathrm{GT}\cap\mathrm{DO}|+\alpha\,|\mathrm{GT}\setminus\mathrm{DO}|+\beta\,|\mathrm{DO}\setminus\mathrm{GT}|}$$

where $|\cdot|$ denotes area, and the three regions $\mathrm{GT}\setminus\mathrm{DO}$, $\mathrm{GT}\cap\mathrm{DO}$, and $\mathrm{DO}\setminus\mathrm{GT}$ are labeled in Fig. 2(a). The parameter $\alpha$ emphasizes the allegiance of the overlapped region with GT, while the parameter $\beta$ emphasizes the allegiance of the overlapped region with DO. Similarity metrics usually employ symmetry with respect to GT and DO, i.e. $\alpha=\beta$. The Dice index corresponds to $\alpha=\beta=0.5$ and the widely used IOU corresponds to $\alpha=\beta=1$. A detection is assessed as a true positive if $\mathrm{IOU}\geq\tau$ for a chosen threshold $\tau$. A similar threshold is employed if other similarity metrics are used. Usually in CV, $\mathrm{IOU}\geq 0.5$ is considered sufficient. We consider an additional asymmetric metric with $\alpha=1$, $\beta=0$, which we refer to as intersection over ground truth (IOG). This metric assesses the intersection area with respect to the area of GT, $|\mathrm{GT}|$, only. Thus, excess span detection due to wakes (examples 5-7 in Fig. 1(b)) or excess detection in the vertical direction below the hull (example 3 in Fig. 1(b)) does not affect the assessment negatively if the metric IOG is used.
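The special cases above can be checked numerically. The sketch below assumes axis-aligned boxes given as (x1, y1, x2, y2) with y increasing downward; the function names are ours. It computes the Tversky index from the three regions and derives IOU, Dice and IOG from it by choosing the weights on the missed-GT and extra-DO areas.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) with x2 > x1, y2 > y1


def area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def intersection(a: Box, b: Box) -> float:
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)


def tversky(gt: Box, do: Box, alpha: float, beta: float) -> float:
    """Tversky index of a ground truth box and a detected box."""
    inter = intersection(gt, do)
    gt_only = area(gt) - inter  # part of GT missed by the DO, weighted by alpha
    do_only = area(do) - inter  # part of DO outside the GT, weighted by beta
    denom = inter + alpha * gt_only + beta * do_only
    return inter / denom if denom > 0 else 0.0


def iou(gt: Box, do: Box) -> float:
    return tversky(gt, do, 1.0, 1.0)


def dice(gt: Box, do: Box) -> float:
    return tversky(gt, do, 0.5, 0.5)


def iog(gt: Box, do: Box) -> float:
    return tversky(gt, do, 1.0, 0.0)
```

For example, with GT (0, 0, 10, 10), a DO widened by a wake to (0, 0, 20, 10) gets IOU 0.5 but IOG 1.0. A DO that instead extends far below the hull, such as (0, 0, 10, 30), also gets IOG 1.0, which illustrates why IOG alone cannot police the bottom edge.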

The essential problem with the above metrics is that two cases may have the same areas a, b, and c, yet one may be a preferred detection over the other. See Fig. 3(a,b) for examples. Also, an increasing value of the above metrics need not imply a better detection, as shown in Fig. 3(c). New metrics that specifically account for the importance of the bottom edge of the hull are needed.

IV Proposed bottom edge proximity criteria

We consider two new criteria that specifically judge the accuracy of detection of the bottom edge (BE) and the span of the DO. We call them bottom edge proximity 1 ($\mathrm{BEP}_1$, which appears here for the first time) and bottom edge proximity 2 ($\mathrm{BEP}_2$, recently proposed in [8]). $\mathrm{BEP}_1$ is symmetric with respect to DO and GT, while $\mathrm{BEP}_2$ is biased towards allegiance with GT. We use the notations in Fig. 2(b) for the definitions of $\mathrm{BEP}_1$ and $\mathrm{BEP}_2$ presented next.

Bottom edge proximity 1 ($\mathrm{BEP}_1$)

We define where


The smaller the distance between the edges of the GT and DO, the larger is . See Fig. 4(a). However, if the DO is significantly smaller than GT, becomes poorer. Thus, it indirectly embeds the vertical size of DO in comparison with GT. This is shown in Fig. 4(c).

Bottom edge proximity 2 ($\mathrm{BEP}_2$)

We define where


We note that is stricter than . This is because is less tolerant to extended span of DO due to wakes as well as occlusions, as shown in Fig. 4(b). Further, is sensitive to the size of DO if the DO is smaller than the GT, as shown in Fig. 4(c).

For convenience, we refer to and as metrics. Similarly, we refer to and as metrics. An advantage of BEP metrics is that the threshold(s) for assessing a detection as a true positive can be chosen flexibly. Either a single threshold can be applied to the net BEP score, or two separate thresholds can be applied to the two groups of metrics independently, in which case a detection is assessed as a TP only if both threshold conditions are satisfied.
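The BEP formulas themselves did not survive in this text, so the sketch below does not reproduce them. It only illustrates the two-threshold decision rule. As stand-ins for the span component it uses two 1-D overlaps along the horizontal axis, an IOU-like (symmetric) one and an IOG-like (GT-biased) one, in keeping with the remark in the qualitative comparison that the span components are 1-dimensional analogues of IOU and IOG. The bottom-edge score is simply taken as an input. All function names are hypothetical, and the default thresholds 0.7 and 0.75 are the pair used in Table I.

```python
from typing import Tuple

Span = Tuple[float, float]  # (x_left, x_right) of a box along the horizontal axis


def span_overlap(gt: Span, do: Span) -> float:
    """Length of the horizontal overlap between two spans (0 if disjoint)."""
    return max(0.0, min(gt[1], do[1]) - max(gt[0], do[0]))


def span_iou_1d(gt: Span, do: Span) -> float:
    """Symmetric, IOU-like span score: overlap / union length."""
    inter = span_overlap(gt, do)
    union = (gt[1] - gt[0]) + (do[1] - do[0]) - inter
    return inter / union if union > 0 else 0.0


def span_iog_1d(gt: Span, do: Span) -> float:
    """GT-biased, IOG-like span score: overlap / GT length."""
    gt_len = gt[1] - gt[0]
    return span_overlap(gt, do) / gt_len if gt_len > 0 else 0.0


def is_true_positive(span_score: float, edge_score: float,
                     tau_x: float = 0.7, tau_y: float = 0.75) -> bool:
    """Two-threshold rule: each component must pass its own threshold."""
    return span_score >= tau_x and edge_score >= tau_y
```

For a GT span (100, 200) and a wake-extended DO span (100, 260), the IOU-like score is 0.625 and the IOG-like score is 1.0. With a hypothetical bottom-edge score of 0.9, the detection is rejected under the IOU-like span score and accepted under the IOG-like one; a poor bottom-edge score rejects it either way.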

Qualitative comparison for examples in Fig. 1(b)

We perform a qualitative comparison of the metrics IOU, Dice index, IOG, $\mathrm{BEP}_1$, and $\mathrm{BEP}_2$ on the examples in Fig. 1(b), which were used to study acceptable and unacceptable detections for maritime CV. The results are shown in Table I. We briefly discuss the selection of the thresholds (given in parentheses) for the metrics. Since the threshold value of 0.5 is conventionally used in object detection [9], we use this value for IOU. Similarly, we use 0.5 as the threshold for the Dice index and IOG. Since the span components of $\mathrm{BEP}_1$ and $\mathrm{BEP}_2$ are 1-dimensional analogues of the 2-dimensional IOU and IOG, we use a threshold value of 0.7 for them (approximately $\sqrt{0.5}$, the 1-dimensional counterpart of an area threshold of 0.5). Lastly, we use a threshold value of 0.75 for the bottom edge components because the accuracy of the bottom edge is critical in collision avoidance.

As discussed before, conventional metrics that use shown in Fig. 3 are not suitable for assessing detections in maritime CV. This is evident in Table I, where IOU, Dice index, and IOG have successes for less than half the number of examples. performs better, getting 6 successes out of 10 examples. performs the best, getting success in all the 10 examples. We further study the and metrics, also provided in Table I. Notably, is less strict in assessing TPs, assessing all DOs as true positives. In , consequently plays the role of suitable metric, providing correct assessment for all the 10 examples. is only slightly poorer than , providing 8 correct assessments out of 10. Thus, the role of bottom edge in correct assessment is verified.

Example 1 2 3 4 5 6 7 8 9 10 Number of Successes
Maritime CV TP TP TP FP TP TP FP FP FP FP Not applicable
Dice (0.5) TP FP FP TP TP FP TP TP TP TP 2
(0.7,0.75) TP TP TP FP FP FP FP TP TP FP 6
(0.7,0.75) TP TP TP FP TP TP FP FP FP FP 10
TABLE I: Qualitative comparison of metrics for the examples in Fig. 1(b). The thresholds used for determining TPs are given in parentheses; for BEPs, the pair gives the thresholds of the span and bottom edge components. The number of successes is the number of examples for which a metric's assessment matches the maritime CV row.
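The "Number of Successes" column is a simple agreement count between a metric's TP/FP calls and the maritime CV row. The sketch below recomputes it for the rows that survive in Table I. Because the metric labels of the two BEP rows are missing from the extracted table, the sketch names them by position; those names are hypothetical.

```python
MARITIME_CV = "TP TP TP FP TP TP FP FP FP FP".split()

# Assessments copied from Table I; the two BEP rows are named by position
# because their metric labels are missing from the extracted table.
TABLE_I_ROWS = {
    "Dice (0.5)": "TP FP FP TP TP FP TP TP TP TP".split(),
    "BEP row A (0.7,0.75)": "TP TP TP FP FP FP FP TP TP FP".split(),
    "BEP row B (0.7,0.75)": "TP TP TP FP TP TP FP FP FP FP".split(),
}


def successes(row, reference=MARITIME_CV):
    """Count examples where the metric's TP/FP call matches the reference."""
    if len(row) != len(reference):
        raise ValueError("row and reference must cover the same examples")
    return sum(call == ref for call, ref in zip(row, reference))


if __name__ == "__main__":
    for name, row in TABLE_I_ROWS.items():
        print(name, successes(row))
```

The counts reproduce the last column of Table I (2, 6 and 10), confirming that a success means agreement with the maritime CV judgement, whether that judgement is TP or FP.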
Group Methods in the group
Spatio-temporal filters (STF) - Temporal mean (TM) [10], Prati’s median (PM) [11], adaptive median (AM) [12], BGS [13]
Gaussian models (GM) - Simple Gaussian (SG) [14], Gaussian average (GA) [15], Grimson’s Gaussian mixture model (GMM) [16], Zivkovic’s adaptive GMM (AGMM) [17], mixture of Gaussians (MoG) [18], fuzzy Gaussian (FG) [19], type-2 fuzzy GMM - uncertain mean (T2FUM) [20], type-2 fuzzy GMM - uncertain variance (T2FUV)

Kernel models (KM) - Kernel density estimation (KDE) [21], VuMeter [22]
Self organizing maps (SOM) - Adaptive self organizing maps (ASOM) [23], fuzzy ASOM (FASOM) [23]
Low rank and sparsity (LRM) - Eigen-background (EB) [24], active subspace (AS) robust principal component analysis (RPCA) [25], fast (F) principal component pursuit (PCP) [26], Riemannian robust (R2) PCP [27], MoG-RPCA [28], non-convex (NC) RPCA [29], Grassmann average [30], greedy semi-soft go decomposition (GreGoDec) [31], orthogonal rank-one matrix pursuit (OR1MP) [32], Grassmannian rank-one update subspace estimation (GROUSE) [33], low-rank matrix completion by Riemannian optimization (LRGeomCG) [34], non-negative matrix factorization (NMF) with sparse matrix (LS2) [35], deep semi NMF (DSNMF) [36], alternating direction method of multipliers (ADMM) [37], robust orthonormal subspace learning (ROSL) [38]
Texture, color, and regions (TCR) - Texture BGS (TBGS) [39], independent multimodal background subtraction (IMBS) [40], multicue [41], local binary similarity segmenter (LOBSTER) [42], self-balanced sensitivity segmenter (SuBSENSE) [43]
TABLE II: List of background subtraction methods is presented here. The methods are grouped according to the central concept behind them. The best results of each group appear in Table III. The number of methods in each group is indicated in .

V Experiments and results

Detection of MVs in the maritime environment falls under the ensemble problem set of ‘detection in dynamic background’. CV methods solve it by modeling and subtracting the dynamic background, followed by segmentation of the foreground [bouwmans2014traditional, bouwmans2014background]. The dataset and the dynamic background subtraction methods used here are described below. We also consider deep learning for the detection of MVs. These details are presented first, followed by qualitative and quantitative results.
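As a rough illustration of the generic pipeline (model the background, subtract it, then segment the foreground into BBs), the sketch below uses a per-pixel temporal median over a stack of grayscale frames, a fixed difference threshold, and 4-connected component labelling. It is a simplified stand-in in the spirit of the median-based filters of Table II, not the bgslibrary or lrslibrary implementation of any listed method; the threshold of 25 grey levels and all function names are our assumptions. BBs are returned as (x1, y1, x2, y2) with exclusive right and bottom ends.

```python
from collections import deque

import numpy as np


def median_background(frames: np.ndarray) -> np.ndarray:
    """Per-pixel temporal median of a (T, H, W) grayscale frame stack."""
    return np.median(frames, axis=0)


def foreground_mask(frame: np.ndarray, background: np.ndarray,
                    thresh: float = 25.0) -> np.ndarray:
    """Boolean mask of pixels differing from the background by more than thresh."""
    return np.abs(frame.astype(float) - background) > thresh


def bounding_boxes(mask: np.ndarray) -> list:
    """BBs (x1, y1, x2, y2), exclusive x2/y2, of 4-connected foreground blobs."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    boxes = []
    for y0 in range(h):
        for x0 in range(w):
            if not mask[y0, x0] or seen[y0, x0]:
                continue
            seen[y0, x0] = True
            queue = deque([(y0, x0)])
            x1, y1, x2, y2 = x0, y0, x0, y0
            while queue:  # breadth-first flood fill of one blob
                y, x = queue.popleft()
                x1, y1, x2, y2 = min(x1, x), min(y1, y), max(x2, x), max(y2, y)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            boxes.append((x1, y1, x2 + 1, y2 + 1))
    return boxes
```

A fixed global threshold like this one would also flag waves, wakes and glint as foreground in real maritime footage, which is exactly the false-positive behaviour that motivates more elaborate background models.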


Dataset

We use on-shore (fixed camera) visible range maritime videos from the maritime dataset published with [4]. There are 34 high-definition videos captured with Canon 70D cameras and Canon EF 70-300mm f/4-5.6 IS USM lenses. The dataset was captured at different times, such as before sunrise, at sunrise, at midday, in the afternoon, in the evening, and 2 hours after sunset. We excluded the videos taken in haze and rain to avoid additional challenges. BBs of objects in each frame of the video are provided along with the dataset. Each BB is labeled with one of the following class labels: boat, buoy, ferry, flying bird/plane, kayak, sailboat, speed boat, vessel/ship, and others.

Dynamic background subtraction (BGS) methods tested here

We tested 22 BGS methods from the BGS library bgslibrary [18, 44] and 14 BGS methods from the low rank and sparse (LRS) tools library lrslibrary [45]. Default parameters were used for all the methods; tuning the parameters to achieve the best performance of each method is outside the scope of this work. All detected BBs smaller than 20 pixels in either dimension are rejected as obviously spurious detections. We group the 36 methods into six broad categories based on their central concept. The groups and the methods in each of them are listed in Table II. Among the 36 methods, only IMBS was developed specifically for maritime scenes.
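The size rule for spurious detections can be expressed as a one-line filter. The sketch assumes boxes given as (x1, y1, x2, y2) with exclusive right and bottom ends, so width is x2 − x1; keeping a box of exactly 20 pixels is our reading of "less than 20 pixels", and the function name is hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), exclusive x2 and y2


def reject_small_boxes(boxes: List[Box], min_size: int = 20) -> List[Box]:
    """Drop boxes narrower or shorter than min_size pixels.

    A box survives only if both its width and its height are at least
    min_size, so a box under the limit in either dimension is discarded.
    """
    return [b for b in boxes
            if (b[2] - b[0]) >= min_size and (b[3] - b[1]) >= min_size]
```

Because the rule looks at each dimension separately, a long but thin detection (for instance a 95 × 15 pixel strip of foam along a wave crest) is discarded just like a small speck.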

Regions with convolutional neural network (R-CNN) features for detection using deep learning

We conducted two experiments in deep learning. First, we randomly selected 20 videos from the dataset for training and trained R-CNN [46] with the AlexNet architecture. The results of this experiment were extremely poor and are not reported here. We attribute the poor performance to the challenging nature of the maritime scene and consider that maritime scenes may require camera- and illumination-specific training. In the second experiment, we formed the training dataset using every fifth frame of all the videos. The objective was to test whether R-CNN can detect the objects it has been trained for. R-CNN with the CIFAR-10 network architecture performed poorly, but R-CNN with AlexNet provided better results. We note that the use of R-CNN here [46] is a first attempt at applying deep learning to maritime CV. Better suited approaches may be identified in the future. Some options include faster R-CNN [47], long-term temporal convolution CNNs [48], and networks on convolutional feature maps [49].
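The frame selection of the second experiment can be sketched as follows. The sketch assumes frames are indexed from 0 and that the first selected frame is frame 0; it also collects the remaining frames of each video in a second list, which is our assumption about how they might be used. Video identifiers and frame counts in the example are hypothetical.

```python
from typing import Dict, List, Tuple


def split_every_nth(frame_counts: Dict[str, int], n: int = 5,
                    offset: int = 0) -> Dict[str, Tuple[List[int], List[int]]]:
    """Per video, put every n-th frame index (starting at offset) in the
    training list and all remaining frame indices in a second list."""
    if n < 1 or not 0 <= offset < n:
        raise ValueError("need n >= 1 and 0 <= offset < n")
    split = {}
    for video, count in frame_counts.items():
        train = list(range(offset, count, n))
        rest = [i for i in range(count) if i % n != offset]
        split[video] = (train, rest)
    return split
```

Since consecutive video frames are nearly identical, every remaining frame lies at most two frames away from a training frame. This is why such a split tests whether R-CNN recognises the objects it was trained on, rather than how well it generalises to unseen scenes.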

(a) Example frame 1 (b) Example frame 2
(c) Example frame 3 (d) Example frame 4
Fig. 5: Example results of CV methods for detection through dynamic BGS. The subtracted background appears white in the results of the methods. Ground truths: yellow BBs. Detected objects (foreground segmentations obtained after BGS): blue or red BBs. Red BBs: DOs referred to in the text.

Qualitative examples

We consider four example frames, each taken from a different video of the dataset. The detection results of 10 BGS methods and R-CNN are shown in Fig. 5. The selected BGS methods are the ones that consistently outperform other methods in their groups either in precision or in recall. These methods are identified in Table III. All BGS methods are ineffective at subtracting the background. In Fig. 5, all BGS methods except SuBSENSE detect false positive objects in the water background. This problem is more severe in frames 3 and 4, which show relatively more turbulent water.

Consider the fast moving objects in Fig. 5: E in frame 1, A in frame 2, and D in frame 4. Most methods generate a phantom foreground for these objects; exceptions include Prati’s median, SuBSENSE, and IMBS. Such phantoms may result in one wider detection or in multiple individual detections; see the KDE results for object A in frame 2 and object E in frame 1, respectively. These examples indicate a challenge not recognized in [4]. Dynamic BGS should accommodate large variations in the speeds of vessels (both in physical scale and in image scale) to avoid phantom detections of fast vessels.

In most methods, wakes result in wider BBs for object D in frame 4. The detected spans of the fast moving objects and of the objects with wakes are larger than the actual objects. For a fast moving object, information on the minimum span and the bottom edge is critically important for collision avoidance. Overestimating the span does little harm, although it is not preferred. Thus, despite wider BBs, these detections are useful for collision avoidance. The BB of SuBSENSE corresponding to object A in frame 2 is comparatively less acceptable, since it underestimates the span of the vessel. IOU (0.5) assesses it as a true positive, even though this detection indicates a deficiency of SuBSENSE for collision avoidance. Also, note that fuzzy Gaussian BGS generates one significantly larger BB in each example frame, with the bottom edge of the BB far below a GT’s bottom edge. IOG assesses it as a true positive, even though such detections are clearly deficient for collision avoidance.

Now, consider object A in frame 1 and objects B-D in frame 3. For these objects, several methods detect either the super-structure or the hull. Or, they break down the object into several smaller detections (note object A of frame 1). While the detected hulls indicate acceptable performance for collision avoidance, the detected super-structures or portions of the objects are unacceptable. BEPs are effective in assessing both these conditions appropriately.

Frame 2 presents an example of several occluded objects with small pixel footprints. Different methods give varied results, several of them useful for an initial estimate. This indicates potential for CV methods. However, suppressing false positive detections in the water background is important for drawing reasonable conclusions. At the same time, situations such as example 9 from Fig. 1(b) also occur in numerous places; see, for example, the results of eigen-background and KDE for example frame 2. Even with the BEP metrics, assessing such detections appropriately for collision avoidance in maritime CV is an open problem.

The results of R-CNN for the four example frames indicate that detections using R-CNN are better and less affected by wake. Moreover, DOs typically span both the hull and the super-structure. We note that the current implementation detects the same objects that it has been trained for, which is the reason for better quality of DOs. This approach is suitable only where environment specific training is feasible and practically useful.

Quantitative results

We assess the true positive (TP) detections in all the frames of all the videos in the dataset. The precision for the entire dataset is computed as the ratio of the total number of TPs to the total number of DOs. The recall is computed as the ratio of the total number of TPs to the total number of GTs. The assessment of TPs is performed using different assessment metrics and different threshold values for each of them. For IOU, Dice index, and IOG, we consider values 0.5, 0.7, and 0.9 for the threshold . We note that IOU (0.5) is recommended in the well-known Pascal challenge [9]. The threshold for BEP and BEP is 1-dimensional analogue of for IOU and IOG, respectively. Thus, we use three values , , and for . We use three values 0.6, 0.75, and 0.9 for the threshold . We include the results in which TPs are assessed using the metrics alone. The precision and recall values of the 6 BGS groups identified in Table II and the R-CNN are given in Table III. The precision and recall values are color coded for easy visual interpretation.
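The dataset-level precision and recall described above pool TP, DO and GT counts over all frames before dividing. The text does not state how DOs are paired with GTs, so the sketch below assumes greedy one-to-one matching, highest score first, with any pairwise score and threshold; the function names and the small example are hypothetical. Passing an IOG-like or a BEP-style scoring function in place of the default IOU changes only the `score` argument.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), exclusive x2 and y2
Frame = Tuple[Sequence[Box], Sequence[Box]]  # (ground truths, detections)


def iou(a: Box, b: Box) -> float:
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_tps(gts, dos, score, tau) -> int:
    """Greedy one-to-one matching: best-scoring pairs first, each box used once."""
    pairs = sorted(((score(g, d), gi, di)
                    for gi, g in enumerate(gts)
                    for di, d in enumerate(dos)), reverse=True)
    used_gt, used_do, tps = set(), set(), 0
    for s, gi, di in pairs:
        if s < tau:
            break  # pairs are sorted, so no later pair can pass
        if gi in used_gt or di in used_do:
            continue
        used_gt.add(gi)
        used_do.add(di)
        tps += 1
    return tps


def dataset_precision_recall(frames: List[Frame],
                             score: Callable[[Box, Box], float] = iou,
                             tau: float = 0.5) -> Tuple[float, float]:
    """Pool TPs, DOs and GTs over all frames, then take the two ratios."""
    tp = n_do = n_gt = 0
    for gts, dos in frames:
        tp += count_tps(gts, dos, score, tau)
        n_do += len(dos)
        n_gt += len(gts)
    precision = tp / n_do if n_do else 0.0
    recall = tp / n_gt if n_gt else 0.0
    return precision, recall
```

Pooling counts this way weights each detection equally, so frames crowded with false positives from the water background pull the overall precision down, consistent with the low precision values reported for most BGS groups.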

TCR methods are more effective at background subtraction than the other methods (see the results of SuBSENSE in Fig. 5). Hence, false positive detections due to the water background are very few, leading to better precision than the other methods. Also, the precision values of SuBSENSE for the BEP metric are not poor, considering that it was not developed specifically for the maritime domain. On the other hand, IMBS does not provide the best precision or recall even though it was developed specifically for the maritime domain. One reason could be that IMBS was developed for high-mounted cameras in urban maritime scenes, a setting different from that of the current dataset. The precision and recall results for R-CNN are, as expected, better than those of the other approaches. However, given that the R-CNN here detects the objects it has been trained for, even higher precision and recall would have been expected. These results clearly demonstrate the challenging nature of maritime CV.

The several false positives in most BGS methods (see Fig. 5) result in poor precision. Most methods have recall better than precision, with the exception of TCR methods. We also note that BEP values are more encouraging than IOU, Dice Index, IOG, and BEP. The better suitability of BEP was established in Table I. Moreover, it is noted in Table III that is less selective about TPs. This puts the responsibility on for improving the selectivity of BEP. On the other hand, is inherently more selective, as demonstrated by lower precision and recall values than . This directly helps in making BEP selective.

We compare the assessment metrics IOU(0.5) and BEP, which correspond to the most lenient threshold values. Recall values for BEP are better than IOU(0.5) in each group. For the strictest threshold values as well, recall values for BEP are better than IOU(0.9) in each group. The same can be inferred from the comparison of IOG and BEP, barring a few exceptions. Thus, although the conventional metrics indicate dismal performance of CV methods in the maritime domain, the picture is not so bleak when metrics designed for the maritime domain are used. This highlights the need for both suitable metrics and dedicated CV solutions.

parameters Precision Recall
IOU 0.5 0.01 0.00 0.01 0.01 0.00 0.15 0.28 0.14 0.11 0.10 0.10 0.14 0.07 0.41
0.7 0.00 0.00 0.00 0.00 0.00 0.05 0.12 0.05 0.04 0.04 0.03 0.05 0.02 0.18
0.9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.01 0.01 0.01 0.00 0.00
Dice 0.5 0.01 0.01 0.01 0.01 0.00 0.25 0.35 0.26 0.20 0.19 0.18 0.25 0.11 0.51
0.7 0.00 0.00 0.00 0.01 0.00 0.14 0.25 0.12 0.09 0.08 0.08 0.11 0.07 0.37
0.9 0.00 0.00 0.00 0.00 0.00 0.00 0.03 0.02 0.02 0.01 0.01 0.02 0.01 0.04
IOG 0.5 0.01 0.01 0.01 0.01 0.07 0.43 0.40 0.32 0.30 0.20 0.19 0.32 0.19 0.58
0.7 0.01 0.01 0.01 0.01 0.07 0.40 0.32 0.24 0.26 0.14 0.13 0.25 0.17 0.47
0.9 0.00 0.00 0.00 0.00 0.07 0.36 0.17 0.15 0.19 0.09 0.07 0.17 0.16 0.24
0.6 0.01 0.01 0.01 0.01 0.00 0.18 0.26 0.15 0.12 0.13 0.10 0.15 0.06 0.38
0.6 0.00 0.01 0.01 0.01 0.00 0.17 0.24 0.13 0.10 0.12 0.08 0.12 0.06 0.35
0.6 0.00 0.00 0.00 0.00 0.00 0.13 0.16 0.08 0.07 0.08 0.06 0.08 0.04 0.23
0.75 0.00 0.00 0.00 0.00 0.00 0.12 0.15 0.09 0.07 0.08 0.05 0.09 0.04 0.21
0.75 0.00 0.00 0.00 0.00 0.00 0.11 0.14 0.08 0.06 0.07 0.05 0.07 0.04 0.20
0.75 0.00 0.00 0.00 0.00 0.00 0.10 0.09 0.05 0.04 0.05 0.04 0.05 0.03 0.13
0.9 0.00 0.00 0.00 0.00 0.00 0.04 0.02 0.03 0.03 0.03 0.02 0.03 0.01 0.03
0.9 0.00 0.00 0.00 0.00 0.00 0.04 0.02 0.03 0.03 0.03 0.02 0.03 0.01 0.03
0.9 0.00 0.00 0.00 0.00 0.00 0.04 0.01 0.02 0.02 0.03 0.02 0.02 0.01 0.02
0.6 0.01 0.01 0.02 0.01 0.00 0.21 0.33 0.38 0.31 0.27 0.23 0.38 0.12 0.49
0.6 0.01 0.01 0.01 0.01 0.00 0.21 0.31 0.35 0.28 0.25 0.21 0.32 0.12 0.45
0.6 0.01 0.01 0.01 0.01 0.00 0.17 0.21 0.25 0.20 0.20 0.16 0.25 0.09 0.31
0.75 0.01 0.01 0.01 0.01 0.00 0.16 0.26 0.33 0.26 0.23 0.19 0.33 0.10 0.38
0.75 0.01 0.01 0.01 0.01 0.00 0.16 0.23 0.30 0.24 0.21 0.17 0.30 0.10 0.34
0.75 0.01 0.01 0.01 0.01 0.00 0.13 0.15 0.23 0.18 0.18 0.14 0.23 0.08 0.23
0.9 0.01 0.01 0.01 0.01 0.00 0.09 0.16 0.26 0.21 0.18 0.14 0.27 0.08 0.23
0.9 0.01 0.01 0.01 0.01 0.00 0.09 0.14 0.24 0.20 0.17 0.14 0.25 0.08 0.20
0.9 0.00 0.00 0.01 0.01 0.00 0.07 0.09 0.19 0.15 0.15 0.11 0.20 0.07 0.13
0.6 0.12 0.24 0.05 0.05 0.01 0.59 0.58 0.88 0.92 0.78 0.70 0.86 0.45 0.85
0.75 0.09 0.17 0.04 0.04 0.01 0.53 0.55 0.81 0.87 0.72 0.63 0.80 0.37 0.81
0.9 0.05 0.07 0.03 0.03 0.01 0.39 0.45 0.62 0.70 0.56 0.46 0.62 0.26 0.65
0.6 0.01 0.01 0.02 0.01 0.00 0.23 0.34 0.38 0.31 0.28 0.24 0.38 0.13 0.49
0.75 0.01 0.01 0.01 0.01 0.00 0.17 0.24 0.30 0.24 0.22 0.18 0.30 0.10 0.35
0.9 0.00 0.01 0.01 0.01 0.00 0.08 0.09 0.20 0.15 0.15 0.11 0.20 0.07 0.13
Consistently best















TABLE III: Precision and recall of CV methods for the maritime dataset. Precision (white), (faded red), (light red), (red), (dark red). Recall (white), (faded blue), (light blue), (blue), (dark blue). Best results for each group identified in Table II are presented here. In each group, the methods that consistently give the best precision or recall for most assessment criteria are indicated in the bottom row.

VI Discussion

We evaluated the existing metrics used for assessing the quality of BB detections in the context of maritime CV. The unique needs of maritime CV imply that the current metrics are unsuitable. The proposed bottom edge proximity metrics, custom designed for the maritime CV problem, provide a good starting point. However, there is a need to explore more options for assessing detections in maritime CV. Such assessment metrics should be strict in assessing the location of the bottom edges and the minimum span of the BBs, suitable for assessing inaccurate detections due to occlusion, and tolerant of BB degradation in the presence of wakes or the exclusion of the super-structure from the detected BB. It is worth considering whether the conventional BB labeling of GT is suitable for maritime CV. In particular, it should be explored whether the GT of each vessel should comprise GTs for the hull, the super-structure, and their union. An associated problem is to design the assessment of detected BBs for such GT. Creating shape segmentations as ground truth for large videos needs to be explored. Detections and their assessment in the form of shape segmentations can be explored for new maritime CV methods.

Our preliminary study of 36 background subtraction methods and two R-CNN experiments shows a gap in computer vision techniques for maritime applications. Appropriate modeling of the maritime background can reduce false positives and improve precision. Modeling wakes as background as well may allow stricter assessment of span (larger ) and thus better assessment of occlusions. The large range of speeds and sizes of maritime objects may require innovative approaches for learning the background with adaptive time scales in local regions. Deep learning also holds significant promise. Our current experiments assume the luxury of environment-specific training. A more generalizable deep learning framework is needed for practical maritime computer vision.

Maritime computer vision is at a nascent stage, and it is therefore too early to settle on a suitable metric. Better convergence on these topics will emerge with further engagement of the CV research community. This engagement can take the form of new, diverse maritime datasets and maritime CV challenges similar to the PASCAL challenge [9], with the goal of advancing autonomous maritime vehicle technology.


  • [1] E. Tu, G. Zhang, L. Rachmawati, E. Rajabally, and G.-B. Huang, “Exploiting AIS data for intelligent maritime navigation: A comprehensive survey from data to methodology,” IEEE Transactions on Intelligent Transportation Systems, vol. PP, no. 99, pp. 1–24, 2017.
  • [2] D. Bloisi and L. Iocchi, “ARGOS: A video surveillance system for boat traffic monitoring in Venice,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 23, no. 07, pp. 1477–1502, 2009.
  • [3] F. Robert-Inacio, A. Raybaud, and E. Clement, “Multispectral target detection and tracking for seaport video surveillance,” 2007, pp. 169–174.
  • [4] D. K. Prasad, D. Rajan, L. Rachmawati, E. Rajabally, and C. Quek, “Video processing from electro-optical sensors for object detection and tracking in maritime environment: a survey,” IEEE Transactions on Intelligent Transportation Systems, vol. 18, no. 8, pp. 1993–2016, 2017.
  • [5] P. J. Withagen, K. Schutte, A. M. Vossepoel, and M. G. Breuers, “Automatic classification of ships from infrared (FLIR) images,” in Signal Processing, Sensor Fusion, and Target Recognition VIII, vol. 3720, 1999, pp. 180–188.
  • [6] A. Cuzzocrea, E. Mumolo, and G. M. Grasso, “Advanced pattern recognition from complex environments: a classification-based approach,” Soft Computing, pp. 1–16, 2017.
  • [7] A. Tversky, “Features of similarity,” Psychological Review, vol. 84, no. 4, pp. 327–352, 1977.
  • [8] D. K. Prasad, C. K. Prasath, D. Rajan, L. Rachmawati, E. Rajabally, and C. Quek, “Object detection in maritime environment: Performance evaluation of background subtraction methods,” IEEE Transactions on Intelligent Transportation Systems, 2018.
  • [9] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The PASCAL visual object classes challenge: A retrospective,” International Journal of Computer Vision, vol. 111, no. 1, pp. 98–136, 2015.
  • [10] A. H. Lai and N. H. Yung, “A fast and accurate scoreboard algorithm for estimating stationary backgrounds in an image sequence,” in IEEE International Symposium on Circuits and Systems, vol. 4, 1998, pp. 241–244.
  • [11] S. Calderara, R. Melli, A. Prati, and R. Cucchiara, “Reliable background suppression for complex scenes,” in ACM International Workshop on Video Surveillance and Sensor Networks, 2006, pp. 211–214.
  • [12] N. J. McFarlane and C. P. Schofield, “Segmentation and tracking of piglets in images,” Machine Vision and Applications, vol. 8, no. 3, pp. 187–193, 1995.
  • [13] A. Manzanera and J. C. Richefeu, “A new motion detection algorithm based on background estimation,” Pattern Recognition Letters, vol. 28, no. 3, pp. 320–328, 2007.
  • [14] Y. Benezeth, P.-M. Jodoin, B. Emile, H. Laurent, and C. Rosenberger, “Review and evaluation of commonly-implemented background subtraction algorithms,” in International Conference on Pattern Recognition, 2008, pp. 1–4.
  • [15] C. R. Wren, A. Azarbayejani, T. Darrell, and A. P. Pentland, “Pfinder: Real-time tracking of the human body,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 780–785, 1997.
  • [16] C. Stauffer and W. E. L. Grimson, “Adaptive background mixture models for real-time tracking,” in IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, 1999, pp. 246–252.
  • [17] Z. Zivkovic, “Improved adaptive Gaussian mixture model for background subtraction,” in International Conference on Pattern Recognition, vol. 2, 2004, pp. 28–31.
  • [18] A. Sobral and A. Vacavant, “A comprehensive review of background subtraction algorithms evaluated with synthetic and real videos,” Computer Vision and Image Understanding, vol. 122, pp. 4–21, 2014.
  • [19] M. H. Sigari, N. Mozayani, and H. Pourreza, “Fuzzy running average and fuzzy background subtraction: concepts and application,” International Journal of Computer Science and Network Security, vol. 8, no. 2, pp. 138–143, 2008.
  • [20] Z. Zhao, T. Bouwmans, X. Zhang, and Y. Fang, “A fuzzy background modeling approach for motion detection in dynamic backgrounds,” in Multimedia and Signal Processing, 2012, pp. 177–185.
  • [21] A. Elgammal, D. Harwood, and L. Davis, “Non-parametric model for background subtraction,” in European Conference on Computer Vision, 2000, pp. 751–767.
  • [22] Y. Goya, T. Chateau, L. Malaterre, and L. Trassoudaine, “Vehicle trajectories evaluation by static video sensors,” in IEEE Intelligent Transportation Systems Conference, 2006, pp. 864–869.
  • [23] L. Maddalena and A. Petrosino, “A fuzzy spatial coherence-based approach to background/foreground separation for moving object detection,” Neural Computing and Applications, vol. 19, no. 2, pp. 179–186, 2010.
  • [24] N. M. Oliver, B. Rosario, and A. P. Pentland, “A Bayesian computer vision system for modeling human interactions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 831–843, 2000.
  • [25] G. Liu and S. Yan, “Active subspace: Toward scalable low-rank learning,” Neural Computation, vol. 24, no. 12, pp. 3371–3394, 2012.
  • [26] P. Rodriguez and B. Wohlberg, “Fast principal component pursuit via alternating minimization,” in IEEE International Conference on Image Processing, 2013, pp. 69–73.
  • [27] M. Hintermüller and T. Wu, “Robust principal component pursuit via inexact alternating minimization on matrix manifolds,” Journal of Mathematical Imaging and Vision, vol. 51, no. 3, pp. 361–377, 2015.
  • [28] Q. Zhao, D. Meng, Z. Xu, W. Zuo, and L. Zhang, “Robust principal component analysis with complex noise,” in International Conference on Machine Learning, 2014, pp. 55–63.
  • [29] Z. Kang, C. Peng, and Q. Cheng, “Robust PCA via nonconvex rank approximation,” in IEEE International Conference on Data Mining, 2015, pp. 211–220.
  • [30] S. Hauberg, A. Feragen, and M. J. Black, “Grassmann averages for scalable robust PCA,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3810–3817.
  • [31] T. Zhou and D. Tao, “Greedy bilateral sketch, completion & smoothing,” in International Conference on Artificial Intelligence and Statistics, 2013.
  • [32] Z. Wang, M.-J. Lai, Z. Lu, W. Fan, H. Davulcu, and J. Ye, “Orthogonal rank-one matrix pursuit for low rank matrix completion,” SIAM Journal on Scientific Computing, vol. 37, no. 1, pp. A488–A514, 2015.
  • [33] L. Balzano, R. Nowak, and B. Recht, “Online identification and tracking of subspaces from highly incomplete information,” in Annual Allerton Conference on Communication, Control, and Computing, 2010, pp. 704–711.
  • [34] B. Vandereycken, “Low-rank matrix completion by Riemannian optimization,” SIAM Journal on Optimization, vol. 23, no. 2, pp. 1214–1236, 2013.
  • [35] Y. Ji and J. Eisenstein, “Discriminative improvements to distributional sentence similarity,” in Conference on Empirical Methods in Natural Language Processing, 2013, pp. 891–896.
  • [36] G. Trigeorgis, K. Bousmalis, S. Zafeiriou, and B. W. Schuller, “A deep matrix factorization method for learning attribute representations,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 3, pp. 417–429, 2017.
  • [37] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein et al., “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
  • [38] X. Shu, F. Porikli, and N. Ahuja, “Robust orthonormal subspace learning: Efficient recovery of corrupted low-rank matrices,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3874–3881.
  • [39] M. Heikkila and M. Pietikainen, “A texture-based method for modeling the background and detecting moving objects,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 4, pp. 657–662, 2006.
  • [40] D. Bloisi and L. Iocchi, “Independent multimodal background subtraction,” in International Conference on Computational Modeling of Objects Presented in Images: Fundamentals, Methods and Applications, 2012, pp. 39–44.
  • [41] S. Noh and M. Jeon, “A new framework for background subtraction using multiple cues,” in Asian Conference on Computer Vision, 2012, pp. 493–506.
  • [42] P.-L. St-Charles and G.-A. Bilodeau, “Improving background subtraction using local binary similarity patterns,” in IEEE Winter Conference on Applications of Computer Vision, 2014, pp. 509–515.
  • [43] P.-L. St-Charles, G.-A. Bilodeau, and R. Bergevin, “Flexible background subtraction with self-balanced local sensitivity,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014, pp. 408–413.
  • [44] A. Sobral, “BGSLibrary: An OpenCV C++ background subtraction library,” 2013, pp. 1–16. [Online]. Available:
  • [45] A. Sobral, T. Bouwmans, and E.-h. Zahzah, “Lrslibrary: Low-rank and sparse tools for background modeling and subtraction in videos,” in Robust Low-Rank and Sparse Matrix Decomposition: Applications in Image and Video Processing.   CRC Press, Taylor and Francis Group.
  • [46] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
  • [47] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
  • [48] G. Varol, I. Laptev, and C. Schmid, “Long-term temporal convolutions for action recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PP, no. 99, pp. 1–1, 2017.
  • [49] S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun, “Object detection networks on convolutional feature maps,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 7, pp. 1476–1481, 2017.