Moving object detection is a popular research topic. Many methods have been proposed for detecting moving objects in complex scenes [1, 2, 3, 4, 5, 6], and several datasets have been collected and published [7, 8], promoting the development of the field. Among these works, the dominant output paradigm is a set of foreground masks: pixel-wise labels indicating whether each pixel belongs to the moving foreground. However, for subsequent tasks such as tracking and instance analysis, such masks lack results of direct practical value: there is no output indicating how many moving objects are in the scene, where they are, what their sizes are, or which pixels belong to the same moving object. Compared with pixel-wise labels, such instance-level outputs are far more helpful for subsequent processing. Thus, to a certain extent, the foreground masks obtained by the aforementioned methods are unshaped, and postprocessing procedures are needed to obtain instance-level information about the moving objects.
We address the problem of moving foreground analysis by constructing an optical flow based framework. Optical flow serves as the main pixel feature, providing crucial information throughout the analysis. Optical flow has been extensively studied [9, 10, 11, 12, 13] and widely used for video analysis [7, 14, 15, 16], benefiting from its direct reflection of the scene's motion. However, obtaining satisfactory results is still hindered by several common problems: precision problems caused by large displacements, occlusion and intensity changes; computational cost caused by algorithm complexity; and application problems caused by the discrete distribution of optical flow. FlowNet2.0 offers an optical flow estimation that is accurate enough for practical use and fast enough for online application.
In many other works, such as [4, 7], optical flow is estimated between adjacent frames. However, in most situations, due to high frame rates, relatively low object speeds and unsteady motion, the optical flow distinction between foreground and background is too small to be used directly for extracting foreground targets under interference. To enhance the feature discrimination between foreground and background, a dominant paradigm is to use point trajectories as the feature. As this requires a point tracking procedure to obtain the trajectories, the shortcomings of this approach include additional consumption of computing and storage resources, the introduction of extra interference, and a more complex feature. As described in Section II-A, we adopt an approximation of point trajectories that effectively enhances the feature discrimination while avoiding these problems.
Our framework integrates the foreground mask, the optical flow and the intensity information to extract useful instance-level information. To this end, as shown in Figure 1, there are two major steps: composition analysis and foreground segmentation. Composition analysis addresses the problems of how many moving objects are in the scene and which points can be used to initially locate them. Partitioning clustering algorithms are qualified for these problems as they produce a single-level clustering result. In this work, we adopt the Clustering by Fast Search and Find of Density Peaks (CFSFDP) method, as it provides an automatic mechanism for determining the number of clusters and offers a representative center for each cluster, which meets our demands. On the other hand, foreground segmentation addresses the problem of labeling the pixels of different instances differently. Hierarchical clustering algorithms fit this problem as they output multi-level nested decompositions. Considering the irregular shapes of moving objects and the continuous optical flow distribution inside a moving object, the Graph-Based Image Segmentation (GBIS) method is applied in this step, as it can be performed efficiently and outputs suitable results. We combine the two steps by using the result of composition analysis to guide the foreground segmentation, and using the result of foreground segmentation to merge the result of composition analysis. As a result, we suppress most false positives and obtain high precision. Furthermore, two situations are considered within our procedure: first, that two instances have different motion information; and second, that two instances are spatially apart from each other. These situations guide the procedure design and are reflected in the processing procedure.
We evaluate the proposed moving foreground analysis framework on ten video sequences provided by the ChangeDetection2014 dataset (CDnet2014). Qualitative and quantitative results are both listed in Section IV for reference. Moreover, the complementary effect of the two clustering algorithms is explored to prove its necessity. We also test the efficiency of the proposed framework in depth and offer some advice for practical applications.
The contributions of this work are as follows. Firstly, we construct an efficient optical flow based framework that analyzes the foreground masks obtained by moving object detection and outputs instance-level information of practical value. Secondly, through experiments, we demonstrate that the proposed framework is well adapted to the problem, and we offer some advice for applications.
Our goal is to segment the moving object detection result into different instances and obtain instance-level information about the moving objects. To this end, we construct the processing framework shown in Figure 1. There are four main processes: sampling, composition analysis, foreground segmentation and postprocessing. In the following, each part of the framework is introduced in detail.
II-A Optical Flow Estimation
We adopt an approximation of point trajectories: optical flow is estimated between frame t and frame t−n, where n is an integer parameter related to the application context and limited by the ability of the optical flow estimation algorithm to handle large displacements. As described in Formula (1), this vector sum of optical flow vectors is used to replace the point trajectories:

w(t→t−n, p_t) = w(t→t−1, p_t) + w(t−1→t−2, p_{t−1}) + ... + w(t−n+1→t−n, p_{t−n+1})    (1)

In this way, the feature discrimination can be enhanced in a manner that is quite simple but efficient.
where w(a→b, p) is the optical flow vector that projects the 2D location p in frame a to its corresponding location in frame b.
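The vector-sum accumulation along a point trajectory can be sketched as follows. This is an illustrative helper, not the authors' implementation: it assumes each entry of `flows` is a dense (H, W, 2) flow field (flows[i] maps frame t−i to frame t−i−1), and rounds each trajectory position to the nearest pixel before reading the next flow vector.

```python
import numpy as np

def compose_flows(flows):
    """Approximate the flow from frame t to frame t-n by summing the
    per-frame flow vectors along the (rounded) point trajectory.
    flows[i] has shape (H, W, 2), channels ordered as (dy, dx)."""
    h, w, _ = flows[0].shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    total = np.zeros((h, w, 2))
    cy, cx = ys.copy(), xs.copy()      # current trajectory positions
    for f in flows:
        iy = np.clip(np.rint(cy), 0, h - 1).astype(int)
        ix = np.clip(np.rint(cx), 0, w - 1).astype(int)
        step = f[iy, ix]               # flow at the current position
        total += step
        cy += step[..., 0]             # follow the trajectory
        cx += step[..., 1]
    return total
```

Estimating the large-displacement flow directly (as done with FlowNet2.0 in this framework) avoids maintaining these per-frame trajectories at all.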
As shown in Figure 2, as n increases, the difference between foreground and background gradually becomes evident, which has a significant impact on moving object detection. Taking both speed and accuracy into account, FlowNet2.0 is used to compute the optical flow vectors, which point from the latest frame t to frame t−n. These optical flow vectors are used as the main feature in the following procedures.
We sparsely sample points and retain only 1/9 of the total, as sampling more points contributes little to the system's performance but causes a huge amount of computation. In this paper, the computational complexity of the CFSFDP method is proportional to the square of the number of sample points, while that of the GBIS method is linear in the number of sample points. In addition, we select the foreground sample points from all sample points using the foreground masks provided by the ground truth.
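The sampling step can be sketched as follows. This is a hypothetical helper, assuming a binary foreground mask and a sample interval of 3 pixels in each dimension (which keeps 1/9 of all pixels):

```python
import numpy as np

def sample_foreground_points(mask, interval=3):
    """Sparsely sample the pixel grid every `interval` pixels in each
    dimension, then keep only the samples that fall inside the binary
    foreground mask. Returns an (N, 2) array of (y, x) coordinates."""
    h, w = mask.shape
    ys, xs = np.mgrid[0:h:interval, 0:w:interval]
    ys, xs = ys.ravel(), xs.ravel()
    keep = mask[ys, xs] > 0
    return np.stack([ys[keep], xs[keep]], axis=1)
```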
II-C Composition Analysis
In this subsection, we aim at finding out how many moving objects are in the scene and initially locating them with some representative points. The peak points of the density map defined in the CFSFDP method reflect the different individuals, as shown in Figure 3.
To find the peak points, the feature of sample point i is defined as the concatenation of its optical flow vector (u_i, v_i) and its weighted coordinates:

f_i = (u_i, v_i, λ·x_i, λ·y_i)
where x_i and y_i are the coordinates of sample point i, and λ is a parameter used to trade off the influence of the optical flow against that of the coordinates. Random sampling is adopted to retain a limited number of points, so as to keep the computational consumption of CFSFDP below a certain level while influencing the result as little as possible. The density ρ_i of sample point i is then defined as:

ρ_i = Σ_{j≠i} χ(d(i, j) − d_c), with χ(x) = 1 for x < 0 and χ(x) = 0 otherwise,
where d(i, j) is the distance between two points in a data space and d_c is a cutoff threshold. After that, δ_i, the minimum distance between point i and any other point with higher density, is defined as:

δ_i = min_{j: ρ_j > ρ_i} d(i, j).
In this work, we define two kinds of distance d(i, j): an optical flow distance d_o(i, j), calculated from the optical flow vectors, and a spatial distance d_s(i, j), calculated from the coordinates, yielding two minimum distances δ_o,i and δ_s,i. Finally, the peak points are selected by the following criterion:

(δ_o,i > θ_o or δ_s,i > θ_s) and ρ_i > θ_ρ,
where θ_o, θ_s and θ_ρ are three thresholds. The first condition means that a peak point should either have optical flow different from that of the points with higher density, or be spatially apart from them. The condition ρ_i > θ_ρ is used to exclude outliers. Two of the thresholds are set adaptively with respect to ρ_max, the maximum density inside the current frame.
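The composition analysis step can be illustrated with a small CFSFDP-style sketch. This is not the authors' implementation: the thresholds are passed as fixed illustrative parameters, and the density is counted in the joint flow-plus-coordinate space.

```python
import numpy as np

def find_peaks(flow_feat, coords, d_c, th_flow, th_space, th_rho):
    """CFSFDP-style peak detection sketch. rho counts neighbours within
    cutoff d_c in the joint feature space; delta is the distance to the
    nearest point of higher density, computed separately in flow space
    and in image space."""
    feat = np.hstack([flow_feat, coords])
    d = np.linalg.norm(feat[:, None] - feat[None, :], axis=2)
    rho = (d < d_c).sum(axis=1) - 1          # exclude the point itself
    d_flow = np.linalg.norm(flow_feat[:, None] - flow_feat[None, :], axis=2)
    d_space = np.linalg.norm(coords[:, None] - coords[None, :], axis=2)
    peaks = []
    for i in range(len(rho)):
        higher = rho > rho[i]
        if not higher.any():                 # global density maximum
            peaks.append(i)
            continue
        delta_f = d_flow[i, higher].min()
        delta_s = d_space[i, higher].min()
        if (delta_f > th_flow or delta_s > th_space) and rho[i] > th_rho:
            peaks.append(i)
    return peaks
```

On two well-separated clusters this returns one peak per cluster, matching the role the peaks play as initial object locations.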
II-D Foreground Segmentation
In this part, we adopt the GBIS method to divide the foreground points into different sections, with optical flow used as the feature of each pixel. Different from Felzenszwalb et al., under the influence of sparse sampling we construct edges between each point and its eight nearest neighbors to ensure the continuity of an instance. However, this produces many superfluous edges, as three edges per point are already enough to ensure the connectivity of the graph. So, in practice, only the four edges with minimum edge weight are constructed for each point. Then, Algorithm 1 of GBIS is used to segment the foreground. In Algorithm 1, the parameter k controls the degree of polymerization through the threshold function τ(C) = k/|C|, where |C| is the number of points inside a group. Since k reflects the desired size of the output groups, we set it as an adaptive variable indicating the desired object size:

k = N_f / N_p
where N_f denotes the number of foreground sample points and N_p denotes the number of peaks obtained in Section II-C. After Algorithm 1 finishes, a set of segments is obtained.
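The merging step of GBIS can be sketched with a union-find structure. This is an illustrative reimplementation, not the authors' code; the threshold function τ(C) = k/|C| follows Felzenszwalb and Huttenlocher, and edges are assumed to be precomputed (i, j, weight) triples.

```python
class DSU:
    """Union-find tracking component size and max internal edge weight."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n
        self.internal = [0.0] * n

    def find(self, a):
        while self.parent[a] != a:
            self.parent[a] = self.parent[self.parent[a]]
            a = self.parent[a]
        return a

    def union(self, a, b, w):
        a, b = self.find(a), self.find(b)
        if a == b:
            return
        if self.size[a] < self.size[b]:
            a, b = b, a
        self.parent[b] = a
        self.size[a] += self.size[b]
        self.internal[a] = max(self.internal[a], self.internal[b], w)

def segment(edges, n_points, k):
    """Felzenszwalb-Huttenlocher-style merging: an edge (i, j, w) merges
    two components when w <= min over both of internal + k/|C|."""
    dsu = DSU(n_points)
    for i, j, w in sorted(edges, key=lambda e: e[2]):
        a, b = dsu.find(i), dsu.find(j)
        if a != b:
            ta = dsu.internal[a] + k / dsu.size[a]
            tb = dsu.internal[b] + k / dsu.size[b]
            if w <= min(ta, tb):
                dsu.union(a, b, w)
    return [dsu.find(i) for i in range(n_points)]
```

A larger k raises the merging threshold and so yields larger output groups, which is why the framework ties k to the expected object size N_f / N_p.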
Firstly, each section that includes at least one peak point is selected to construct the set of final foreground instances.
Secondly, the peaks obtained in Section II-C are filtered: among all the peaks that belong to the same section, only the one with the highest density is retained on behalf of that section, yielding a set of representative peaks.
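The peak filtering rule can be sketched as follows, assuming each sample point carries a segment label from the foreground segmentation and a density value from CFSFDP (names illustrative):

```python
def filter_peaks(peaks, labels, rho):
    """Among peaks falling in the same segment, keep only the one with
    the highest density. peaks: point indices; labels[i]: segment id of
    point i; rho[i]: CFSFDP density of point i."""
    best = {}
    for p in peaks:
        seg = labels[p]
        if seg not in best or rho[p] > rho[best[seg]]:
            best[seg] = p
    return sorted(best.values())
```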
Finally, a minimum bounding box is used to enclose all sample points that belong to the same foreground instance, and the bounding box of the moving foreground is obtained by slightly enlarging it. In practice, we enlarge the width and height of the bounding box by a pixel count equal to the sparse sampling interval.
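The bounding box construction can be sketched as follows; here `enlarge` plays the role of the sparse sampling interval, split evenly between the two sides (an assumption, since the split is not specified):

```python
import numpy as np

def instance_bbox(points, enlarge):
    """Minimum bounding box over the (y, x) sample points of one
    instance, with width and height each enlarged by `enlarge` pixels
    in total. Returns (y0, x0, y1, x1)."""
    pts = np.asarray(points)
    y0, x0 = pts.min(axis=0)
    y1, x1 = pts.max(axis=0)
    half = enlarge / 2
    return (y0 - half, x0 - half, y1 + half, x1 + half)
```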
The proposed method is implemented in Matlab and tested on ten video sequences that contain different numbers of foreground instances in different scenes. In this section, the video sequences and evaluation metrics are introduced first. Then, the proposed method's output is evaluated qualitatively and quantitatively. Finally, the contribution of composition analysis and the frame rate of the proposed method are explored.
The video sequences are provided by the pedestrian subset of the ChangeDetection2014 dataset (CDnet2014). CDnet2014 offers the three kinds of information vital to our experiments: intensity images of the video sequences and pixel-wise foreground masks as input, and instance-level annotations in the form of bounding boxes as output reference. The pedestrian subset contains ten video sequences with 16864 pedestrian bounding-box annotations in total. Foreground analysis in these videos is challenging, as the scenes contain uncertain numbers, unbalanced sizes and irregular shapes of objects. Besides, occlusion is another main challenge.
III-B Evaluation Metrics
In this section, we discuss the three main metrics used for performance evaluation: Intersection over Union (IoU), Recall (Re) and Precision (Pr). The IoU metric measures the accuracy of a bounding box result; it is also used as a threshold for judging whether an instance is correctly analyzed. The Recall metric reflects how well the method recovers the instance-level information: as shown in Figure 5, methods whose recall curves are close to the top and right of the plot have a high success rate and high quality, respectively. The Precision metric reflects how many mistakes the method makes: as shown in Figure 5, methods whose precision curves are close to the top of the plot produce few false positive results.
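For reference, IoU over axis-aligned (y0, x0, y1, x1) boxes can be computed as:

```python
def iou(box_a, box_b):
    """Intersection over Union of two (y0, x0, y1, x1) boxes."""
    y0 = max(box_a[0], box_b[0]); x0 = max(box_a[1], box_b[1])
    y1 = min(box_a[2], box_b[2]); x1 = min(box_a[3], box_b[3])
    inter = max(0, y1 - y0) * max(0, x1 - x0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```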
III-C Parameter Setting
We estimate optical flow between frame t and frame t−n. In the sampling process, a fixed sample interval was used. For CFSFDP, the balance parameter λ and the three thresholds were set to fixed values.
III-D Quantitative Results
|Note: The ten sequences, numbered 1 to 10, are: Backdoor, BusStation, CopyMachine, Cubicle, Office, Pedestrians, PeopleInShade, PETS2006, Skating, Sofa.|
III-E Qualitative Results
Some bounding box results obtained by our method are illustrated in Figure 4. Although the scenes contain different challenges, comparing the detection results (the red bounding boxes) with the ground truth (the yellow bounding boxes), one can visually see that the proposed method outputs instance-level bounding boxes with a high success rate as well as high quality.
Our method outputs results with both high recall and high precision. An average recall exceeding ninety percent means that most instances inside the scenes are successfully found, which is high enough for practical applications. On the other hand, an average precision over ninety percent means that this result is obtained at the cost of only a very small number of false positives. This is significant in practice, as too many false positives would cause huge extra computational consumption in the subsequent processes.
III-F Effect of Composition Analysis
Table II shows the contribution of the CFSFDP method. Without CFSFDP for composition analysis, the recall drops slightly. This is because the parameter k must then be set as a constant, since the framework has no information about how many objects are inside the scene; the foreground segmentation process lacks an adaptive desired section size for guidance. As a result, some segments contain only part of an object, while others contain more than one object. Besides, the precision drops sharply. This is because without Formula (8), all segments produced by the foreground segmentation process are considered potential objects. As shown in Figure 1, the raw segmentation result contains many tiny false positive segments compared with the final output. Only by combining CFSFDP and GBIS can the framework suppress most false positives and greatly improve the precision performance.
Table III shows the computation time measured with Matlab on an Intel Core i5-7400 3.0GHz PC. As shown in the results, the optical flow estimation and foreground segmentation processes occupy most of the total time, while the composition analysis and postprocessing processes take relatively little time.
|Note: The entries show the time consumption of each process in ms per frame; one entry is quoted from its original publication.|
Further experiments show that the time consumption of the CFSFDP method is proportional to the square of the number of sample points, while that of the GBIS method is linear in the number of sample points. The computational consumption of CFSFDP becomes a bottleneck as the number of sample points increases, as shown in Figure 6(a). However, as shown in Figure 6(b), increasing the number of sample points contributes little to the detection results once it exceeds 200. Thus, the number of sample points is set accordingly.
In this work, we focus on the problem of analyzing the foreground masks obtained by moving object detection and outputting instance-level moving object information, which is of great significance to the application of moving object detection. To address this problem, we proposed an optical flow based framework that mainly utilizes two complementary clustering algorithms to analyze and segment the foreground. Besides, our framework outputs several kinds of moving object information that can be directly used in subsequent procedures such as tracking or instance analysis. In the experimental part, quantitative and qualitative results indicate that our framework is properly designed and effective enough for most practical applications.
-  Kurnianggoro L, Shahbaz A, Jo K H. Dense optical flow in stabilized scenes for moving object detection from a moving camera[C]//Control, Automation and Systems (ICCAS), 2016 16th International Conference on. IEEE, 2016: 704-708.
-  Kurnianggoro L, Yu Y, Hernandez D C, et al. Online background-subtraction with motion compensation for freely moving camera[C]//International Conference on Intelligent Computing. Springer, Cham, 2016: 569-578.
-  Li X, Xu C. Moving object detection in dynamic scenes based on optical flow and superpixels[C]//Robotics and Biomimetics (ROBIO), 2015 IEEE International Conference on. IEEE, 2015: 84-89.
-  Papazoglou A, Ferrari V. Fast object segmentation in unconstrained video[C]//Proceedings of the IEEE International Conference on Computer Vision. 2013: 1777-1784.
-  Moo Yi K, Yun K, Wan Kim S, et al. Detection of moving objects with non-stationary cameras in 5.8 ms: Bringing motion detection to your mobile device[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2013: 27-34.
-  Yun K, Choi J Y. Robust and fast moving object detection in a non-stationary camera via foreground probability based sampling[C]//Image Processing (ICIP), 2015 IEEE International Conference on. IEEE, 2015: 4897-4901.
-  Ochs P, Malik J, Brox T. Segmentation of moving objects by long term video analysis[J]. IEEE transactions on pattern analysis and machine intelligence, 2014, 36(6): 1187-1200.
-  Wang Y, Jodoin P M, Porikli F, et al. CDnet 2014: an expanded change detection benchmark dataset[C]//2014 IEEE conference on computer vision and pattern recognition workshops. IEEE, 2014: 393-400.
-  Bailer C, Taetz B, Stricker D. Flow fields: Dense correspondence fields for highly accurate large displacement optical flow estimation[C]//Proceedings of the IEEE international conference on computer vision. 2015: 4015-4023.
-  Cheng J, Tsai Y H, Wang S, et al. Segflow: Joint learning for video object segmentation and optical flow[C]//Computer Vision (ICCV), 2017 IEEE International Conference on. IEEE, 2017: 686-695.
-  Fortun D, Bouthemy P, Kervrann C. Optical flow modeling and computation: a survey[J]. Computer Vision and Image Understanding, 2015, 134: 1-21.
-  Ilg E, Mayer N, Saikia T, et al. Flownet 2.0: Evolution of optical flow estimation with deep networks[C]//IEEE conference on computer vision and pattern recognition (CVPR). 2017, 2: 6.
-  Zach C, Pock T, Bischof H. A duality based approach for realtime TV-L1 optical flow[C]//Joint Pattern Recognition Symposium. Springer, Berlin, Heidelberg, 2007: 214-223.
-  Colque R V H M, Caetano C, de Andrade M T L, et al. Histograms of optical flow orientation and magnitude and entropy to detect anomalous events in videos[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2017, 27(3): 673-682.
-  Kim J, Joo K, Oh T H, et al. Human body part classification from optical flow[C]//Ubiquitous Robots and Ambient Intelligence (URAI), 2016 13th International Conference on. IEEE, 2016: 903-904.
-  Shuifa Z, Wensheng Z, Huan D, et al. Background modeling and object detecting based on optical flow velocity field[J]. Journal of Image and Graphics, 2011, 16(2): 236-243.
-  Lee I, Yang J. Common clustering algorithms[M]. Elsevier, 2009.
-  Rodriguez A, Laio A. Clustering by fast search and find of density peaks[J]. Science, 2014, 344(6191): 1492-1496.
-  Felzenszwalb P F, Huttenlocher D P. Efficient graph-based image segmentation[J]. International journal of computer vision, 2004, 59(2): 167-181.