Optical Flow Based Real-time Moving Object Detection in Unconstrained Scenes

07/13/2018 ∙ by Junjie Huang, et al.

Real-time moving object detection in unconstrained scenes is a difficult task due to dynamic backgrounds, changing foreground appearance and limited computational resources. In this paper, an optical flow based moving object detection framework is proposed to address this problem. We utilize homography matrices to construct a background model online in the form of optical flow. To separate moving foregrounds from the scene, a dual-mode judge mechanism is designed to improve the system's adaptation to challenging situations. In the experiments, two evaluation metrics are redefined to reflect the performance of methods more properly. We quantitatively and qualitatively validate the effectiveness and feasibility of our method on videos with various scene conditions. The experimental results show that our method adapts itself to different situations and outperforms the state-of-the-art methods, indicating the advantages of optical flow based methods.




I Introduction

In this paper, we study the detection of moving objects. Many methods have been proposed and developed in depth for detecting moving objects in complex scenes. As a moving object is defined by its state of motion, it cannot be reliably detected by a well-trained feature based classifier like [3]. This common task is handled by frameworks that can be roughly classified into two categories. One analyzes foreground and background together to discriminate them into two classes [8][18]. The other obtains a discriminative background model for judging out the foreground points. For example, Haines and Xiang [5] used a statistical model, the Dirichlet process Gaussian mixture model (DP-GMM). Cui et al. [1] and Zhou et al. [26] modeled the background as a low rank matrix. Others used Fuzzy Models [9], Robust Subspace Models [4], Sparse Models [6], Optical Flow Velocity Field Models [25], etc. The methods mentioned above can, to some extent, extract the foreground. However, they mainly work under strong constraints such as stationary scenes [5][15][25], batch processing [1][8][18][26], or global optimization [1][8][26].

To get rid of these constraints, we propose an optical flow based framework. The framework adopts the background modeling approach but, unlike [25], models the background online and in scenes that contain background and foregrounds simultaneously. We first estimate the optical flow field with FlowNet2.0 [7]. Then we estimate an intermediate variable (i.e. the homography matrix), which gives a parametric description of the sensor’s motion. Unlike many other works [10][11][22][23], which estimate the homography matrix from point pairs obtained by the point tracking algorithms LK [14] or KLT [19], we obtain point pairs directly from the optical flow field. This avoids extra computation cost and avoids introducing unreliable information, as the LK and KLT tracking algorithms work without global optimization. Finally, the background is modeled in the form of optical flow using the homography matrix.

Subsequently, the moving foregrounds are judged out by thresholding the difference between the optical flow provided by the optical flow estimation algorithm and that provided by the background model. To increase the accuracy of the judgment and strengthen the system’s adaptation to different situations, a dual-mode judge mechanism is introduced in this work to deal with the problem caused by the sensor’s evident zooming (the details are described in Section III-C).

In the experiments, two evaluation metrics are redefined. If the F-Measure is defined as in [20], the results in frames that contain small foregrounds have little impact on the video-level result. We therefore calculate frame-level precision, recall and F-Measure first, and obtain the video-level result by averaging over all frames in the same video. In this way, the evaluation metrics can handle videos whose foreground size varies strongly between frames, and reflect the methods’ ability to detect foreground more properly. We test the robustness of the method using ten videos with various scene conditions. Our method qualitatively and quantitatively outperforms state-of-the-art algorithms in this test. Moreover, we also test the efficiency of the proposed framework in depth and offer some advice for practical applications.

The contributions of this work are as follows. Firstly, the proposed background modeling method is performed online and is efficient enough for real-time application. Meanwhile, the background model constructed by our method is more precise than that of the existing methods. Secondly, a dual-mode judge mechanism is introduced to strengthen the system’s adaptation to different situations. Thirdly, we redefine two evaluation metrics to make them more convincing and demonstrate the effectiveness of our method through comprehensive experiments.

The remainder of this paper is organized as follows. Section II reviews the related work. Our detection framework based on optical flow is introduced in detail in Section III, and its effectiveness is verified in Section IV by comprehensive experiments. Finally, Section V is devoted to conclusions.

II Related Work

In this section, we review recent algorithms for moving object detection, grouped into three main categories: Gaussian model based, optical flow model based and optical flow gradient based.

Gaussian Model Based. The method proposed in [22] used a Dual-Mode Single Gaussian Model (SGM) to model the background at grid level, and utilized homography matrices between consecutive frames to accomplish motion compensation by mixing models. Foreground was identified by estimating the feature’s conformity to the corresponding SGM. Benefiting from the Dual-Mode SGM, the method can reduce the foreground’s pollution of the background models. Analogously, Yun and Jin [23] and Kurnianggoro et al. [11] used a foreground probability map and simple pixel-level background models, respectively, to fine-tune the result obtained in [22]. The method in [13] is based on SGM and interpolates a full covariance matrix of the pixel models to achieve motion compensation. The background models constructed and updated by these methods do not reflect the essence of the problem. They are sensitive to parameters and lack robustness to different scenes.

Optical Flow Model Based. Kurnianggoro et al. [10] modeled the background using zero optical flow vectors instead. After a homography matrix was used to align the previous frame, dense optical flow was estimated between the aligned result and the current frame. Finally, a simple optical-flow magnitude threshold was used to judge out the foreground points. As the homography matrices are only used for aligning, the background model and the judge mechanism constructed by this method are too simple to deal with intricate unconstrained scenes.

Fig. 1: Visualization of the moving object detection framework.

Optical Flow Gradient Based. Some other methods do not depend on any background model. They construct the contour of the foreground by detecting large-gradient points in the dense optical flow field. For example, Li and Xu [12] performed mathematical morphology operations on the initial contours to obtain closed boundaries. After that, the maximal contour area was selected as the area of the moving object. This simple framework is easy to run but limits the method to simple scenes. Papazoglou and Ferrari [17] combined the optical flow’s gradient and direction to generate a better contour. Then, an efficient inside-outside maps algorithm was performed to obtain an initial estimate of the foreground points, which was finally fine-tuned by global optimization. The shortcoming is that the inside-outside maps algorithm obtains reasonable results only in simple scenes that contain a single object. Moreover, the optimization makes it inefficient.

III Methodology

The framework of our online detection method for moving objects in dynamic scenes is shown in Fig. 1. There are mainly three processes: optical flow estimating, background modeling and foreground extracting. In the following, each step of the framework is introduced in detail.

III-A Optical Flow Estimating

Taking into account speed and accuracy, FlowNet2.0 [7] is used to estimate the optical flow vectors, which project 2D locations in one frame to their locations in a specified second frame. The optical flow vectors are used as the main feature in the following procedures.

III-B Background Modeling

In our framework, the background model is constructed in the form of optical flow utilizing homography matrices. To obtain the homography matrices, we establish equation (1), which expresses that two transformations have the same effect: transforming via the homography matrix and transforming via the optical flow.


As a homography matrix contains eight free variables and each point pair provides two constraints, at least four different sample points are needed.
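The relation behind equation (1) can be sketched as follows. The notation is an assumption introduced here for illustration and is not taken from the original: $(x, y)$ is a pixel location, $(u, v)$ its estimated optical flow, $H$ the $3 \times 3$ homography with the common normalization $h_{33} = 1$, and $\lambda$ an unknown projective scale.

```latex
\begin{align}
\lambda
\begin{pmatrix} x + u \\ y + v \\ 1 \end{pmatrix}
&= H \begin{pmatrix} x \\ y \\ 1 \end{pmatrix},
\qquad
H = \begin{pmatrix}
h_{11} & h_{12} & h_{13} \\
h_{21} & h_{22} & h_{23} \\
h_{31} & h_{32} & 1
\end{pmatrix} \\
% Eliminating \lambda = h_{31} x + h_{32} y + 1 gives two linear
% equations per sample point in the eight unknown entries of H:
(x + u)\,(h_{31} x + h_{32} y + 1) &= h_{11} x + h_{12} y + h_{13} \\
(y + v)\,(h_{31} x + h_{32} y + 1) &= h_{21} x + h_{22} y + h_{23}
\end{align}
```

Under this sketch, four points in general position give eight equations for the eight unknowns, and using more points turns the system into a least squares problem.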

The least squares solution of equation (1) is computed and refined by RANSAC [2] to obtain a more reliable result. The number of sample points and the number of iterations used for RANSAC are chosen to provide a high ideal success rate, under the assumption that the background occupies more than half of the image. To improve the efficiency of the RANSAC algorithm, the sample points are drawn sparsely over the 2D image plane. Specifically, the image is partitioned into 16 pieces, a subset of them is randomly selected, and one point is then randomly chosen inside each selected piece. Finally, the ideal background model is constructed in the form of optical flow, calculated by the following equation:


Equation (2) gives the ideal optical flow vector of each background point.
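The sparse sampling and the background flow described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' Matlab implementation: the 16 pieces are taken to be a 4 x 4 grid, four pieces are selected per draw, and the function names `sample_grid_points` and `background_flow` are hypothetical.

```python
import random

def sample_grid_points(width, height, n_points=4, grid=4, rng=random):
    """Pick n_points pixels, each from a different cell of a grid x grid
    partition of the image (ASSUMPTION: 4 x 4 cells, 4 points per draw)."""
    cells = rng.sample(range(grid * grid), n_points)  # distinct cells
    cell_w, cell_h = width / grid, height / grid
    points = []
    for c in cells:
        row, col = divmod(c, grid)
        x = (col + rng.random()) * cell_w  # uniform position inside the cell
        y = (row + rng.random()) * cell_h
        points.append((x, y))
    return points

def background_flow(H, x, y):
    """Flow that a static background pixel at (x, y) would show if the
    frame-to-frame motion were exactly the homography H (3x3 nested lists)."""
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    # Warped position minus original position is the ideal flow vector.
    return (xh / w - x, yh / w - y)
```

For a pure translation, `background_flow([[1, 0, 2], [0, 1, -1], [0, 0, 1]], 10, 20)` returns the same vector `(2.0, -1.0)` at every pixel, whereas a zooming homography produces flow that grows with distance from the zoom center.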

III-C Foreground Extracting

Subsequently, based on the background model, we judge out the foreground points by utilizing a dual-mode judge mechanism. Under normal conditions, we apply an adaptive threshold to the difference between the ideal background optical flow and the actual optical flow, and obtain a foreground mask as described in (3):


Here the difference is measured by the 2-norm of the difference vector. The adaptive threshold is defined as:


The threshold has two hyper-parameters that control its magnitude. One is a static component that accounts for the instability caused by the sensor’s resolution or the optical flow’s precision. The other introduces a dynamic component, so that a higher threshold is used when the sensor moves fast; the magnitude of two of the homography matrix elements linearly reflects the speed of the sensor’s motion.
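The magnitude-based judging can be sketched as below. The exact form of the adaptive threshold is not available here, so `adaptive_threshold` uses an assumed linear form (static part plus a gain times a speed proxy); both function names and all parameter names are hypothetical.

```python
import math

def adaptive_threshold(static_part, dynamic_gain, sensor_speed):
    # ASSUMPTION: linear form. sensor_speed stands for a non-negative
    # speed proxy derived from the homography; its definition is not
    # specified in this sketch.
    return static_part + dynamic_gain * sensor_speed

def magnitude_mask(flow, bg_flow, threshold):
    """flow and bg_flow are lists of rows of (u, v) tuples. A pixel is
    marked foreground (1) when the 2-norm of the difference between its
    estimated flow and its ideal background flow exceeds the threshold."""
    mask = []
    for row_f, row_b in zip(flow, bg_flow):
        mask.append([
            1 if math.hypot(fu - bu, fv - bv) > threshold else 0
            for (fu, fv), (bu, bv) in zip(row_f, row_b)
        ])
    return mask
```

A faster sensor motion raises the threshold, so small residuals caused by flow estimation noise are less likely to be marked as foreground.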

It is reasonable to obtain moving foregrounds in this way in most situations, except when the scene change contains an evident zooming component. In this special situation, the spatial distribution of the background optical flow is unbalanced in amplitude, which causes an unbalanced spatial distribution of the difference between the ideal optical flow and the true optical flow, as shown in Fig. 2(b). Thus a single image-wide threshold calculated by the aforementioned method is not effective for judging out the foreground properly.

(a) The mixture optical flow field.
(b) The intensity map of .
(c) The result obtained by (3).
(d) The ideal background optical flow field.
(e) The intensity map of .
(f) The result obtained by (7).
Fig. 2: Illustration for the evident zooming situation. In Fig. 2(d), the white points denote the sampling background points and the red one denotes the intersection point.

We first detect this situation by analyzing the direction and amplitude of the background optical flow. As shown in Fig. 2(d), under the evident zooming situation, the directions of the background points’ optical flow intersect at a common vanishing point inside the image, and the variation of the optical flow amplitudes in the background area exceeds a certain threshold. In this work, we use the sample points that were used to calculate the homography matrix as representatives of the background points. With these background points, the coordinates of the vanishing point are calculated by geometrical analysis and least squares regression:



Here the product is an element-wise matrix multiplication. Then a judgment indicator is defined as:


The coordinate ranges in (6) are those inside the image. For the indicator, a value of 1 indicates that there is an evident zooming component, whereas a value of 0 indicates that there is not.
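One way to realize the vanishing point estimate and the indicator is sketched below. It is an assumed formulation, not necessarily the one in equations (5) and (6): each sample point and its flow direction define a line, the point closest to all lines in the least squares sense is found from a 2 x 2 normal system, and the amplitude variation is measured as maximum minus minimum flow magnitude. The names `vanishing_point`, `zoom_indicator` and `spread_threshold` are hypothetical.

```python
import math

def vanishing_point(points, flows):
    """Least-squares intersection of the lines that pass through each
    sample point along its flow direction. Returns None if the flows are
    (nearly) parallel, i.e. there is no finite intersection."""
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (px, py), (u, v) in zip(points, flows):
        norm = math.hypot(u, v)
        if norm == 0.0:
            continue  # a zero flow vector carries no direction
        nx, ny = -v / norm, u / norm   # unit normal of the flow direction
        c = nx * px + ny * py          # line: nx * x + ny * y = c
        a11 += nx * nx
        a12 += nx * ny
        a22 += ny * ny
        b1 += nx * c
        b2 += ny * c
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        return None
    # Solve the symmetric 2x2 normal equations by Cramer's rule.
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)

def zoom_indicator(points, flows, width, height, spread_threshold):
    """1 if the flow lines meet inside the image and the flow magnitudes
    vary by more than spread_threshold (ASSUMED measure: max - min)."""
    vp = vanishing_point(points, flows)
    if vp is None:
        return 0
    inside = 0 <= vp[0] < width and 0 <= vp[1] < height
    mags = [math.hypot(u, v) for u, v in flows]
    return 1 if inside and max(mags) - min(mags) > spread_threshold else 0
```

For a pure translation all flow lines are parallel, the normal system is singular, and the indicator is 0, which matches the intent that the magnitude-based mode handles that case.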

The ensuing question is how to properly judge out the foreground points when the magnitude threshold loses efficacy. Because the direction of the true optical flow closely matches that of the ideal optical flow in the background area, as shown in Fig. 2(e), the foreground is extracted by a different judge mechanism:


In Equation (7), the cosine of the angle between the true optical flow and the ideal background optical flow is calculated and compared with a threshold to judge out the foreground points. As shown in Fig. 2(c) and Fig. 2(f), judging by direction is more effective in situations with evident zooming than judging by magnitude.
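The direction-based mode can be sketched as follows. This is an illustration under assumptions: the function name `direction_mask` and the parameter `cos_threshold` are hypothetical, and pixels where either flow vector is zero are treated as background because no angle is defined there.

```python
import math

def direction_mask(flow, bg_flow, cos_threshold):
    """flow and bg_flow are lists of rows of (u, v) tuples. A pixel is
    marked foreground (1) when the cosine of the angle between its
    estimated flow and its ideal background flow is below cos_threshold."""
    mask = []
    for row_f, row_b in zip(flow, bg_flow):
        out = []
        for (fu, fv), (bu, bv) in zip(row_f, row_b):
            denom = math.hypot(fu, fv) * math.hypot(bu, bv)
            # ASSUMPTION: an undefined angle counts as perfect agreement.
            cos = (fu * bu + fv * bv) / denom if denom > 0 else 1.0
            out.append(1 if cos < cos_threshold else 0)
        mask.append(out)
    return mask
```

Because the cosine ignores vector length, the spatially unbalanced flow amplitudes that appear under zooming do not affect this test.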

Input: two frames of the video
Estimate the optical flow between the two frames;
Establish equation (1) and solve it to obtain a homography matrix;
Obtain the background model using (2);
Detect the evident zooming situation using (6);
if no evident zooming is detected then
    extract the foreground mask using (3)
else
    extract the foreground mask using (7)
end if
Output: foreground mask
Algorithm 1 Moving Object Detection

IV Experiment

The proposed method is implemented in Matlab and thoroughly tested on ten video sequences captured by unconstrained cameras: Playground1 (PG1), Playground2 (PG2), Skating1 (SK1), Skating2 (SK2), Walking (WK), Car1, Car2, Horse, Train and Highway (HW). Detecting moving objects in these sequences is challenging due to camera movement, irregular object movement, varying object appearance, bad weather and many other reasons. The PG1, PG2, SK1 and WK sequences are from [24]; SK2, Train and HW are from [20]; Car2 and Horse are from [16]; Car1 is from [21] and was annotated by us.

Fig. 3: Qualitative results on some key frames from different videos. From top to bottom: Car1, SK1, PG1, PG2, WK, SK2, Car2, Horse, Train and HW. The first column shows input images, and the other columns show the ground truth and the results of the compared methods: (a) input image, (b) ground truth, (c) MCD5.8ms [22], (d) Stochastic Approx (SA) [13], (e) SCBU [24] and (f) our method.

IV-A Evaluation Metrics

In this subsection, we redefine two metrics for evaluating method performance: F-Measure (FM) and Success Rate (SR). Given true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN), F-Measure is defined as the harmonic mean of Precision (Pr) and Recall (Re) by:


Here precision and recall are computed per frame, and the frame index runs over all frames of the video. F-Measure ranges from 0 to 1, where a value of 1 indicates that the prediction totally agrees with its ground truth, whereas a value of 0 indicates total disagreement.
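The effect of averaging frame-level scores can be illustrated with a short Python sketch. The function names are hypothetical, the pooled variant is included only as an assumed contrast (counts accumulated over the whole video before scoring), and returning 0 for a frame without true positives is a choice made for this sketch.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall from pixel counts."""
    if tp == 0:
        return 0.0  # ASSUMPTION: no true positives scores zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def video_f_measure(frame_counts):
    """Frame-level F-Measure averaged over all frames of one video.
    frame_counts is a list of (tp, fp, fn) tuples, one per frame."""
    scores = [f_measure(tp, fp, fn) for tp, fp, fn in frame_counts]
    return sum(scores) / len(scores)

def pooled_f_measure(frame_counts):
    """Contrast case: counts are summed over the video first, so frames
    with large foregrounds dominate the score."""
    tp = sum(c[0] for c in frame_counts)
    fp = sum(c[1] for c in frame_counts)
    fn = sum(c[2] for c in frame_counts)
    return f_measure(tp, fp, fn)
```

With one well-detected large foreground and one poorly detected small one, `[(900, 50, 50), (5, 10, 10)]`, the frame-averaged score is about 0.64 while the pooled score stays near 0.94, which shows how small-foreground frames gain weight under the redefined metric.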

Success Rate is used to observe more intuitively how well the method detects foreground. Given the frame-level F-Measure values of all images in a video sequence and an F-Measure threshold, Success Rate (SR) is defined as:


The threshold ranges from 0 to 1, and varying it produces a curve as shown in Fig. 4. Methods whose success rate curves lie closer to the top of the plot have a higher detection success rate, and those whose curves lie closer to the right have higher detection quality.
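A success rate curve of this kind can be computed as sketched below. The strict comparison (a frame succeeds when its F-Measure is greater than the threshold) and the function names are assumptions of this sketch.

```python
def success_rate(frame_scores, threshold):
    """Fraction of frames whose F-Measure exceeds the threshold
    (ASSUMPTION: strict inequality)."""
    hits = sum(1 for s in frame_scores if s > threshold)
    return hits / len(frame_scores)

def success_curve(frame_scores, steps=11):
    """(threshold, success rate) pairs for thresholds evenly spaced
    between 0 and 1, suitable for plotting."""
    thresholds = [i / (steps - 1) for i in range(steps)]
    return [(t, success_rate(frame_scores, t)) for t in thresholds]
```

The curve is non-increasing in the threshold, so a method that keeps a high success rate at large thresholds produces both many successful frames and high per-frame quality.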

IV-B Parameter Settings

Optical flow is estimated between frame and frame , which means that the interval was set as . For the RANSAC algorithm, the number of sample points and the number of iterations are set as in Section III-B. For foreground judging, the parameters are set as follows: , , and .

IV-C Qualitative Comparisons

Our method is compared with the following state-of-the-art methods for moving object detection under an unconstrained camera: MCD5.8ms [22], SA [13] and SCBU [24]. Fig. 3 shows the qualitative results on some key frames from the experimental video sequences.

The qualitative comparison intuitively shows the proposed method’s adaptability to different challenges relative to the other methods. As shown in the SK2 and Train sequences, the SGM based methods MCD5.8ms and SCBU perform poorly when the foreground color is similar to the background. In the PG1, PG2 and SK2 sequences, SA cannot deal with the challenges of slow motion and dynamic background. Benefiting from the optical flow based model and the dual-mode judge mechanism, the proposed method produces higher-quality foreground in all these scenes. However, the results on the PG1, PG2, Car2 and HW sequences show that our method is sensitive to shadow, which leads to some false positives and has a negative influence on the quantitative results.

Method     PG1    PG2    SK1    WK     SK2    Car1   Car2   Horse  Train  HW     AVG
MCD5.8ms   0.356  0.546  0.821  0.729  0.595  0.618  0.260  0.672  0.260  0.433  0.529
SA         0.123  0.276  0.384  0.774  0.475  0.606  0.405  0.788  0.257  0.668  0.476
SCBU       0.543  0.679  0.759  0.672  0.632  0.356  0.120  0.633  0.127  0.459  0.506
Ours       0.550  0.653  0.643  0.677  0.827  0.887  0.733  0.909  0.867  0.724  0.747
TABLE I: Pixel-wise F-Measure results for all ten videos. AVG denotes the average over videos.

IV-D Quantitative Comparisons

We also quantitatively compare our method with the state-of-the-art methods MCD5.8ms [22], SA [13] and SCBU [24]. As shown in Table I, the pixel-wise F-Measure of each video sequence is calculated, and the average over videos is given in the last column. While the existing methods perform poorly in some situations, the proposed method performs steadily across all the different challenges. It is noteworthy that in the PG1 video sequence, the pixel-wise F-Measure results of all methods are rather low. We analyzed the masks output by each method and found that PG1 contains a very tiny object with rather slow motion in the first frames, which is rather difficult for the systems to detect. Comparing our result with that in [24], the redefined evaluation metrics reflect the performance of the methods more properly.

Fig. 4 illustrates the average success rate plots of these compared methods. The proposed optical flow based method outperforms other methods both in success rate and in quality. Given a specific threshold value of , the success rate of the proposed method is , which is high enough for practical applications.

Fig. 4: Relation curves of F-Measure threshold and Success rate.

IV-E Efficiency

We measured the computation time per frame to evaluate the efficiency of the proposed method. Table II shows the computation time measured in Matlab on an Intel Core i5-7400 3.0GHz PC with video at a resolution of . As shown in the result, the optical flow estimation process accounts for more than eight tenths of the total time consumption, and the foreground extraction process takes relatively little time.

Process    OE    MD   total
Time(ms)   123   16   139
TABLE II: Time consumption of the proposed method, in ms per frame. OE denotes Optical Flow Estimation and MD denotes Moving Object Detection. The OE result is quoted from [7].

Further experiments show that the time consumption of the foreground extraction process is linearly correlated with the number of RANSAC iterations. Fig. 5 illustrates the relation between iterations and time consumption, as well as the relation between iterations and the ideal success rate of finding the correct background model. When the iterations increase to , the success rate is already above 0.9 and its growth has clearly slowed. So it is reasonable to set the iterations around .
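The shape of the success rate curve follows from the standard RANSAC success probability, sketched below. The formula assumes independently drawn sample points; the function name and the example values (background ratio 0.5, four sample points, 40 iterations) are illustrative assumptions and not the settings reported in this paper.

```python
def ransac_success(inlier_ratio, sample_size, iterations):
    """Probability that at least one of the random draws consists only
    of inliers (here: background points), assuming independent draws."""
    all_inlier_draw = inlier_ratio ** sample_size
    return 1.0 - (1.0 - all_inlier_draw) ** iterations

# Illustrative values only: half of the image is background, each draw
# uses four points, and 40 draws are made.
p = ransac_success(0.5, 4, 40)  # roughly 0.92
```

Each extra iteration multiplies the failure probability by the same factor, so the gain per iteration shrinks as the success rate approaches 1, while the time cost keeps growing linearly.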

Fig. 5: Illustration of RANSAC algorithm iterations’ effect. The red curve shows the success rate related to iterations and the blue curve shows the time consumption.

V Conclusion

We propose an optical flow based framework for real-time moving object detection in unconstrained scenes. The background model is constructed in the form of optical flow utilizing homography matrices, and a dual-mode judge mechanism is introduced to improve the system’s adaptation to different situations. In the experiments, two evaluation metrics are redefined to reflect the performance of the methods more properly. The quantitative and qualitative results obtained by our framework outperform the state-of-the-art methods, indicating the advantages of optical flow based methods. Finally, the precision and frame rate of the optical flow estimation algorithm are prerequisites for the success of our framework. As optical flow estimation algorithms develop, the performance of our framework will improve correspondingly.


References

  • [1] Cui X, Huang J, Zhang S, et al. Background Subtraction Using Low Rank and Group Sparsity Constraints[C]// European Conference on Computer Vision. Springer-Verlag, 2012:612-625.
  • [2] Fischler M A, Bolles R C. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography[M]. ACM, 1981.
  • [3] Girshick R, Donahue J, Darrell T, et al. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation[C]// IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2014:580-587.
  • [4] Guyon C, Bouwmans T, Zahzah E H. Robust Principal Component Analysis for Background Subtraction: Systematic Evaluation and Comparative Analysis[M]// Principal Component Analysis. InTech, 2012.
  • [5] Haines T S F, Xiang T. Background Subtraction with Dirichlet Processes[M]// Computer Vision – ECCV 2012. Springer Berlin Heidelberg, 2012:99-113.
  • [6] Huang J, Huang X, Metaxas D. Learning with dynamic group sparsity[C]// IEEE, International Conference on Computer Vision. IEEE, 2009:64-71.
  • [7] Ilg E, Mayer N, Saikia T, et al. FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks[J]. 2016:1647-1655.
  • [8] Keuper M, Andres B, Brox T. Motion Trajectory Segmentation via Minimum Cost Multicuts[C]// IEEE International Conference on Computer Vision. IEEE, 2015:3271-3279.
  • [9] Kim W, Kim C. Background Subtraction for Dynamic Texture Scenes Using Fuzzy Color Histograms[J]. IEEE Signal Processing Letters, 2012, 19(3):127-130.
  • [10] Kurnianggoro L, Shahbaz A, Jo K H. Dense optical flow in stabilized scenes for moving object detection from a moving camera[C]// International Conference on Control, Automation and Systems. IEEE, 2016:704-708.
  • [11] Kurnianggoro L, Wahyono, Yu Y, et al. Online Background-Subtraction with Motion Compensation for Freely Moving Camera[J]. 2016.
  • [12] Li X, Xu C. Moving object detection in dynamic scenes based on optical flow and superpixels[C]// IEEE International Conference on Robotics and Biomimetics. IEEE, 2016:84-89.
  • [13] López-Rubio F J, López-Rubio E. Foreground detection for moving cameras with stochastic approximation[J]. Pattern Recognition Letters, 2015, 68:161-168.
  • [14] Lucas B D, Kanade T. An iterative image registration technique with an application to stereo vision[C]// International Joint Conference on Artificial Intelligence. Morgan Kaufmann Publishers Inc. 1981:674-679.
  • [15] Mukherjee D, Wu Q M J. Real-time Video Segmentation Using Student’s t Mixture Model[J]. Procedia Computer Science, 2012, 10:153-160.
  • [16] Ochs P, Malik J, Brox T. Segmentation of Moving Objects by Long Term Video Analysis[J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2014, 36(6):1187-1200.
  • [17] Papazoglou A, Ferrari V. Fast Object Segmentation in Unconstrained Video[C]// IEEE International Conference on Computer Vision. IEEE, 2014:1777-1784.
  • [18] Shi J, Malik J. Motion segmentation and tracking using normalized cuts[M]. University of California at Berkeley, 1997.
  • [19] Tomasi C. Detection and tracking of point features[J]. Technical Report, 1991, 91(21):9795-9802.
  • [20] Wang Y, Jodoin P M, Porikli F, et al. CDnet 2014: An Expanded Change Detection Benchmark Dataset[C]// Computer Vision and Pattern Recognition Workshops. IEEE, 2014:393-400.
  • [21] Wu Y, Lim J, Yang M H. Object Tracking Benchmark[J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2015, 37(9):1834-1848.
  • [22] Yi K M, Yun K, Kim S W, et al. Detection of Moving Objects with Non-stationary Cameras in 5.8ms: Bringing Motion Detection to Your Mobile Device[C]// Computer Vision and Pattern Recognition Workshops. IEEE, 2013:27-34.
  • [23] Yun K, Jin Y C. Robust and fast moving object detection in a non-stationary camera via foreground probability based sampling[C]// IEEE International Conference on Image Processing. IEEE, 2015:4897-4901.
  • [24] Yun K, Lim J, Jin Y C. Scene Conditional Background Update for Moving Object Detection in a Moving Camera[J]. Pattern Recognition Letters, 2017, 88.
  • [25] Zhang S, Zhang W, Ding H, et al. Background modeling and object detecting based on optical flow velocity field[J]. Journal of Image & Graphics, 2011, 16(2):236-243.
  • [26] Zhou X, Yang C, Yu W. Moving Object Detection by Detecting Contiguous Outliers in the Low-Rank Representation[J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2013, 35(3):597-610.