Improving Multiple Object Tracking with Optical Flow and Edge Preprocessing

01/29/2018 ∙ by David-Alexandre Beaupré, et al. ∙ Corporation de l'ecole Polytechnique de Montreal

In this paper, we present a new method for detecting road users in an urban environment which leads to an improvement in multiple object tracking. Our method takes as an input a foreground image and improves the object detection and segmentation. This new image can be used as an input to trackers that use foreground blobs from background subtraction. The first step is to create foreground images for all the frames in an urban video. Then, starting from the original blobs of the foreground image, we merge the blobs that are close to one another and that have similar optical flow. The next step is extracting the edges of the different objects to detect multiple objects that might be very close (and be merged in the same blob) and to adjust the size of the original blobs. At the same time, we use the optical flow to detect occlusion of objects that are moving in opposite directions. Finally, we make a decision on which information we keep in order to construct a new foreground image with blobs that can be used for tracking. The system is validated on four videos of an urban traffic dataset. Our method improves the recall and precision metrics for the object detection task compared to the vanilla background subtraction method and improves the CLEAR MOT metrics in the tracking tasks for most videos.




I Introduction

Object detection is a fundamental task in the field of computer vision. It is a necessary step in traffic surveillance in order to collect traffic data and analyze road user behavior. It is used to extract image regions that correspond to the objects of interest. Improving the detection of cars, cyclists and pedestrians can help to improve another important task, which is multiple object tracking (MOT). Many trackers, for instance Urban Tracker (UT) [1, 2] and Multiple Kernelized Correlation Filter Tracker (MKCF) [3], use foreground blobs from background subtraction as an input to track the objects in the video because these detections are generic and do not assume any prior classes. UT is a more complex tracker that uses feature points and a state machine to keep track of the different objects, while MKCF is a fast tracker with simpler data association for tracking multiple objects. Yet, both depend on the quality of background subtraction.

There are many problems with the images produced by background subtraction methods in an urban environment. The first one is foreground blob merging, which occurs when two road users occlude each other or are close to each other as in figure 1. Even if trackers have ways of dealing with these occlusions, we found that it is advantageous to explicitly detect the different occluding objects prior to tracking. Another problem of background subtraction methods is fragmentation, where a single object is split into multiple smaller blobs. Once again, we found that it is easier to explicitly merge the fragmented blobs than to let the tracker decide whether the multiple blobs are part of the same object. Another problem of background subtraction is caused by shadows, mainly pedestrians' shadows. Foreground blobs will often include the shadows of the objects, which can lead to a tracking box that is much larger and less precise than the one without shadows, or to merging different objects in the same blob. Our method is able to eliminate most of the unwanted shadows, which leads to a more precise detection. One last problem is that the background subtraction blobs are generally much bigger than the real size of the objects that we want to detect. Our proposed method is able to effectively adjust the size of the proposed blobs, which results in better recall in the detection metrics.

Our method uses background subtraction [4], optical flow [5] and edge processing in order to create a new binary image of foreground blobs. Background subtraction is used to locate the regions of interest (RoIs), which are the locations of the foreground blobs in the current frame. The dense optical flow, with a patch size of 8, is then computed for each blob in the given frame. The motion vectors of the different regions and the relative distance between the regions are compared to merge the blobs that are very likely to be fragmented regions of the same object. The optical flow computed at each frame (with the previous one) is also used to separate objects merged in the same blob that are moving in opposite directions. We use the Canny edge detector [6] on both the blobs of the source frame and the scene background image (see section III-A) to obtain the edges of the foreground objects (we want to eliminate background edges that might be in a foreground blob, e.g. road markings near pedestrians). This last step allows adjusting the size of the objects, separating close objects that appeared as one blob in background subtraction and eliminating noise. With all this information, our novel method generates a new binary image with processing steps that handle fragmentation and merging and remove noise while giving a more precise segmentation.

The organization of the paper is the following: in section II, we discuss related work. In section III, we present our new method consisting of the foreground image, merging of similar optical flow regions, separation of opposite flow regions, edge processing and creation of the new binary image. In section IV, we present our results and finally, in section V, we conclude this paper.

II Related work

Many methods can be used in order to extract the object RoIs in a given frame. Object proposal methods like [7, 8] can get good recall results given a large number of proposals. Also, these methods do not require the input to be a video since they propose boxes based on their "objectness". The downside of object proposal methods is the need to filter the thousands of initial proposals to extract the real objects in the frame, of which there are often fewer than twenty in tracking tasks, and to make sure that every object only has one bounding box. Keeping the best box around each object while maintaining high recall is difficult to achieve for tracking purposes.

Another method to extract RoIs is optical flow, as in [5, 9]. Optical flow is the process of computing the motion of every pixel between two consecutive frames. Grouping pixels with similar motion yields one blob of pixels for each object with a different motion. Thus, these methods are very good at detecting moving objects, but segmenting individual objects from a group can be more difficult, especially if they are moving in the same direction. In fact, two objects very close to one another will be considered part of the same motion flow blob since their flow vectors will be very similar. In addition, these methods cannot detect still objects. However, two close objects going in opposite directions are very easy to separate with optical flow methods, as stated earlier.

Recently, deep learning methods have achieved great results in object detection, as seen in [10, 11], while being able to make those detections almost in real time. However, these neural networks must be trained on every class we want them to detect, which can take up a lot of time and resources. They cannot detect objects from unexpected classes.

Finally, another traditional approach to obtain object RoIs is background subtraction, with methods such as ViBe [4] and SubSENSE [12]. In this case, RoIs are the results of the differences between the current frame and a background frame model. These methods can detect objects from any class. However, they can be sensitive to camera motion and shadows, and they cannot resolve merging caused by occlusion or proximity. They nonetheless remain very appealing for tracking in urban scenes because of the unknown variety of objects of interest these scenes may contain.

As mentioned above, we chose ViBe [4] to provide us with the initial RoIs. ViBe is a background subtraction method that keeps track of the past values of each pixel to determine if a pixel in the current frame is in the foreground or the background. For a given frame, every blob produced by ViBe is fed into our algorithm in order to improve the detection of objects.

Fig. 2: Diagram of the steps of our method

III Methodology

This section presents the different steps of our method as shown in figure 2. These steps are simple operations, using optical flow and edge analysis.

III-A Background image

The first operation is to accumulate a color background image from the video sequence. This will become useful in the edge processing step (see section III-D) because we will be able to filter out most of the background edges that may be included in foreground edges, e.g. road markings that are not of interest [13]. The background image is given by

$$B_t = (1 - \alpha)\, B_{t-1} + \alpha\, I_t$$

where $\alpha$ is an accumulation rate, $I_t$ is the input image at frame $t$ and $B_t$ is the background image after that frame. In the experiments, $\alpha = 0.01$, which means that each new frame has a weight of 0.01 in the running average and the mean image has a weight of 0.99.
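The running average described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a single-channel image stored as nested lists of floats (a color image would apply the same update to each channel), and the function name `update_background` is hypothetical.

```python
def update_background(background, frame, alpha=0.01):
    """Blend one frame into the running-average background, pixel by pixel.

    Each new frame contributes a weight of `alpha`, the accumulated
    background a weight of `1 - alpha` (0.01 and 0.99 in the experiments).
    """
    return [
        [(1.0 - alpha) * b + alpha * f for b, f in zip(bg_row, frame_row)]
        for bg_row, frame_row in zip(background, frame)
    ]


# Toy 2x2 example: a bright and a dark pixel barely move the background.
background = [[100.0, 100.0], [100.0, 100.0]]
frame = [[200.0, 100.0], [0.0, 100.0]]
background = update_background(background, frame)
```

With such a small accumulation rate, a road user that crosses a pixel for a few frames leaves almost no trace in the background, while permanent structures such as road markings accumulate.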

III-B Merging foreground blobs

The first step to process a frame is to check if we can merge any foreground blobs that satisfy the three conditions (C1), (C2) and (C3) presented below. If this is the case, a new blob is formed from the union of the blobs $F_i$ and $F_j$. The union operation in our case takes the smallest box that frames both blobs $F_i$ and $F_j$.

The first condition (C1) is given by

$$d(F_i, F_j) < T_d$$

where $d(F_i, F_j)$ is the minimum distance between the pixels of the two blobs $F_i$ and $F_j$. This distance must be smaller than a threshold $T_d$ that can be modified, but we found experimentally that a distance of 7 pixels is a good compromise because we need to merge objects of various sizes (car and pedestrian dimensions vary in different datasets).

The second condition is based on intervals of the magnitude of the optical flow of blobs, built as the mean magnitude $\mu$ plus or minus one standard deviation $\sigma$. The lower and upper bounds for the blobs $F_i$ and $F_j$ can be written as

$$l_k = \mu_k - \sigma_k, \qquad u_k = \mu_k + \sigma_k, \qquad k \in \{i, j\}$$

We can now define the domain of possible values for each blob as

$$D_k = [\, l_k, u_k \,], \qquad k \in \{i, j\}$$

The second condition (C2) is then given by

$$D_i \cap D_j \neq \emptyset$$

This condition verifies that both domains $D_i$ and $D_j$ have at least one value in common. The third condition (C3) is

$$|\theta_i - \theta_j| < T_\theta$$

This condition checks if the angles of the optical flow for the blobs $F_i$ and $F_j$ are approximately in the same direction. We take the angle $\theta$ of the optical flow at the center point of each blob for the comparison with the threshold $T_\theta$. With the value chosen for $T_\theta$, we sometimes merge (i.e. union operation) two foreground blobs that should not have been merged, but we prefer to err on the side of over-merging because it is possible to separate objects at a later stage of our method. At this step, we also save which RoIs were modified and store them in a map that will be used in section III-E. These new foreground blobs will be the new RoIs for the next steps.
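The three merging conditions can be sketched as follows. This is a hedged illustration under several assumptions that the text does not settle: distances are Euclidean, the standard deviation is the population one, the angle difference wraps around 360 degrees, and a blob is a plain dictionary with hypothetical keys `pixels`, `magnitudes` and `center_angle`. The angle threshold is left as a required argument; the value 30 in the example is arbitrary.

```python
import math
from statistics import mean, pstdev


def min_pixel_distance(pixels_a, pixels_b):
    """C1 helper: smallest distance between any pixel of A and any pixel of B."""
    return min(math.dist(p, q) for p in pixels_a for q in pixels_b)


def magnitude_interval(magnitudes):
    """C2 helper: [mean - std, mean + std] of the flow magnitudes of a blob."""
    mu, sigma = mean(magnitudes), pstdev(magnitudes)
    return mu - sigma, mu + sigma


def angle_difference(theta_a, theta_b):
    """C3 helper: smallest absolute difference between two angles in degrees."""
    diff = abs(theta_a - theta_b) % 360.0
    return min(diff, 360.0 - diff)


def should_merge(blob_a, blob_b, angle_threshold, distance_threshold=7.0):
    """Return True when the two blobs satisfy C1, C2 and C3."""
    c1 = min_pixel_distance(blob_a["pixels"], blob_b["pixels"]) < distance_threshold
    low_a, high_a = magnitude_interval(blob_a["magnitudes"])
    low_b, high_b = magnitude_interval(blob_b["magnitudes"])
    c2 = low_a <= high_b and low_b <= high_a  # the two intervals intersect
    c3 = angle_difference(blob_a["center_angle"], blob_b["center_angle"]) < angle_threshold
    return c1 and c2 and c3


def union_box(box_a, box_b):
    """Smallest (x_min, y_min, x_max, y_max) box framing both boxes."""
    return (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
            max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))


blob_a = {"pixels": [(0, 0), (0, 1)], "magnitudes": [2.0, 3.0], "center_angle": 10.0}
blob_b = {"pixels": [(0, 5)], "magnitudes": [2.5, 3.5], "center_angle": 350.0}
merge = should_merge(blob_a, blob_b, angle_threshold=30.0)
```

The brute-force pixel distance is quadratic in the blob sizes, which is acceptable for a sketch but would likely be replaced by a distance transform or a box-based test in practice.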

III-C Flow separation

The purpose of this step is to separate foreground blobs that contain two objects going in opposite directions. For each foreground blob $F$ in the image, we apply the k-means clustering algorithm to the optical flow vectors. We chose $k = 3$ because when there are two objects moving in opposite directions, the segmentation of these objects results in one cluster for the background, and the other two clusters as the distinct objects. We fit bounding boxes $b_1$, $b_2$ and $b_3$ around each of the three clusters. We can then compute the ratio, $r_{mn}$, of the intersection of the boxes over the smallest area of the two boxes as

$$r_{mn} = \frac{\operatorname{area}(b_m \cap b_n)}{\min\bigl(\operatorname{area}(b_m),\, \operatorname{area}(b_n)\bigr)}$$

Since there are only three boxes, this gives us three ratios $r_{mn}$, one for each pair $(b_m, b_n)$. We compare each $r_{mn}$ against a threshold of 0.40 and if the ratio is smaller, we check if the boxes $b_m$ and $b_n$ are going in opposite directions using the negation of the third condition (C3), expressed by equation 7. For a given blob $F$, we keep two bounding boxes ($b_m$ and $b_n$) if both conditions are met. If not, we simply fit a bounding box around the original foreground blob for this particular region and ignore $b_1$, $b_2$ and $b_3$. During this process, we save every RoI that has been split in two in a map that will be used in section III-E.
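The overlap test on the three cluster boxes can be illustrated as below. This sketch covers only the ratio and the 0.40 threshold; the k-means clustering and the direction check are left out. Boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples with exclusive upper bounds, and the function names are hypothetical.

```python
from itertools import combinations


def area(box):
    """Area of an (x_min, y_min, x_max, y_max) box; empty boxes have area 0."""
    x_min, y_min, x_max, y_max = box
    return max(0, x_max - x_min) * max(0, y_max - y_min)


def overlap_ratio(box_a, box_b):
    """Intersection area divided by the area of the smaller of the two boxes."""
    intersection = (max(box_a[0], box_b[0]), max(box_a[1], box_b[1]),
                    min(box_a[2], box_b[2]), min(box_a[3], box_b[3]))
    smallest = min(area(box_a), area(box_b))
    return area(intersection) / smallest if smallest > 0 else 0.0


def weakly_overlapping_pairs(boxes, threshold=0.40):
    """Index pairs whose overlap ratio is below the threshold.

    These pairs are the candidates that would then go through the
    opposite-direction check before a blob is actually split.
    """
    return [(m, n) for m, n in combinations(range(len(boxes)), 2)
            if overlap_ratio(boxes[m], boxes[n]) < threshold]


cluster_boxes = [(0, 0, 10, 10), (8, 0, 18, 10), (0, 0, 10, 5)]
candidates = weakly_overlapping_pairs(cluster_boxes)
```

Dividing by the smaller area rather than by the union means that a small box fully contained in a large one gets a ratio of 1, so it is never treated as a separate object at this step.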

Fig. 3: Some examples of our method (first row) and the ViBe algorithm (second row) with both trackers: (a), (b), (e) and (f) use MKCF while (c), (d), (g) and (h) use UT. The first column is an example of how our method is able to separate close objects in the Rene video. The second column shows how objects moving in opposite directions are easier to separate with our method on the Rouen video. The third column demonstrates how our method is able to keep whole objects in the case of occlusion by other static objects in the Sherbrooke video. The fourth column is an example of how our method gives smaller boxes around the pedestrians in the St-Marc video.

III-D Edge processing

This is where we use the background image created at the first step. For each foreground blob, we extract from the background image the pixels included in the blob and apply edge detection to form an edge representation of the same size as the blob. We do the same thing for the current image, forming a second edge representation of the same size. Both edge representations are obtained using the Canny edge detector [6]. Its two thresholds are derived from a parameter that was determined experimentally and from the median pixel value of the grayscale image, which is computed for each image.

With the two edge representations, one from the background image and one from the current image, we apply a logical xor operation to each pixel of the region. This eliminates the background edges from the foreground blob edges, leaving us with only the edges of the foreground. Moreover, this enables us to eliminate foreground blobs that were in fact background objects. This increases the precision of our method and better fits the size of the detection boxes to the objects.
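The xor step can be sketched on two small binary edge maps. This is an illustration only: the maps are nested lists of 0/1 values rather than real Canny outputs, and discarding a blob when no edge pixel survives is an assumed reading of how background objects are eliminated. Both function names are hypothetical.

```python
def foreground_edges(background_edges, frame_edges):
    """Pixel-wise xor of two binary edge maps of identical size."""
    return [
        [int(bool(b) != bool(f)) for b, f in zip(bg_row, frame_row)]
        for bg_row, frame_row in zip(background_edges, frame_edges)
    ]


def has_foreground_edges(edge_map):
    """True when at least one edge pixel survives the xor."""
    return any(any(row) for row in edge_map)


# Column 1 is a road marking present in both images,
# column 3 is an object edge present only in the current frame.
background_edges = [[0, 1, 0, 0],
                    [0, 1, 0, 0]]
frame_edges = [[0, 1, 0, 1],
               [0, 1, 0, 1]]
object_edges = foreground_edges(background_edges, frame_edges)
```

Because xor is symmetric, a background edge hidden by an object (present in the background map, absent in the frame map) also survives, which tends to fall inside the object region anyway.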

The edge processing step can also separate two or more objects that are in the same foreground blob. This operation will also separate blobs that should not have been merged previously. Once we have the foreground edge image of a blob, we group its edge pixels based on distance in order to detect whether there is more than one object in the blob. To do this, we choose a random edge pixel and find every connected edge pixel with a Manhattan distance of less than 3 pixels to form an edge group. We repeat this process until every edge pixel is a member of an edge group. When this is the case, we find the biggest bounding box for every edge group. The number of bounding boxes corresponds to the number of distinct objects in the foreground blob. Once again, we store each RoI that has been split into multiple regions in a map for the next step.
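The grouping of edge pixels can be sketched as a breadth-first search. The sketch assumes that "connected" means transitive chaining: a pixel joins a group if it lies at a Manhattan distance of less than 3 from any pixel already in the group. Edge pixels are `(x, y)` tuples, boxes use inclusive corners, and the function names are hypothetical.

```python
from collections import deque


def group_edge_pixels(edge_pixels, max_distance=3):
    """Split edge pixels into groups chained by Manhattan distance < max_distance."""
    remaining = set(edge_pixels)
    groups = []
    while remaining:
        seed = remaining.pop()  # arbitrary seed; the groups do not depend on it
        group, queue = [seed], deque([seed])
        while queue:
            x, y = queue.popleft()
            for dx in range(-max_distance + 1, max_distance):
                for dy in range(-max_distance + 1, max_distance):
                    neighbor = (x + dx, y + dy)
                    if abs(dx) + abs(dy) < max_distance and neighbor in remaining:
                        remaining.remove(neighbor)
                        group.append(neighbor)
                        queue.append(neighbor)
        groups.append(group)
    return groups


def bounding_box(pixels):
    """Box (x_min, y_min, x_max, y_max) enclosing every pixel of a group."""
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return (min(xs), min(ys), max(xs), max(ys))


edges = [(0, 0), (1, 1), (2, 1), (10, 10), (11, 10)]
boxes = sorted(bounding_box(g) for g in group_edge_pixels(edges))
```

Here the five pixels form two groups, so the blob would be reported as containing two objects.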

III-E Decision algorithm

At this point, the information we have consists of three maps: one from the merging of regions by optical flow (see section III-B), another from the separation of regions (see section III-C) and the last one from the edge analysis (see section III-D). This means that we have, in the best case, two box proposals for each RoI (from the separation map and the edge analysis map), but there can be more than that if any of the processing steps returned more than one box. We now present the algorithm that decides which boxes to keep for the final foreground image.

The first thing we check is whether an RoI has been modified by the flow separation step (see section III-C). If this is the case, we keep the two boxes returned by the optical flow because we are sure that the two objects were moving in opposite directions. The second check is whether an RoI has been modified both by the flow merging (see section III-B) and by the edges (see section III-D). If so, we keep the boxes from the edge processing, because if the merging was successful, the edges will give us one box around the object and if not, the edges will give us multiple boxes depending on the number of objects. The third check is on the number of boxes that the edge processing step returned. If this number is greater than or equal to four, we ignore them and simply keep the one box proposed by the optical flow. This is because it is more likely that the edges over-separated one single object into multiple ones, which would lead to a bad detection. The fourth check covers the situation where the edges proposed two boxes and the optical flow only one. This is the hardest case because we do not know if there are truly two objects in the RoI or if, for instance, the edge processing separated the shoes of a pedestrian from the rest of their body. We compute an area ratio from the boxes of both processes in order to make our decision. We keep the single box from the optical flow processing if the ratio is smaller than or equal to 0.65, a parameter determined experimentally. Otherwise, we keep both boxes from the edge processing step. Finally, in all the other situations, we simply favor the edge boxes over the ones from the optical flow because they tend to be smaller than their counterparts.
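The order of the checks can be summarized as a small decision function. This is a sketch of the control flow only, with hypothetical names throughout. Since the exact definition of the area ratio is not given here, the ratio is taken as an input that the caller must compute, and the three boolean flags stand in for lookups in the three maps.

```python
def choose_boxes(flow_boxes, edge_boxes, split_by_flow, merged_by_flow,
                 changed_by_edges, area_ratio=None, ratio_threshold=0.65):
    """Pick the box proposals to draw for one RoI, following the checks in order."""
    # Check 1: the flow separation found two objects moving in opposite directions.
    if split_by_flow:
        return flow_boxes
    # Check 2: the RoI was merged by flow and also modified by the edges.
    if merged_by_flow and changed_by_edges:
        return edge_boxes
    # Check 3: four or more edge boxes is treated as over-separation.
    if len(edge_boxes) >= 4:
        return flow_boxes
    # Check 4: two edge boxes against a single flow box, settled by the area ratio.
    if len(edge_boxes) == 2 and len(flow_boxes) == 1:
        if area_ratio is None:
            raise ValueError("area_ratio is required for the two-versus-one case")
        return flow_boxes if area_ratio <= ratio_threshold else edge_boxes
    # Default: edge boxes tend to be tighter than flow boxes.
    return edge_boxes


flow = [(0, 0, 10, 10)]
edges = [(0, 0, 4, 10), (6, 0, 10, 10)]
kept = choose_boxes(flow, edges, split_by_flow=False, merged_by_flow=False,
                    changed_by_edges=True, area_ratio=0.5)
```

Because the checks are evaluated in sequence, an RoI that was both merged by flow and split by the edges never reaches the area-ratio test.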

III-F New final foreground image

The process to create the new binary image is quite simple. We start by creating an image made only of zero-valued pixels. After that, we apply an xor operation between the image and the box proposals, which are represented by white pixels. This means that when two objects share an intersection, the pixels at the intersection become black, which leads to better detection inputs for the trackers as the objects are separated. The last operation is to increase the size of those intersections by one pixel in every direction (dilation operation) since it facilitates the segmentation for the trackers. Note that since the resulting object masks are combinations of bounding boxes, objects are only coarsely segmented.
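The construction of the final image can be sketched as follows. This is an assumption-laden illustration: boxes are `(x_min, y_min, x_max, y_max)` with exclusive upper bounds, white is 1 and black is 0, and the one-pixel dilation is implemented with a 3x3 neighborhood around every intersection pixel that the xor turned black. The function name `build_foreground` is hypothetical.

```python
def build_foreground(height, width, boxes):
    """Draw boxes with xor on a black image, then widen the black intersections."""
    image = [[0] * width for _ in range(height)]
    cover = [[0] * width for _ in range(height)]  # number of boxes over each pixel
    for x_min, y_min, x_max, y_max in boxes:
        for y in range(max(0, y_min), min(height, y_max)):
            for x in range(max(0, x_min), min(width, x_max)):
                image[y][x] ^= 1
                cover[y][x] += 1

    result = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            # Intersection pixels are covered by several boxes and black after xor.
            if cover[y][x] >= 2 and image[y][x] == 0:
                for ny in range(max(0, y - 1), min(height, y + 2)):
                    for nx in range(max(0, x - 1), min(width, x + 2)):
                        result[ny][nx] = 0
    return result


# Two boxes that overlap on the column x = 3.
mask = build_foreground(height=6, width=8, boxes=[(0, 0, 4, 4), (3, 0, 7, 4)])
```

In the example, the one-pixel-wide overlap becomes a three-pixel-wide black gap, so a connected-component step in a tracker sees two separate blobs.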

IV Results

In order to evaluate our proposed method, we used the publicly available UT dataset [1] containing four video sequences of urban mixed traffic. The videos contain pedestrians, cyclists and cars. There were multiple frames that were annotated in each sequence so we could test our method. The evaluation of our method was made in two steps. First, we compared the object detection performance of our method versus the original background subtraction method. Second, we showed how our method can improve the MKCF tracker [3], a tracker with a simple data association scheme, and the Urban Tracker (UT) [1], a tracker with a more complex data association scheme, when given the new foreground images compared to the ones produced by the ViBe method [4]. Our method improves object detection in all videos, and tracking results for most videos.

The code for our method can be downloaded from

IV-A Evaluation methodology

For the evaluation of our method for the object detection task, we used the Intersection over Union (IoU) metric between the detected bounding boxes and the ground-truth bounding boxes. Then, to evaluate our method for the tracking task, we used the tools provided with the Urban Tracker dataset [1]. These tools compute the CLEAR MOT [14] metrics. The multi-object tracking accuracy (MOTA) takes into account the false positives, the ID changes and the misses. The multi-object tracking precision (MOTP) measures the average precision of object matches at each instant. We evaluated our method with an IoU of 30 %. We decided not to use the classical IoU of 50 % because when evaluating with the trackers, most of the CLEAR MOT metrics were negative, as the videos are difficult. Also, when looking at the MKCF and UT papers, we found that they were using distances between the centroids of the boxes, and that the values of these distances were quite permissive. For instance, in the Rouen video, the distance threshold was 164 px, which is 20.5 % of the width and 27.3 % of the height of the video frame. This distance is generous in that objects moderately far away can still be considered matched and tracked. Also, the absolute distance does not consider the size of the objects. For example, there are cars and pedestrians in the Rouen video, so a distance of 164 px might be reasonable for cars, which are generally bigger than pedestrians, but not for pedestrians. By using an IoU of 30 %, we remain flexible for the tracking accuracy while considering the relative size of the different objects that are tracked. We ran the code of both trackers to obtain the results since we changed the evaluation metric. Results are thus different from the ones reported in their respective papers. We kept the default parameters for UT, but had to change the minimum blob size for two videos (Rene-Levesque and St-Marc) for the MKCF tracker.
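For reference, the two CLEAR MOT scores are usually written as below, following the definitions in [14]. The symbols are those of that reference rather than of this paper: at frame $t$, $m_t$ is the number of misses, $fp_t$ the number of false positives, $mme_t$ the number of mismatches (ID changes), $g_t$ the number of ground-truth objects, $c_t$ the number of matches and $d_t^{\,i}$ the matching score of match $i$, which would presumably be the box overlap in an IoU-based evaluation.

```latex
\mathrm{MOTA} = 1 - \frac{\sum_t \left( m_t + fp_t + mme_t \right)}{\sum_t g_t},
\qquad
\mathrm{MOTP} = \frac{\sum_{i,t} d_t^{\,i}}{\sum_t c_t}
```

Since the error counts in the numerator of MOTA are not bounded by the number of ground-truth objects, the score becomes negative as soon as the tracker makes more errors than there are objects to track, which is what can happen under a strict matching criterion on difficult videos.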

For the detection, we also used an IoU of 30 % because we wanted to remain consistent between our two evaluations. Even when we tested with an IoU of 50 %, our method had better recall and precision than the original background subtraction.
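The detection evaluation can be sketched as follows. The IoU computation is standard, but the greedy one-to-one matching between detections and ground-truth boxes is an assumption made for illustration and may differ from the evaluation tools actually used. Boxes are `(x_min, y_min, x_max, y_max)` tuples and the function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    inter_h = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    intersection = max(0, inter_w) * max(0, inter_h)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def precision_recall(detections, ground_truth, iou_threshold=0.30):
    """Greedily match each detection to its best unmatched ground-truth box."""
    matched = set()
    true_positives = 0
    for detection in detections:
        best_index, best_score = None, iou_threshold
        for index, truth in enumerate(ground_truth):
            score = iou(detection, truth)
            if index not in matched and score >= best_score:
                best_index, best_score = index, score
        if best_index is not None:
            matched.add(best_index)
            true_positives += 1
    precision = true_positives / len(detections) if detections else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
found = [(0, 0, 10, 8), (50, 50, 60, 60), (21, 21, 31, 31)]
precision, recall = precision_recall(found, truth)
```

Unlike a fixed centroid distance, the IoU criterion scales with object size: a box shifted by a few pixels can still match a car but not a pedestrian whose box is only a few pixels wide.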

IV-B Experimental results

For the detection task, the results can be found in table I. Our method shows improved results for both the precision and recall across all four videos of the Urban Tracker dataset. The most significant improvement is for the Rouen video, followed by Sherbrooke. This can be explained by the fact that a lot of objects are traveling in opposite directions in both of these videos. We are thus able to better separate objects.

| Dataset | Recall (ViBe) | Recall (Ours) | Precision (ViBe) | Precision (Ours) |
|---|---|---|---|---|
| Sherbrooke | 0.606 | **0.752** | 0.681 | **0.739** |
| Rene-Levesque | 0.812 | **0.855** | 0.612 | **0.654** |
| Rouen | 0.734 | **0.834** | 0.724 | **0.823** |
| St-Marc | 0.684 | **0.754** | 0.415 | **0.458** |

TABLE I: Object detection results of ViBe and our proposed method on the UT dataset. Precision and recall should be high. Boldface: best results

The quantitative results for the MKCF tracker and UT are presented in table II. For the Sherbrooke video sequence, we see that our method is able to improve both the MOTA and the MOTP for both trackers. The MOTA is increased significantly while the MOTP has a more modest improvement. This is due to the fact that the difficulty of this video sequence comes from the large number of cars moving in opposite directions. Our method is able to separate those objects with the optical flow and give an image segmented with each car individually, while the original background subtraction merges cars going in opposite directions in the same blob.

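| Dataset | MKCF MOTA (ViBe) | MKCF MOTA (Ours) | MKCF MOTP (ViBe) | MKCF MOTP (Ours) | UT MOTA (ViBe) | UT MOTA (Ours) | UT MOTP (ViBe) | UT MOTP (Ours) |
|---|---|---|---|---|---|---|---|---|
| Sherbrooke | 0.317 | **0.523** | 0.553 | **0.576** | 0.404 | **0.690** | 0.576 | **0.590** |
| Rene-Levesque | 0.334 | **0.424** | 0.5309 | **0.660** | 0.565 | **0.613** | 0.582 | **0.705** |
| Rouen | 0.501 | **0.629** | 0.582 | **0.600** | **0.696** | 0.670 | 0.617 | **0.620** |
| St-Marc | 0.463 | **0.534** | **0.652** | 0.651 | 0.638 | **0.653** | **0.691** | 0.682 |

TABLE II: Tracking results of MKCF and UT using ViBe and our detections on the UT dataset. MOTA and MOTP should be high. Boldface: best results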

The Rene-Levesque video sequence contains a large number of cars and the camera is far from the scene, which means that the objects of interest are all very small. We increase the MOTA of UT by about 5 percentage points (from 0.565 to 0.613) and the MOTP by about 12 percentage points (from 0.582 to 0.705). Our method is able to improve UT because it is able to separate adjacent cars and because the edge processing reduces the size of the boxes. This leads to more precise tracking. We were also able to improve both metrics with the MKCF tracker, but to do this, we had to reduce the minimum blob size parameter (100 pixels) in the tracker algorithm because, as mentioned earlier, our method reduces the size of the boxes and many proposed boxes were smaller than the threshold originally used by the algorithm. Thus, there was no tracker on some objects, which led to poor results. Using the new parameters, we were able to significantly improve the MOTA and MOTP. Note that the same parameters were used with ViBe.

For the Rouen video, we improve the results of the MKCF tracker in terms of both MOTA and MOTP, while we only improve the MOTP for UT. The difficulty of this dataset comes from the number of pedestrians crossing the street. Once again, these pedestrians are going in opposite directions, but there are also some who walk at the same speed and close to one another. These pedestrians are the hardest to detect individually. Our method, which can segment pedestrians going in opposite directions, performs well with the simpler MKCF tracker because it helps the tracker during occlusions. UT remains better with the original background subtraction in terms of MOTA because our method will sometimes merge two pedestrians going in the same direction.

For the St-Marc video, we also had to change the minimum blob size parameter (700 pixels) in the MKCF tracker for the same reasons as stated above. The MOTA was improved while the MOTP was slightly lower. The same holds for UT, where the MOTA was slightly improved and the MOTP decreased. The main challenge of this sequence is that there is a group of four pedestrians walking together. Our method was not able to consistently separate those four pedestrians, and this is why we are not able to improve the accuracy by a large margin.

V Conclusion

In this paper, we presented a new method capable of creating better foreground images which, in turn, was shown to improve the performance of two trackers (MKCF and UT). We start from a traditional background subtraction method to obtain our RoI and then, with the help of the optical flow and edge preprocessing, we are able to deal with the fragmentation caused by the background subtraction and effectively separate objects that are either too close to one another or objects that are going in opposite directions. This method improves both the recall and the precision in the object detection task when compared to the original foreground image. It also improves the CLEAR MOT metrics for both trackers for most of the tested videos.


  • [1] J. Jodoin, G. Bilodeau, and N. Saunier, “Urban tracker: Multiple object tracking in urban mixed traffic,” in IEEE Winter Conference on Applications of Computer Vision, Steamboat Springs, CO, USA, March 24-26, 2014, pp. 885–892.
  • [2] J. P. Jodoin, G. A. Bilodeau, and N. Saunier, “Tracking all road users at multimodal urban traffic intersections,” IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 11, pp. 3241–3251, Nov 2016.
  • [3] Y. Yang and G. Bilodeau, “Multiple object tracking with kernelized correlation filters in urban mixed traffic,” CoRR, vol. abs/1611.02364, 2016.
  • [4] O. Barnich and M. V. Droogenbroeck, “Vibe: A universal background subtraction algorithm for video sequences,” IEEE Trans. Image Processing, vol. 20, no. 6, pp. 1709–1724, 2011.
  • [5] T. Kroeger, R. Timofte, D. Dai, and L. V. Gool, “Fast optical flow using dense inverse search,” in Proceedings of the European Conference on Computer Vision (ECCV), 2016.
  • [6] J. Canny, “A computational approach to edge detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 1986.
  • [7] C. L. Zitnick and P. Dollár, “Edge boxes: Locating object proposals from edges,” in ECCV, 2014.
  • [8] M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. H. S. Torr, “BING: Binarized normed gradients for objectness estimation at 300fps,” in IEEE CVPR, 2014.
  • [9] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid, “Deepflow: Large displacement optical flow with deep matching,” in ICCV. IEEE Computer Society, 2013, pp. 1385–1392.
  • [10] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 91–99.
  • [11] J. Redmon and A. Farhadi, “Yolo9000: Better, faster, stronger,” arXiv preprint arXiv:1612.08242, 2016.
  • [12] P.-L. St-Charles, G.-A. Bilodeau, and R. Bergevin, “Subsense: A universal change detection method with local adaptive sensitivity,” IEEE Transactions on Image Processing, vol. 24, no. 1, pp. 359–373, 2015.
  • [13] R. Farah, J. Langlois, and G. Bilodeau, “Catching a rat by its edglets,” Image Processing, IEEE Transactions on, vol. 22, no. 2, pp. 668–678, 2013.
  • [14] K. Bernardin and R. Stiefelhagen, “Evaluating multiple object tracking performance: The clear mot metrics,” EURASIP Journal on Image and Video Processing, vol. 2008, no. 1, p. 246309, May 2008.