Object tracking in video signals using Compressive Sensing

02/08/2019 ∙ by Marijana Kracunov, et al.

Reducing the number of pixels in video signals while maintaining the quality needed to recover the trace of a moving object using Compressive Sensing is the main subject of this work. The quality of the frames of a video containing a moving object is gradually reduced by keeping a different number of pixels in each iteration, going from 45% down to 1% of the original pixels. The results were satisfactory and showed only minor changes in the trajectory graphs obtained from the original and reconstructed videos.




I Introduction

I-A Compressive Sensing

Thanks to the sparsity property that characterizes a considerable number of the signals that surround us, engineers are able to use compressive sensing algorithms and so make much better use of today's electronics. In order to understand Compressive Sensing (CS) [1]-[15], one must first comprehend a famous theorem dating back to 1949. That is when Claude Shannon, widely known as "the father of information theory", stated: "If a function contains no frequencies higher than W hertz, it is completely determined by giving its ordinates at a series of points spaced 1/(2W) seconds apart". Even though this is a fundamental principle in the field of information theory, the whole premise of CS lies in circumventing it. Shannon's theorem implies that the resolution of an image is proportional to the number of measurements: doubling the resolution calls for doubling the measurements.
But thanks to the development of CS, reconstructing super-resolved images and signals from far fewer measurements than previously deemed necessary is possible. Not only that: CS also offers hope for directly acquiring a compressed digital representation of a signal without first fully sampling it.
What makes CS so applicable is that it is not restricted to noiseless signals; with few or no alterations to the algorithm, it also covers real-world signals, which by default include noise.
The system inspected here is under-determined: the number of measurements taken is smaller than the number of unknown signal values, which yields a system with an infinite number of solutions. Fortunately, this problem is solvable if the signal is compressible. In order to sense a sparse object by taking as few measurements as possible, the best approach is to measure at randomly selected frequencies.
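Measuring at randomly selected frequencies amounts to keeping a random subset of DFT coefficients. A minimal illustration (the signal length, number of measurements, and sparse test signal are assumptions chosen for the sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 64, 16
x = np.zeros(N)
x[[5, 20, 41]] = [1.0, -2.0, 0.5]              # sparse signal with 3 non-zeros

freqs = rng.choice(N, size=M, replace=False)   # M randomly selected frequencies
b = np.fft.fft(x)[freqs]                       # the M compressive measurements

# Each measurement is the inner product of x with one row of the DFT matrix:
F = np.exp(-2j * np.pi * np.outer(freqs, np.arange(N)) / N)
print(np.allclose(b, F @ x))                   # → True
```

This makes the "inner products with linearly independent vectors" view of the measurements concrete: here the measurement matrix rows are randomly chosen rows of the DFT matrix.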


x ∈ R^N - discrete signal
b ∈ R^M - vector of linear measurements formed by taking inner products of x with a set of M linearly independent vectors a_i, i = 1, ..., M:

b = Ax  (1)

A - M×N matrix whose rows are the vectors a_i, with M < N.
Encoding is the process of obtaining b from x, and decoding is the process of recovering x from b.
Presuming that b is derived from a highly sparse signal, meaning one with very few non-zero elements, the best decoding approach is to look for the sparsest signal among all those that could produce the measurement b:

min ||x||_0  subject to  Ax = b  (2)

||x||_0 - the number of non-zeros in x.
Due to the complexity of solving this problem by enumeration, the ℓ0-norm is replaced by the ℓ1-norm:

min ||x||_1  subject to  Ax = b  (3)

If we are given a sparse vector x, we can write it down as a weighted linear combination

x = Σ_{i=1}^{N} s_i ψ_i  (4)

ψ_i - column (basis) vectors
s_i - Fourier coefficients
Only K of these coefficients are non-zero and the rest are zero, with K ≪ N, where N is the number of pixels. If the signal is sparse enough, then with the right matrix A both (2) and (3) can be solved for x. This is called recoverability.
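The ℓ1 problem (3) can be solved as a linear program by splitting x = u - v with u, v ≥ 0. A minimal sketch with SciPy (not the authors' implementation; the signal sizes and Gaussian sensing matrix are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import linprog

def l1_reconstruct(A, b):
    """Solve min ||x||_1 subject to Ax = b via the split x = u - v, u, v >= 0."""
    M, N = A.shape
    c = np.ones(2 * N)                        # objective: sum(u) + sum(v) = ||x||_1
    A_eq = np.hstack([A, -A])                 # equality constraint A(u - v) = b
    res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * N))
    u, v = res.x[:N], res.x[N:]
    return u - v

rng = np.random.default_rng(0)
N, M, K = 50, 25, 3                           # signal length, measurements, sparsity
x_true = np.zeros(N)
x_true[rng.choice(N, size=K, replace=False)] = rng.normal(size=K)
A = rng.normal(size=(M, N)) / np.sqrt(M)      # random Gaussian sensing matrix
b = A @ x_true
x_hat = l1_reconstruct(A, b)
```

With M = 25 random measurements of a length-50 signal having only 3 non-zeros, the ℓ1 minimizer coincides with the sparsest solution with high probability, which is exactly the recoverability property described above.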

Again, here we put the emphasis on the actual implementation of the CS algorithm on video recordings. Splitting the original video into frames and applying CS to every frame independently is in most cases time consuming. Some neater solutions can be seen in a project on this subject [1]. The key idea is to exploit the inter-frame correlation: a modification of the original Kalman filter incorporates motion prediction, which then feeds the CS step more beneficially.

I-B Object Tracking

Object tracking is the act of locating one or more moving objects over time and reconstructing the trajectory that the objects took [16]-[23]. The simplest known method is the block matching technique. While observing the current frame, we take a block of pixels into consideration. The idea is to estimate the position by comparing the blocks within a region containing the targeted object against the reference frame [2]. The results are better if the image region occupied by the target in the current frame is visibly different from the other parts of the frame. Problems that occur during object tracking are caused by several factors, such as a slow frame rate when the object is moving fast, interference caused by a cluttered background, etc. [3].
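The block matching step can be sketched as an exhaustive search minimizing the sum of absolute differences (SAD). This is a generic illustration, not the authors' code; the block size and search radius are assumptions:

```python
import numpy as np

def block_match(ref_block, frame, top, left, radius):
    """Find the displacement (dy, dx) within +/- radius that minimizes the SAD
    between ref_block and the same-size block of `frame` at (top+dy, left+dx)."""
    h, w = ref_block.shape
    best_sad, best_dxy = np.inf, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + h > frame.shape[0] or x + w > frame.shape[1]:
                continue                      # candidate block falls off the frame
            sad = np.abs(frame[y:y+h, x:x+w] - ref_block).sum()
            if sad < best_sad:
                best_sad, best_dxy = sad, (dy, dx)
    return best_dxy

rng = np.random.default_rng(0)
prev = rng.random((64, 64))
cur = np.roll(prev, shift=(3, -2), axis=(0, 1))   # content moved by (3, -2)
block = prev[20:28, 20:28]                        # 8x8 reference block
print(block_match(block, cur, 20, 20, 5))         # → (3, -2)
```

An exhaustive search is the simplest variant; practical trackers restrict it to a small region around the previous position, exactly as described above.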

In our study of movement, we obtained the position, velocity and acceleration of a ball from a video file recorded with a fixed camera. Detection is based on frame subtraction: a common background is subtracted from each frame, and the result is then processed. When the program works in 2D, we assume that all the movement, velocity and acceleration are contained in a vertical plane perpendicular to the camera. In 3D, the ball moves freely.
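A minimal sketch of the frame-subtraction detection described above, assuming grayscale frames as NumPy arrays; the threshold value is an assumption:

```python
import numpy as np

def detect_ball(frame, background, threshold=30.0):
    """Background subtraction followed by the centroid of above-threshold pixels."""
    diff = np.abs(frame.astype(float) - background.astype(float))
    ys, xs = np.nonzero(diff > threshold)
    if len(xs) == 0:
        return None                          # object not visible in this frame
    return xs.mean(), ys.mean()              # centroid (x, y)

background = np.zeros((100, 100))
frame = background.copy()
frame[40:50, 60:70] = 255.0                  # synthetic 10x10 "ball"
print(detect_ball(frame, background))        # → (64.5, 44.5)
```

Velocity and acceleration can then be obtained as finite differences of successive centroids scaled by the frame interval.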

II CS and Object Tracking

Here CS represents a helping hand for managing the videos on which object tracking (OT) will be performed: how low can we go on memory storage and still get a trace that is actually useful? Problems that still raise questions include a complex background and brightness changes, but among the most challenging are those involving a sudden and/or short 'disappearance' of the object, i.e. occlusion, in a region where the installed cameras have no information, the cause usually being the object 'hiding' behind another. This requires reconstructing the path from assumptions based on the object's positions before and after vanishing, as well as on speed and acceleration information.
The higher the frame rate, the better the results that can be expected. Characterization of the object is an extremely important component of any object tracking algorithm. In order to examine all this, we used CS on two different types of signals, one of which can be described as considerably sparser than the other.

Fig. 1: 4/75 still frames from original video
Fig. 2: 4/75 still frames from reconstructed video (1%)
Fig. 3: 4/75 still frames from reconstructed video (5%)

III Experimental Results

Experiments were conducted using two different videos of a bouncing ball. The Compressive Sensing algorithm used in this experiment requires the video to be converted to grayscale. The video is then split into image frames, and compressive sensing is applied to each frame. The compressed frames are then pieced together to form a new video. Finally, the trace of the ball in the video is reconstructed and plotted in 2D.
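The per-frame pipeline can be sketched as follows. This is a toy illustration on a synthetic video: the BT.601 grayscale weights and the random pixel mask stand in for the measurement step (keeping a given percentage of pixels), and the CS reconstruction itself is omitted:

```python
import numpy as np

def to_grayscale(frame_rgb):
    """Convert an RGB frame to grayscale using ITU-R BT.601 luma weights."""
    return frame_rgb @ np.array([0.299, 0.587, 0.114])

def keep_pixels(frame, fraction, rng):
    """Keep a random fraction of pixels (the retained measurements); zero the rest."""
    mask = rng.random(frame.shape) < fraction
    return frame * mask, mask

rng = np.random.default_rng(0)
video = rng.integers(0, 256, size=(75, 64, 64, 3)).astype(float)  # toy 75-frame video
gray = np.stack([to_grayscale(f) for f in video])                 # grayscale frames
sampled, mask = keep_pixels(gray[0], 0.05, rng)                   # keep 5% of frame 0
```

In the actual experiments each masked frame would then be passed to a CS reconstruction (such as the ℓ1 minimization of Section I-A) before the frames are reassembled into a video.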

Experiment I

The first video is much simpler, meaning it is far sparser than the other one. Due to the small number of nonzero coefficients, it serves as a good example for compressive sensing. Figure 1 shows a few still frames from the original video.
The duration of this video is 2.5 seconds and the frame rate is 24.32 frames per second. Unlike the experiment with the other video, which will be discussed later, the background of this video is blank. This is a huge advantage because the background will not cause any interference. As a result of the simplicity of the video, the frames are well reconstructed using 45%, 30%, 20%, 10%, 5% and even 1% of the available pixels. Figures 2 and 3 present a few reconstructed frames from 1% and 5% of the available pixels respectively, and figure 4 presents the difference between the trace of the object when the video is not compressed (red dots) and when only 1% (left) and 5% (right) of the pixels were taken (blue dots).

Fig. 4: Trajectory of the ball reconstructed from video with 1% (left) and 5% (right) of original pixels

As we can see, with 5% we have complete overlap; hence showing results for higher percentages of conserved pixels is unnecessary.

Fig. 5: 4/75 still frames from original video
Fig. 6: 4/75 still frames from reconstructed video (1%)
Fig. 7: 4/75 still frames from reconstructed video (5%)
Fig. 8: 4/75 still frames from reconstructed video (10%)
Fig. 9: 4/75 still frames from reconstructed video (20%)
Fig. 10: Time spent on the Compressive Sensing algorithm for the two videos

Experiment II

The second experiment was conducted using a much more complicated video than the previous one. The motion of the ball is more complex, the background is diverse, and the object casts a shadow (figure 5).
The duration of the video is 2.3667 seconds, the size of each frame is 512x512 pixels, and the frame rate is 30 frames per second.
In figures 11 to 13 we can see that as the percentage of retained pixels decreases, the deviation grows. With 45% of retained pixels a complete match is achieved, while with 1% the object can still be tracked with decent precision, but the trace is shifted from the original.
Experiments were run on three different computers with different specifications. The table provided in fig. 10 shows the elapsed time of the compressive sensing algorithm for the different percentages.
In addition, we present PSNR (peak signal-to-noise ratio) results in figures 14 and 15. As the figures show, we get a high PSNR when compressive sensing is applied to the first video: the video retains its quality even when only 1% of the pixels are sensed, which is expected given the 'simplicity' of the first video. But the results of the second experiment are quite different: given the complexity of the second video, more errors are introduced when 1% of the pixels are used. The first video has a much higher PSNR than the second, hence its quality is better. When 45% of the pixels are retained, the first video has a PSNR of about 80 dB while the second has a PSNR of about 30 dB.
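PSNR compares an original frame with its reconstruction via PSNR = 10·log10(MAX²/MSE). A small helper, assuming 8-bit frames so MAX = 255:

```python
import numpy as np

def psnr(original, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB; identical frames give infinity."""
    mse = np.mean((original.astype(float) - reconstructed.astype(float)) ** 2)
    if mse == 0:
        return np.inf
    return 10.0 * np.log10(peak ** 2 / mse)

a = np.zeros((8, 8))
b = a + 10.0                  # constant error of 10 gray levels -> MSE = 100
print(round(psnr(a, b), 2))   # → 28.13
```

Because PSNR depends only on the mean squared error, a frame with a busy background (as in the second video) accumulates reconstruction error over many pixels and therefore scores much lower than a mostly blank frame.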

Fig. 11: Trajectory of the ball reconstructed from video with 1% (left) and 5% (right) of original pixels
Fig. 12: Trajectory of the ball reconstructed from video with 10% (left) and 20% (right) of original pixels
Fig. 13: Trajectory of the ball reconstructed from video with 30% (left) and 45% (right) of original pixels
Fig. 14: PSNR of the first video with different percentages of pixels conserved
Fig. 15: PSNR of the second video with different percentages of pixels conserved

IV Conclusion

In this paper, we performed compressive sensing and object tracking on two videos. The second video was less sparse than the first: the first has fewer details, a blank background and smoother movement. All of that led to the conclusion that even with 1% of the pixels retained, the first video closely resembled the original, and the object tracking was still quite accurate. Due to the complexity of the second video, the results were not as satisfactory as with the first one: even with 10% of the pixels conserved, a clear decline in quality can be seen, which results in the reconstructed path of the targeted object being shifted relative to the original. Nevertheless, if this algorithm is used when rigorous accuracy is not the main goal, the results are still adequate.


  • [1] R. F. Marcia and R. M. Willett, ”Compressive coded aperture video reconstruction”, 16th European Signal Processing Conference (EUSIPCO 2008), Lausanne, Switzerland, August 25-29, 2008.
  • [2] S. Stankovic, I. Orovic, E. Sejdic, ”Multimedia Signals and Systems: Basic and Advance Algorithms for Signal Processing”, Springer-Verlag, New York, 2015.
  • [3] Y. C. Eldar, G. Kutyniok, ”Compressed Sensing Theory and Applications”, Cambridge University Press; 1 edition (June 29, 2012), 558 pages, ISBN-10: 1107005582.
  • [4] LJ. Stankovic, S. Stankovic, M. Amin, ”Missing Samples Analysis in Signals for Applications to L-estimation and Compressive Sensing,” Signal Processing, vol. 94, pp. 401-408, Jan. 2014.
  • [5] M. Elad, ”Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing”, Springer, 2010.
  • [6] D. Mackenzie ”Compressed Sensing Makes Every Pixel Count”, What’s Happening in the Mathematical Sciences, Volume 7, pp. 115 – 127, 2009.
  • [7] E. Sejdic, I. Orovic, S. Stankovic, ”Compressive sensing meets time-frequency: An overview of recent advances in time-frequency processing of sparse signals”, Volume 77, June 2018, Pages 22-35.
  • [8] I. Volaric, V. Sucic, S. Stankovic, ”A Data Driven Compressive Sensing Approach for Time-Frequency Signal Enhancement”, Volume 141, December 2017, Pages 229-239.
  • [9] E. Candes, J. Romberg, T. Tao, ”Stable Signal Recovery from Incomplete and Inaccurate Measurements”, Applied and Computational Mathematics, University of California, Los Angeles, CA 90095 February, 2005.
  • [10] A. Draganic, I. Orovic, S. Stankovic, “On some common compressive sensing recovery algorithms and applications - Review paper,” Facta Universitatis, Series: Electronics and Energetics, Vol 30, No 4 (2017), pp. 477-510.
  • [11] S. Stankovic, I. Orovic, ”An Approach to 2D Signals Recovering in Compressive Sensing Context,” Circuits Systems and Signal Processing, April 2017, Volume 36, Issue 4, pp. 1700-1713, 2016.
  • [12] G. Pope, “Compressive Sensing: a Summary of Reconstruction Algorithms”, Eidgenossische Technische Hochschule, Zurich, Switzerland, 2008.
  • [13] M. Brajovic, I. Orovic, M. Dakovic, S. Stankovic, ”Gradient-based signal reconstruction algorithm in the Hermite transform domain,” Electronic Letters, Volume: 52, Issue: 1, pp. 41 - 43, 2015
  • [14] I. Orovic, A. Draganic, S. Stankovic, ”Sparse Time-Frequency Representation for Signals with Fast Varying Instantaneous Frequency,” IET Radar, Sonar and Navigation, Vol. 9, Issue: 9, pp. 1260 - 1267, ISSN: 1751-8784.
  • [15] A. G. Rad, A. Dehghani, and M. R. Karim, “Vehicle speed detection in video image sequences using CVS method”, International Journal of the Physical Sciences, vol. 5(17), pp. 2555-2563, 2010
  • [16] N. Saunier and T. A. Sayed, “Automated Road Safety Analysis with Video Data”, Transportation Research Records: Journal of Transportation Research Board, no. 2019, pp. 57-64, 2007
  • [17] S. Stankovic, I. Orovic, A. Krylov, ”Video Frames Reconstruction based on Time-Frequency Analysis and Hermite projection method,” EURASIP Journal on Advances in Signal Processing,Special Issue on Time-Frequency Analysis and its Application to Multimedia signals, Vol. 2010, Article ID 970105, 11 pages, 2010
  • [18] I. Orovic, S. Park, S. Stankovic, ”Compressive sensing in Video applications,” 21st Telecommunications Forum TELFOR, Nov. 2013.
  • [19] I. Djurovic, S. Stankovic, ”Estimation of time-varying velocities of moving objects in video-sequences by using time-frequency representations,” IEEE Transactions on Image Processing, Vol.12, No.5, pp.550-562, 2003.
  • [20] A. Hakeem, K. Shafique, and M. Shah, “An object based video coding framework for video sequences obtained from static cameras,” in Proceedings of the 13th annual ACM International Conference on Multimedia, pp. 608-617, Singapore, November 2005.
  • [21] I. Djurovic, S. Stankovic, A. Ohsumi, H. Ijima, ”Motion parameters estimation by new propagation approach and time-frequency representations,” Signal Processing Image Communications, vol. 19, No. 8 pp. 755-770, 2004
  • [22] Jiyan Pan, Bo Hu, and Jian Qiu Zhang,”An Efficient Object Tracking Algorithm with Adaptive Prediction of Initial Searching Point”, Dept. of E. E., Fudan University. 220 Handan Road, Shanghai 200433, P.R.
  • [23] M. A. Davenport, M. F. Duarte, Y. C. Eldar, G. Kutyniok, ”Compressed Sensing Theory and Applications”, Cambridge University Press.