A Compressive Sensing Video Dataset Using Pixel-wise Coded Exposure

05/24/2019 · Sathyaprakash Narayanan, et al. · Indian Institute of Science

An enormous amount of video data is generated every minute as we read this document, ranging from surveillance to broadcast footage. Two roadblocks restrain us from using this data as such: first, storage, since hardware constraints limit how much of the information we can keep; second, the computation required to process this data is highly expensive, which makes working on the raw footage infeasible. Compressive sensing (CS) [2] is a signal processing technique [11] in which, through optimization, the sparsity of a signal can be exploited to recover it from far fewer samples than the Shannon-Nyquist sampling theorem requires. Recovery is possible under two conditions. The first is sparsity, which requires the signal to be sparse in some domain. The second is incoherence, which is applied through the restricted isometry property and is sufficient for sparse signals [9][10]. To sustain these characteristics, preserving all attributes of the uncompressed domain would help any kind of research in this field. However, existing datasets fall short in terms of continuous tracking of all the objects present in the scene; very few video datasets provide comprehensive, continuous tracking of objects. To address these problems collectively, in this work we propose a new comprehensive video dataset in which the data is compressed using pixel-wise coded exposure [3], which resolves various other impediments as well.


1 Introduction

Surveillance videos are generated ubiquitously, and having humans monitor them constantly is quite a challenge. In most cases, a combination of anomaly detection, object detection, and similar algorithms is used to detect any aberration in this footage. The problem is that each frame has to be processed, and an immense amount of memory and resources is required to store and process the footage. Privacy has also been a constant fear that hinders the use of this raw footage. Compressive sensing comes to the rescue: the amount of data to be processed is reduced by K (the compression rate) times, which not only reduces the space needed to store the data but in turn helps expedite processing. A compressed frame (Fig. 1) is obtained by compressing K frames along the temporal window, which saves the memory needed to store K-1 frames. Hence, computation is required only once per window of K frames, directly on the compressed frame.
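Concretely, following the pixel-wise coded exposure formulation of Hitomi et al. [3] (the notation below is ours and is meant as a sketch of the idea rather than an exact restatement), a single coded frame I is formed from the K-frame volume F through a binary per-pixel shutter code S:

    I(x, y) = \sum_{t=1}^{K} S(x, y, t) \, F(x, y, t),  with  S(x, y, t) \in \{0, 1\}

where S(x, y, ·) is "on" for one contiguous bump of Tb consecutive frames inside the K-frame window.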

Figure 1: Compression using pixel-wise coded exposure

In a compressed frame (Fig. 2), the spatial information present is very hard to recover, even for a human or any conventional algorithm. This acts as an additional layer of encryption by encapsulating the motion information present in the data and allowing only the owner to reconstruct the whole scene. We leverage this fact to convert the videos into the sparse domain, reducing the memory required for storage and the overall processing power for any kind of signal processing performed directly on the compressed frames, while at the same time conserving the private information present. We hope to spark a new stream of research in this area.

Figure 2: Compressed frame

2 Challenges for the dataset

2.1 Video

  1. Video Processing
    During compression, all the frames must be in natural order without any kind of artificial transitions or angle cuts, since the compression would treat a transition as motion and incorporate it into the compressed frame. Hence, the dataset should adhere to the constraint that the videos contain no artificial processing such as angle cuts or transitions [2].

  2. Motion and Environment
    Being the first dataset of its kind, we aim to capture all sorts of motion, filling a spectrum that ranges over a moving object with a stationary background, a moving object with a changing background, a stationary object with a changing background, and a stationary object with a stationary background (Fig. 3).

Figure 3: Motion blur as both the object and the camera are in motion

2.2 Labelling

  1. Objects and Tracking
    Since in compressive sensing all the objects in a set of K frames are compressed onto a single frame, the result tends to contain abstruse information in the compressed frame. The videos in the dataset should therefore have labels for each frame, tracked along time (tracking) [8]. A few related works along the same lines contain tracking information for certain objects, but they fall short in that not all the objects present in the video are labeled, as shown in Fig. 7, and tracked along the frames. As this is an essential feature for any kind of operation performed directly on the compressed frame, we have to store this information for every frame of the video.

On a collective note, datasets end up being used in different countries for different purposes. Hence, scenes from all parts of the world that resemble a generic setting should be considered, so as to incorporate some variance into the dataset.


Thus, to address all the above challenges collectively and to leverage the opportunities available in this domain, we present our work, which comprehensively covers almost every aspect required for any kind of advancement in this field.

3 Dataset

We searched the YouTube-8M [7] dataset for videos which had only a single person or a single car. As all the variations were not available, we collected more videos of our own within our university campus for more training data. Since these are the preliminary stages of research, we made sure that only one object of interest was present in all of the videos captured. We captured videos on multiple days and during different parts of the day and tried to maintain as much variance as possible. The videos include objects moving with a stationary background, stationary objects with a moving background, and both the object and background moving (Fig. 4). The amount of movement in the frames was also varied.


Figure 4: (a) Both person and background in motion (b) Static background and moving person (c) Stationary person and moving background (d) Both car and camera in motion (e) Both background and object stationary (f) Static background and moving car

We chose Person and Car as the only classes (currently) because it was difficult to collect data for other classes, and these two are the most important object classes for surveillance tasks.

Hitomi et al. [3] showed that compression rates between 9X and 18X with a bump time (Tb) of 3-4 frames produce the best PSNR in the reconstruction of compressively sensed frames. We set the bump duration (Tb) to 3 frames and the compression to 13X for all the CS frames in the dataset, as these settings are most likely to be used in real-world cases. We also changed the sensing matrix while creating each CS frame, so that research on this dataset does not overfit a particular sensing matrix.

Being the first dataset of its kind, we provide the following preliminary label attributes for the dataset.

  • Object class

  • Bounding box information (tracked along the temporal window of frames)

3.1 Methodology

3.1.1 Compression Technique using PCE

We took videos at 30 FPS and compressed them with a 13X compression rate. The bump time (Tb) is set to 3 frames. For every 13 frames in the original video, a single compressed frame is generated by summing the pixel-wise coded exposures of those frames. The random sensing matrix is changed for each set of 13 frames, so as to generalize over compressively sensed frames and not fit a particular sensing matrix. We normalize the pixel values after summing them over the 13 frames to constrain the pixel values within 255 (Fig. 5). We only worked with monochrome videos because conventionally only monochromatic CMOS sensors [1] are available.
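A minimal sketch of this compression step, assuming a single contiguous exposure bump per pixel at a random offset inside the 13-frame window (the exact sensing-matrix construction and normalization in our pipeline may differ):

    import numpy as np

    def pce_compress(frames, bump_time=3, rng=None):
        # frames: (K, H, W) grayscale window; returns one coded frame (H, W).
        rng = np.random.default_rng() if rng is None else rng
        k, h, w = frames.shape
        # Random start of the exposure bump for every pixel; a fresh
        # sensing matrix is drawn for every window.
        start = rng.integers(0, k - bump_time + 1, size=(h, w))
        t = np.arange(k).reshape(k, 1, 1)
        # Binary sensing matrix S(x, y, t): 1 while the pixel's bump is active.
        S = ((t >= start) & (t < start + bump_time)).astype(frames.dtype)
        coded = (S * frames).sum(axis=0)
        # Normalize back into the 8-bit range so the result stores like an image.
        return (255.0 * coded / coded.max()).astype(np.uint8)

    # Example: a 13-frame window at 30 FPS -> one compressed frame (13X).
    window = np.random.randint(0, 256, size=(13, 240, 320)).astype(np.float32)
    cs_frame = pce_compress(window, bump_time=3)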

Figure 5: Pipeline for compression
3.1.2 Pseudo labels for ground truth

Since this was the first attempt to do object detection on CS frames, there was no dataset available publicly for this purpose, and we had to collect our own training data. Manual labeling of the CS frames is difficult, as the edges of moving objects are often indeterminable even to the human eye. To avoid this problem, we chose to pseudo-label the data. As existing architectures like YOLO v3 [4] detect the person and car classes with very good accuracy, we used a pre-trained YOLO v3 to pseudo-label the CS frames (Fig. 6).

Figure 6: Pseudo labelling

We also had to come up with a way to draw bounding boxes in the CS domain. We chose to draw a bigger bounding box in the CS frame which encloses all the individual bounding boxes of the constituent frames. Fig. 7 shows the method for merging the bounding boxes of each object across the constituent frames: we take the bounding box coordinates of an object in each frame, find the minimum and maximum of its X and Y coordinates, and use them as the coordinates of the bigger bounding box in the CS frame, as sketched below. Tracking a single object's detections across consecutive frames that contain multiple objects of the same class is difficult because YOLO does not order detections in any particular fashion. Because of this difficulty in attaching each bounding box to a specific object and tracking it along the frames, for the training dataset we collected videos which have only one object of interest, i.e., frames containing only one and the same car or person in each frame (Fig. 7).
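A compact sketch of that merge, assuming each per-frame box is an (x_min, y_min, x_max, y_max) tuple (the tuple layout is illustrative, not the dataset's exact annotation format):

    def merge_boxes(per_frame_boxes):
        # Enclose one object's boxes from all constituent frames in a single box.
        xs_min, ys_min, xs_max, ys_max = zip(*per_frame_boxes)
        return (min(xs_min), min(ys_min), max(xs_max), max(ys_max))

    # Boxes of the same car in three of the thirteen constituent frames.
    print(merge_boxes([(40, 60, 120, 160), (55, 62, 135, 158), (70, 65, 150, 161)]))
    # -> (40, 60, 150, 161)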

Figure 7: Merging bounding boxes
Figure 8: VATIC annotation

As shown in Fig. 5, we used a pre-trained YOLO model to generate the bounding boxes of the individual frames, which were then combined into larger bounding boxes that were used as the ground truth values of the bounding boxes in the CS frames.

We fed these pseudo-labeled boxes generated by YOLO, along with the corresponding frames, into VATIC [6], where a human corrected the errors made by the YOLO object detection model (Fig. 8). This kind of pseudo labeling made it easy to create a large dataset of annotated compressively sensed images.

We wanted less human intervention in the labeling, and the pseudo labels provided a starting point for human inspection. With pseudo labels in place, the human effort was reduced, and more attention could be directed towards tightening the bounding boxes to fit the objects more precisely and towards labeling the objects in the frame that the object detection model had missed (Fig. 8).

Figure 9: File content
3.1.3 File format

Frames: To store the dataset in a coherent format, we used the NPZ file format from NumPy, which provides compressed storage of array data. The imageio NPZ plugin supports data of any shape and also supports multiple images per file. The composition of these files is given in Fig. 9. This setup was chosen to simplify computation and to automate the process.
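A minimal sketch of writing and reading such a file with NumPy (the array key and file name are illustrative only, not the dataset's actual layout):

    import numpy as np

    # Hypothetical clip: 8 coded frames of size 240x320, stored as one array.
    cs_frames = np.random.randint(0, 256, size=(8, 240, 320), dtype=np.uint8)
    np.savez_compressed("clip_0001.npz", cs_frames=cs_frames)

    with np.load("clip_0001.npz") as data:
        restored = data["cs_frames"]  # (8, 240, 320) uint8 array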

Labels: A JSON file holds the object class and tracked bounding box information for every frame.
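A hypothetical example of such a label file, written from Python (the field names are illustrative; the released annotations may use a different schema):

    import json

    labels = {
        "clip": "clip_0001",
        "frames": [
            {"frame": 0, "class": "car", "bbox": [40, 60, 150, 161]},
            {"frame": 1, "class": "car", "bbox": [42, 61, 152, 162]},
        ],
    }
    with open("clip_0001.json", "w") as f:
        json.dump(labels, f, indent=2)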

4 Discussion and Future work

To our knowledge, this is the first dataset dedicated to compressive sensing. We hope to open a new line of research along these lines that addresses the pressing problems of computation and space constraints. The dataset contains a collection of videos captured by us and a significant portion of videos from the YouTube-8M dataset. The YouTube videos were manually inspected and clipped, so as to avoid any artificial transitions and angle cuts and to adhere to the compression requirements.

Being the first of its kind, we chose videos with very low object density so as to set a baseline for research in the field. The dataset has 284 clips of cars and 91 clips of persons, approximately 90 minutes of footage in total, with different aspect ratios. In the near future, we intend to release versions of the dataset with different compression rates and bump times, as well as videos with higher object density, so as to complete the dataset with all variations and fill the spectrum required for research and development in this domain. We also intend to release segmentation information, pose estimation, and other kinds of annotations for this dataset. In addition, we set the baseline for a new research topic that would reduce computation and hardware expenses to a great extent.

References

  • [1] Xiong, Tao and Zhang, Jie and Thakur, Chetan Singh and Rattray, John and Chin, Sang Peter and Tran, Trac D and Etienne-Cummings, Ralph, Real-time segmentation of on-line handwritten arabic script. In 2017 IEEE International Symposium on Circuits and Systems (ISCAS), pages 1–1, IEEE, 2017.
  • [2] Baraniuk, Richard G and Goldstein, Thomas and Sankaranarayanan, Aswin C and Studer, Christoph and Veeraraghavan, Ashok and Wakin, Michael B, Compressive video sensing: algorithms, architectures, and applications. In IEEE Signal Processing Magazine, Volume 34, pages 52–66, IEEE, 2017.
  • [3] Hitomi, Yasunobu and Gu, Jinwei and Gupta, Mohit and Mitsunaga, Tomoo and Nayar, Shree K, Video from a single coded exposure photograph using a learned over-complete dictionary. In 2011 International Conference on Computer Vision, pages 287–294, IEEE, 2011.
  • [4] Redmon, Joseph and Farhadi, Ali, YOLOv3: An incremental improvement. In arXiv preprint arXiv:1804.02767, 2018.
  • [5] Geiger, Andreas and Lenz, Philip and Urtasun, Raquel, Are we ready for autonomous driving? The KITTI vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
  • [6] Vondrick, Carl and Patterson, Donald and Ramanan, Deva, Efficiently scaling up crowdsourced video annotation. In International Journal of Computer Vision, Volume 101, number 1, pages 184–204, Springer, 2013.
  • [7] Abu-El-Haija, Sami and Kothari, Nisarg and Lee, Joonseok and Natsev, Paul and Toderici, George and Varadarajan, Balakrishnan and Vijayanarasimhan, Sudheendra, YouTube-8M: A large-scale video classification benchmark. In arXiv preprint arXiv:1609.08675, 2016.
  • [8] Wu, Yi and Lim, Jongwoo and Yang, Ming-Hsuan, Online object tracking: A benchmark. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2013.
  • [9] Donoho, David L, For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution. In Communications on Pure and Applied Mathematics, Volume 59, number 6, pages 797–829, Wiley, 2006.
  • [10] Davenport, M, The fundamentals of compressive sensing. In IEEE Signal Processing Society Online Tutorial Library, 2013.
  • [11] Erlich, Yaniv and Chang, Kenneth and Gordon, Assaf and Ronen, Roy and Navon, Oron and Rooks, Michelle and Hannon, Gregory J, DNA Sudoku—harnessing high-throughput sequencing for multiplexed specimen analysis. In Genome Research, Volume 19, number 7, pages 1243–1253, Cold Spring Harbor Laboratory Press, 2009.