I. Introduction
Transmitting video continuously from resource-constrained sensors is challenging because video, one of the highest bit-rate signals, must be captured, compressed and transmitted as power-efficiently as possible. Classical distributed video coding (DVC) techniques implicitly assume that capturing video is power-efficient, and focus on compressing the video as efficiently as possible so that it can be transmitted viably from the sensor. This means that computationally intensive operations, such as motion-compensated prediction (MCP), have to be moved to the decoder. The resulting rise in transmission bit rate is countered by transmitting some of the frames at a lower bit rate, normally resulting in lower visual quality at the decoder. However, we can leverage the correlation between frames to restore video quality. The distributed source coding theorems of Slepian and Wolf [1973slepian] and Wyner and Ziv [1976wyner] tell us that correlated data can be transmitted, without exploiting the correlation at the encoder, at a lower rate than that required if the data were assumed uncorrelated, and the full information can still be recovered at the decoder.
[Figure: PSNR (dB) versus execution time per frame (s) for video CS algorithms, calculated for the first 17 frames of six CIF video sequences described in Section IV. SGS-OF [2020chen] PSNR and execution time are reported from the original paper, which employed a comparable simulation platform.]
Compressive sensing (CS) [2006Candes], [2006Donoho] represents a departure from the usual source coding paradigm of sampling an analogue signal at the Nyquist rate, digitizing the samples at the highest possible signal-to-quantization-noise ratio, and then using complex source coding algorithms to compress the data as much as possible. In CS, the capture and compress processes are combined to obviate the need for computationally intensive source coding algorithms. However, using information-theoretic arguments, Goyal [2008goyal] cautions that the compression factors achievable with CS alone are lower than those achievable by classical source coding paradigms. This means that some form of low-complexity source coding is still required prior to transmission.
CS senses a multidimensional signal $\mathbf{x} \in \mathbb{R}^N$ by performing the inner product between the components of the vector $\mathbf{x}$ and the rows of a measurement matrix $\Phi \in \mathbb{R}^{M \times N}$. The CS operation generates an $M$-dimensional compressed signal $\mathbf{y}$, that is $\mathbf{y} = \Phi \mathbf{x}$. This reduces the dimensionality of the signal from $N$ to $M$, achieving a compression ratio $r = M/N$. Note that a higher compression ratio means more measurements. Applying CS to video frames requires the storage of the measurement matrix at the encoder. Contemporary video CS techniques operate on a video signal with a group of pictures (GOP) structure. The leading key frame in the GOP is coded with a higher quality than the remaining non-key frames, typically with a compression ratio in the range 0.4 to 0.7 in the literature [2010Mun], [2016zhao], [2020chen]. This means that even for modest-resolution, common intermediate format (CIF) video, with 352×288 pixels per frame, the storage requirement of the measurement matrix is of the order of gigabytes when each matrix coefficient is represented by an integer (e.g., 16 bits), which is impractical for resource-constrained sensors. Generating the matrix on the fly would be too power inefficient. The solution proposed in the literature [2007Gan], [2012Fowler] is to break the video frames into sub-image blocks of size $B \times B$ pixels, where $B$ is typically 4, 8, 16 or 32. The same measurement matrix is then used to compressively sense each block.
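The storage saving of block-based sensing can be sketched as follows. This is illustrative code, not the paper's implementation; the random matrix, block size and rate are assumptions made for the example.

```python
import numpy as np

# Minimal sketch: block-based CS of a CIF-sized frame with one shared
# M x B^2 measurement matrix, instead of an impractically large
# full-frame matrix. Matrix type and sizes are assumptions.
B = 16                                   # block size (B x B pixels)
r = 0.1                                  # compression ratio r = M / B^2
M = int(r * B * B)                       # measurements per block

rng = np.random.default_rng(0)
phi = rng.standard_normal((M, B * B))    # shared per-block measurement matrix

frame = rng.random((288, 352))           # CIF frame stand-in (288 rows x 352 cols)
blocks = [frame[i:i + B, j:j + B].reshape(-1)
          for i in range(0, 288, B)
          for j in range(0, 352, B)]
measurements = [phi @ x for x in blocks]  # y = Phi x, block by block

# Storage comparison: per-block matrix vs a full-frame matrix.
block_entries = M * B * B
full_entries = int(r * 288 * 352) * 288 * 352
```

With these sizes the per-block matrix has a few thousand entries, against billions for a full-frame matrix at the same rate.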
However, block-based CS of video frames comes with two problems. The first arises from the fact that the number of measurements required by CS theory is proportional to the sparsity $K$ of the signal, that is $M \geq c K \log(N/K)$, where $c$ is a small constant [baraniuk2008simple]. The sparsity $K$ is the number of nonzero coefficients, in some domain $\Psi$, that can represent the signal with the desired quality. If the video frame is processed as one block, there is one sparsity value for the frame. With block-based processing, each block has its own sparsity level, which changes from frame to frame. In adaptive block-based image CS, the sparsity of each block is estimated prior to compressively sensing it [2020zammit]. The challenge is to estimate the sparsity with as low a complexity and overhead as possible. The second problem caused by block-based coding is that of blocking artefacts appearing in the reconstructed image if the CS reconstruction is also block-by-block. This can be solved by applying deblocking filters, or by sensing the image block-by-block but reconstructing it as a whole, using a full-image sensing matrix $\Phi = \mathrm{diag}(\Phi_B, \ldots, \Phi_B)$ composed of the block sensing matrices $\Phi_B$ on its diagonal [2007Gan].
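The equivalence between block-by-block sensing and whole-image sensing with a block-diagonal matrix can be checked numerically; the sketch below uses tiny assumed sizes for illustration.

```python
import numpy as np
from scipy.linalg import block_diag

# Sketch of the full-image sensing matrix idea in [2007Gan]: the same
# block matrix Phi_B repeated along the diagonal. Sensing block by
# block equals sensing the block-ordered vectorized image as a whole.
B, M, n_blocks = 4, 8, 3                     # tiny sizes for illustration
rng = np.random.default_rng(1)
phi_b = rng.standard_normal((M, B * B))      # per-block matrix Phi_B

phi_full = block_diag(*[phi_b] * n_blocks)   # block-diagonal full matrix

x = rng.random(n_blocks * B * B)             # image vectorized block by block
y_full = phi_full @ x                        # whole-image sensing
y_block = np.concatenate([phi_b @ x[k * B * B:(k + 1) * B * B]
                          for k in range(n_blocks)])
```

The two measurement vectors agree exactly, which is why the decoder is free to reconstruct the whole image even though sensing was block-wise.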
In the following, bold capital letters represent matrices, bold lower-case letters represent vectors, and normal-case letters scalar values. Consider a multidimensional signal $\mathbf{x} \in \mathbb{R}^N$ with a sparse representation in some domain $\Psi$, that is $\mathbf{x} = \Psi \mathbf{s}$. Then the compressive samples of $\mathbf{x}$, produced by the measurement matrix $\Phi$, are given by $\mathbf{y} = \Phi \Psi \mathbf{s} + \mathbf{e}$, where $\mathbf{e}$ is the measurement noise. The reconstruction of $\mathbf{x}$ requires the solution of this equation. However, this is an ill-posed problem because the sensing matrix $\mathbf{A} = \Phi \Psi$ has fewer rows than columns ($M < N$). A tractable solution can be pursued by first casting the problem as a convex program:
$\hat{\mathbf{s}} = \arg\min_{\mathbf{s}} \|\mathbf{s}\|_1 \ \ \text{subject to} \ \ \|\mathbf{y} - \Phi \Psi \mathbf{s}\|_2 \leq \epsilon$   (1)
where $\epsilon$ is a measure of the noise level, and solving it using state-of-the-art solvers. This is non-trivial and time consuming, and a significant number of reconstruction techniques have been proposed to accelerate the solution, such as matching pursuit [1993mallat], Bayesian methods [2009baron], approximate message passing (AMP) [2009Donoho] and denoising AMP (D-AMP) [2016Metzler].
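A simple member of this family of accelerated solvers is iterative soft thresholding. The sketch below is illustrative only (it is not one of the cited algorithms, and the problem sizes and regularization weight are assumptions): a gradient step on the data-fit term followed by soft thresholding, which promotes the $\ell_1$ objective of (1).

```python
import numpy as np

# Illustrative ISTA sketch for the sparse recovery problem behind (1):
# gradient descent on ||y - A s||_2^2 interleaved with soft
# thresholding. Sizes, seed and lambda are assumptions for the demo.
def ista(A, y, lam=0.05, n_iter=300):
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
    s = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = s + A.T @ (y - A @ s) / L        # gradient step
        s = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # soft threshold
    return s

rng = np.random.default_rng(2)
A = rng.standard_normal((40, 100)) / np.sqrt(40)   # sensing matrix A = Phi Psi
s_true = np.zeros(100)
s_true[[5, 37, 80]] = [1.5, -2.0, 1.0]             # 3-sparse ground truth
y = A @ s_true                                     # noiseless measurements
s_hat = ista(A, y)
```

With 40 measurements of a 3-sparse length-100 vector, the iteration recovers the support and approximate amplitudes, illustrating why far fewer than $N$ samples suffice for sparse signals.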
Recently, convolutional neural networks (CNNs) have been applied to solve equation (1) [2015mousavi], [2017metzler], [2018Zhao], [2020zammit]. A number of authors have proposed CNNs that can reconstruct compressively sensed images in tens to hundreds of milliseconds, such as ReconNet [2016kulkarni], CSNet [2017shi], ISTA-Net and ISTA-Net+ [2018zhang], and SCSNet [2019shi]. The reconstruction times of these CNNs allow CS video to be transmitted at tens of frames per second, but without exploiting temporal correlation between frames, generating large bit rates.
Video CS techniques have been proposed that exploit temporal correlation at the decoder by multi-hypothesis prediction, motion-compensated prediction or optical flow, such as ME/MC [2010Mun], RRS [2016zhao] and SGS-OF [2020chen]. Recently, video CS (VCS) has been reconstructed using CNNs, for example VCSNet [2020ashi].
In this paper we present a real-time video compressive sensing framework that leverages plug-and-play CNNs to exploit temporal correlation between frames at the decoder, using video-frame interpolation (VFI). The DAIN [2019bao] VFI CNN allows our VCS adaptive linear DCT VFI (VAL-VFI) algorithm, shown in figure I, to achieve state-of-the-art PSNR and MS-SSIM [2003wang] performance. The main contributions of this paper are as follows:

- In a departure from the current focus on designing decoders that execute solely on the GPU, we design hybrid decoders that use the GPU to accelerate the two computationally intensive components of our proposed algorithms: video frame interpolation and full-image reconstruction.

- We propose a video CS framework using adaptive linear DCT measurements (VAL) that exploits temporal correlation at the encoder and the decoder. At the encoder, two algorithms, THI and MDD, are proposed to leverage temporal correlation by adapting the block measurements based on the transform coefficients measured in previous frames.

- We prove that using a linear 2D-DCT measurement matrix allows temporal DPCM at the decoder by simply filtering and mixing transform coefficients from the key and non-key frames.

- Coupled with the above, and using a plug-and-play, CNN-based video frame interpolation module, we reach state-of-the-art PSNR and MS-SSIM performance, with an execution time that is two orders of magnitude lower than current state-of-the-art methods.

- Additionally, we can reconstruct the video using the recently proposed iterative denoising algorithm (IDA) [2020zammit], which can compressively reconstruct a full image from the adaptively and deterministically sensed blocks, to improve the video quality.
II. Related work
Fowler and Mun [2011mun] proposed the MC-BCS-SPL algorithm, which incorporates block-based motion estimation (ME) and compensation (MC) at the decoder. Within a GOP, they recover the first half of the frames with forward reconstruction starting from the intra-frame, and the remaining frames by reconstructing them from the reference frame in the next GOP. They then refine the frames by bidirectional prediction from the previous and following frames. Chen et al. [2011Chen] took inspiration from this framework and proposed multi-hypothesis prediction (BCS-SPL-MH) as an improvement for both still images and video. An initial reconstruction is performed using standard BCS-SPL [2011mun], and weighted Tikhonov regularization is then used to form predictions of the residual to iteratively improve the reconstruction.
Recently, Zhao et al. [2018Zhao] proposed a two-phase hybrid intra-frame/inter-frame algorithm. The first phase exploits spatial correlation to produce high-quality reference frames. In the second, the reweighted sparsity of the residual difference between frames is recovered using an algorithm based on split Bregman iteration. Experimental results are presented for the first 16 frames of six popular CIF video sequences. The GOP consists of an intra-frame followed by seven inter-frames. Block-based sensing is assumed, using Gaussian random projections on pixel blocks. Key frames are sensed at a compression ratio of 0.7 and non-key frames at 0.2. Multi-hypothesis prediction is used to reconstruct the inter-frames. The proposed reweighted residual sparsity (RRS) scheme is compared against four representative video CS reconstruction methods and was shown to achieve state-of-the-art performance. However, the reported reconstruction time per frame exceeds 5 minutes.
SGS-OF [2020chen] is closest to our work, both in concept and in results. The authors adopt a GOP of size 8 and compressively sense key and non-key frames with compression ratios of 0.7 and 0.1 respectively. They propose a structural group sparsity (SGS) model and employ the alternating direction method of multipliers (ADMM) variant of the augmented Lagrangian method (ALM) to reconstruct the independent frames, then use the Coarse-to-Fine optical flow (OF) method to create a fused image from the forward and backward OF motion estimates. Finally, they use the BCS-SPL [2009mun] algorithm to refine the original reconstruction using residual estimation and compensation. The authors investigate both Gaussian and partial DCT measurement matrices and report large gains with the deterministic partial DCT matrices. They quote a reconstruction time of around 15 s per frame using the partial DCT sensing matrix.
III. Distributed video compressive sensing
Inspired by distributed video compressive sensing techniques, we develop the VAL framework described in this section and evaluate it empirically in Section IV. The block diagram of the proposed VAL-VFI algorithm is shown in figure 1. The GOP structure in figure 2 is adopted, wherein key frames are sensed with a high average key compression ratio $r_K = M/N$, where $N$ is the number of pixels and $M$ the number of measurements, so that the reconstructed quality is high. Non-key frames are sensed with a substantially lower compression ratio $r_{NK}$. We place the key frames at the start of the GOPs and then predict the non-key frames from the key frames in the current and following GOP, using VFI.
The encoder captures frames using two adaptive VAL-DD algorithms inspired by the AL-DCT-DD algorithm in [2020zammit]. $\mathbf{X}_K^g$ is the key frame in GOP $g$, and $\mathbf{X}_{NK}^{g,j}$ is the $j$th non-key frame in GOP $g$, where $j = 1, \ldots, G-1$ and $G$ is the GOP size. The encoder then transmits the compressively sensed 2D-DCT transform coefficients and the number of transform coefficients per block sensed in the $j$th frame of the GOP.
The decoder decodes the key and non-key frames in real time, using the inverse 2D-DCT in the VAL-DD decoder, and buffers one GOP's worth of transform coefficients and reconstructed non-key frames. This is necessary because the VFI algorithm requires the key frame in the current GOP $g$ and that in the following GOP $g+1$, besides the $j$th non-key frame in the current GOP.
The VFI block uses the two reconstructed key frames $\hat{\mathbf{X}}_K^g$ and $\hat{\mathbf{X}}_K^{g+1}$ to interpolate the non-key frames in between. The decoder then computes the temporal DPCM in the transform domain to predict the received frame from the VFI reference frame and the non-key transform coefficients. The DPCM output $\mathbf{X}_D$ is then compared with the VFI frame $\mathbf{X}_V$ in the best pixel discriminator block, which outputs pixels from $\mathbf{X}_D$ or from the reconstructed non-key frame $\hat{\mathbf{X}}_{NK}^{g,j}$ to produce the final reconstructed non-key frame. The reconstructed key frame equals $\hat{\mathbf{X}}_K^g$.
Better quality can be achieved if the VAL-DD reconstruction is carried out using the IDA algorithm with the DnCNN denoiser [2020zammit].
III-A. Adaptive block compressive sensing
At the encoder, transform coefficients (TCs) are sensed in two phases. Measurements from the first phase are used to estimate the number of adaptive measurements in the second phase. We propose two improved versions of the AL-DCT-DD algorithm in [2020zammit]: threshold over the whole image (THI) and mixed-mode DCT domain (MDD).
THI is defined in algorithm 1. It collects half the TCs equally from all blocks in phase one. These are then used to estimate the number of phase-two coefficients for each block, based on the proportion of the largest phase-one coefficients from all blocks in the image that fall in the current block.
When compressively sensing successive non-key frames in a video sequence, the encoder has the benefit of having collected significantly more TCs per block from the key frame, which is sensed at a substantially higher rate. We thus propose another algorithm that adapts the phase-two measurements based on the reference key-frame phase-one TCs, but replaces the lowpass TCs with the phase-one TCs of the current frame. We refer to this as the mixed-mode DCT domain (MDD) algorithm, described in algorithm 2.
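The THI allocation idea can be sketched as follows. This is a hedged illustration of the two-phase principle only: the exact selection, thresholding and rounding rules of algorithms 1 and 2 are not reproduced here, and the function name and sizes are assumptions.

```python
import numpy as np

# Sketch of the THI idea: allocate phase-2 coefficients per block in
# proportion to that block's share of the largest phase-1 coefficients
# image-wide. Selection and rounding rules are assumptions, not the
# published algorithm.
def thi_allocate(phase1, budget2):
    """phase1: list of per-block arrays of phase-1 TC values."""
    all_tc = np.concatenate([np.abs(p).ravel() for p in phase1])
    thresh = np.sort(all_tc)[-budget2]               # 'budget2'-th largest TC
    shares = np.array([(np.abs(p) >= thresh).sum() for p in phase1])
    return np.round(budget2 * shares / shares.sum()).astype(int)

rng = np.random.default_rng(3)
# Three toy blocks of increasing activity (scales 0.1, 1.0, 5.0).
phase1 = [rng.standard_normal(8) * s for s in (0.1, 1.0, 5.0)]
alloc = thi_allocate(phase1, budget2=8)
```

Blocks whose phase-one coefficients dominate the image-wide ranking receive most of the phase-two budget, matching the intuition that busier blocks are less sparse and need more measurements.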
III-B. IDA reconstruction
The IDA algorithm proposed in [2020zammit] can reconstruct adaptive block-based images sensed using the deterministic 2D-DCT measurement matrix. We propose to use IDA to increase the PSNR and MS-SSIM of the reconstructed video. IDA is an iterative thresholding algorithm that uses the CNN-based DnCNN denoiser [2017zhang].
III-C. Video frame interpolation
VFI has been extensively studied over the past decades to up-sample the frame rate of video content to match ever-increasing video monitor frame rates and to produce smoother playback and slow-motion effects. Recently, CNNs have been leveraged to generate interpolated frames in real time [2019nah]. One of the better-performing CNN-based systems is DAIN [2019bao], which can interpolate frames in real time. Key frames are encoded with a higher quality (i.e., with lower compression) than non-key frames. Therefore, we leverage video frames interpolated from two key frames by DAIN [2019bao] as high-quality estimates of non-key frames. We then use the VFI output frame as the input to the temporal DPCM block, which further improves video quality.
III-D. Differential pulse code modulation
Temporal DPCM exploits correlation between frames. Consider the hybrid DPCM/DCT codec in figure 3. A pixel block from key frame $\mathbf{X}_K$ is transformed using a 2D-DCT operation, generating transform coefficients $\mathbf{t}_K$. We define the compressive LDCT-ZZ operation as that operation which retains lowpass transform coefficients in JPEG [jpeg] zigzag order. $n_L$ is the number of transform coefficients retained in a non-key frame, and $n_M$ is the number of extra midband DCT coefficients retained in a key frame, over and above those in the non-key frame. The resulting transform coefficients are transmitted to the decoder, where an inverse 2D-DCT reconstructs the key frame estimate $\hat{\mathbf{x}}_K$.
To exploit the correlation between a key frame block and a non-key frame block, the key frame block estimate $\hat{\mathbf{x}}_K$ is subtracted from the non-key frame block. The resulting difference block is transformed using the 2D-DCT, generating transform coefficients that are lowpass filtered, retaining only upper-triangular transform coefficients. At the decoder, the difference coefficients are inverse transformed to reconstruct an estimate of the difference block, which is added to the key frame block estimate to estimate the non-key frame block $\hat{\mathbf{x}}_{NK}$.
Let the key frame pixel block be arranged into a column vector $\mathbf{x}_K$. The linear unitary 2D-DCT transform then transforms $\mathbf{x}_K$ into the column vector of TCs
$\mathbf{t}_K = \mathbf{T} \mathbf{x}_K$   (2)
where $\mathbf{T}$ is the transform matrix and $\mathbf{T}^{-1} = \mathbf{T}^\top$. The lowpass filtering operation is accomplished by element-wise multiplication (denoted by $\odot$) of $\mathbf{t}_K$ with a mask consisting of 1's in the upper left-hand triangle and 0's elsewhere, arranged as a vector. For the key frame, the mask covers both the $n_L$ lowpass coefficients, selected by $\mathbf{m}_l$, and the $n_M$ midband coefficients, selected by the disjoint mask $\mathbf{m}_m$, such that the LDCT-ZZ vectorized coefficients of the key frame are given by
$\tilde{\mathbf{t}}_K = (\mathbf{m}_l + \mathbf{m}_m) \odot \mathbf{t}_K$   (3)
The LDCT-ZZ vectorized coefficients of the difference block are given by
$\tilde{\mathbf{t}}_d = \mathbf{m}_l \odot \big(\mathbf{t}_{NK} - (\mathbf{m}_l + \mathbf{m}_m) \odot \mathbf{t}_K\big) = \mathbf{m}_l \odot \mathbf{t}_{NK} - \mathbf{m}_l \odot \mathbf{t}_K$   (4)
since $\mathbf{m}_l \odot \mathbf{m}_l = \mathbf{m}_l$ and $\mathbf{m}_l \odot \mathbf{m}_m = \mathbf{0}$, given that $\mathbf{m}_l$ and $\mathbf{m}_m$ are disjoint.
Equation (3) shows that the key frame entails transmitting the lowpass and midband LDCT-ZZ coefficients. Equation (4) shows that the DPCM difference transform coefficients are composed of the lowpass coefficients of the non-key frame minus the lowpass coefficients of the key frame. Therefore, the encoder can be simplified to just two LDCT-ZZ encoders: one for the key frames, retaining $n_L + n_M$ transform coefficients, and one for the non-key frames, retaining just $n_L$ coefficients. The difference block coefficients can then be calculated at the decoder by subtracting the key frame lowpass coefficients $\mathbf{m}_l \odot \mathbf{t}_K$ from the non-key frame lowpass coefficients $\mathbf{m}_l \odot \mathbf{t}_{NK}$.
III-E. Real-time reconstruction
The VAL-DD encoder transmits the transform coefficients and the number of transform coefficients per block to the decoder. For the key frames, the decoder then reconstructs each image block using the inverse linear 2D-DCT. The reconstructed key block, arranged as a column vector, is given by
$\hat{\mathbf{x}}_K = \mathbf{T}^\top \tilde{\mathbf{t}}_K = \mathbf{T}^\top \big[(\mathbf{m}_l + \mathbf{m}_m) \odot \mathbf{t}_K\big]$   (5)
This can be achieved in real time using matrix multiplication or fast implementations of the inverse 2D-DCT.
The non-key frame can be computed by first calculating $\tilde{\mathbf{t}}_d$ at the decoder as described in Section III-D, multiplying it by the inverse 2D-DCT matrix to generate the difference block, and then adding it to the key frame estimate. In vector form,
$\hat{\mathbf{x}}_{NK} = \hat{\mathbf{x}}_K + \mathbf{T}^\top \big[\mathbf{m}_l \odot \mathbf{t}_{NK} - \mathbf{m}_l \odot \mathbf{t}_K\big]$   (6)
However, we can also write
$\hat{\mathbf{x}}_{NK} = \mathbf{T}^\top \big[\mathbf{m}_l \odot \mathbf{t}_{NK} + \mathbf{m}_m \odot \mathbf{t}_K\big]$   (7)
Thus the non-key frame can also be computed by first mixing (adding) the midband transform coefficients of the key frame to the lowpass transform coefficients of the non-key frame, and then calculating the inverse 2D-DCT, as shown in the distributed version of the DPCM operation in figure 4. Note that in figure 1, $\hat{\mathbf{X}}_K^g$ corresponds to the key frames in figure 4, and $\hat{\mathbf{X}}_{NK}^{g,j}$ to the non-key frames for $j = 1, \ldots, G-1$.
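The equivalence between the DPCM form (6) and the mixing form (7) follows from the linearity of the transform, and can be checked numerically. The sketch below uses SciPy's orthonormal 2D DCT on an 8×8 block; the anti-diagonal shape of the lowpass and midband masks is an assumption made for illustration.

```python
import numpy as np
from scipy.fft import dctn, idctn

# Numerical check: adding the key frame's midband coefficients to the
# non-key frame's lowpass coefficients and inverse transforming (7)
# reproduces the DPCM reconstruction (6). Mask shapes are assumptions.
B = 8
idx = np.add.outer(np.arange(B), np.arange(B))    # zigzag band index i + j
m_l = (idx < 3).astype(float)                     # lowpass mask
m_m = ((idx >= 3) & (idx < 6)).astype(float)      # disjoint midband mask

rng = np.random.default_rng(4)
x_k, x_n = rng.random((B, B)), rng.random((B, B))  # key / non-key blocks
t_k = dctn(x_k, norm='ortho')
t_n = dctn(x_n, norm='ortho')

x_k_hat = idctn((m_l + m_m) * t_k, norm='ortho')            # eq. (5)
x_dpcm = x_k_hat + idctn(m_l * (t_n - t_k), norm='ortho')   # eq. (6)
x_mix = idctn(m_l * t_n + m_m * t_k, norm='ortho')          # eq. (7)
```

Both reconstructions agree to machine precision, which is what allows the decoder to perform temporal DPCM by simple filtering and mixing of transform coefficients.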
III-F. Best pixel discriminator
In unchanging parts of a frame, selecting key-frame blocks results in a higher-quality rendition. VFI yields high-quality estimates of key-frame blocks, and the subsequent DPCM process combines them with the lowpass current blocks. If the VFI is not accurate, image artefacts appear in the output blocks; this occurs where the image is changing most dynamically.
Following reconstruction of the current non-key frame and VFI from the key frames, we have three versions of the reconstructed frame: $\hat{\mathbf{X}}_{NK}$, the low-resolution non-key frame with no artefacts; $\mathbf{X}_V$, the higher-resolution VFI-predicted frame, with the possibility of image artefacts; and $\mathbf{X}_D$, the DPCM output, which may also have visible artefacts. To reduce the visibility of VFI artefacts, one option is to average the reconstructions together, and this does indeed result in a better PSNR and MS-SSIM.
A second option is to compare the pixels at each location in both $\mathbf{X}_V$ and $\mathbf{X}_D$: if the modulus of the difference is less than a threshold $\tau$, the output pixel is taken from $\mathbf{X}_D$; otherwise it is taken from $\hat{\mathbf{X}}_{NK}$.
This selection operation was found to improve the output quality considerably. The value of $\tau$ was investigated empirically. One value of $\tau$ was found to give good results with this best pixel discriminator (BPD) block in the VAL-VFI versions of our algorithms, whereas a different value gave the best results with VAL-IDA-VFI.
The BPD results in lowpass pixels at the edges of moving objects in the current output frame. However, the visual quality is still better, as the human visual system is less sensitive to errors at moving boundaries.
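The per-pixel selection rule described above can be sketched in a few lines; the function name and the toy threshold are illustrative, not the tuned values from the paper.

```python
import numpy as np

# Sketch of the best pixel discriminator: where the VFI prediction and
# the DPCM output agree to within tau, keep the DPCM pixel; otherwise
# fall back to the reconstructed non-key frame pixel.
def best_pixel_discriminator(x_v, x_d, x_nk, tau):
    agree = np.abs(x_v.astype(float) - x_d.astype(float)) < tau
    return np.where(agree, x_d, x_nk)

x_v = np.array([[100.0, 50.0], [10.0, 200.0]])    # VFI prediction
x_d = np.array([[102.0, 90.0], [11.0, 120.0]])    # DPCM output
x_nk = np.array([[105.0, 55.0], [12.0, 190.0]])   # non-key reconstruction
out = best_pixel_discriminator(x_v, x_d, x_nk, tau=10)  # tau is illustrative
```

In this toy example the first column agrees within the threshold, so the DPCM pixels survive; the second column disagrees, so the low-resolution non-key pixels are used instead.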
III-G. VAL-VFI parameters
The VAL-VFI framework is characterised by a number of parameters that have to be taken into consideration and ideally optimized, namely: DCT block size $B$; GOP size $G$; compression ratio (or subrate) for key and non-key frames; IDA damping factor and number of iterations, if used; and BPD threshold $\tau$.

Table I: Average PSNR (dB), MS-SSIM of the luminance (Y) plane of the first 16 frames of the VidSet6 sequences. The average compression ratio per GOP is 0.175 in all cases; the GOP size G and key compression ratio r_K vary per column as discussed in Section IV.

| Sequence | ME/MC (G=8, r_K=0.7) | RRS (G=8, r_K=0.7) | SGS-OF (G=8, r_K=0.7) | VAL-VFI (G=8, r_K=0.7) | VAL-IDA-VFI (G=8, r_K=0.7) | VAL-IDA-VFI (G=8, r_K=0.6) | VAL-VFI (G=4, r_K=0.5) | VAL-IDA-VFI (G=4, r_K=0.5) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Paris | 29.63, 0.9872 | 27.07, 0.9853 | 32.00, 0.9946 | 32.28, 0.9933 | 33.82, 0.9932 | 31.79, 0.9938 | 31.26, 0.9941 | 33.72, 0.9953 |
| Foreman | 32.85, 0.9829 | 36.51, 0.9943 | 37.20, 0.9952 | 35.45, 0.9903 | 37.19, 0.9891 | 37.10, 0.9924 | 35.51, 0.9916 | 36.61, 0.9920 |
| Coastguard | 30.14, 0.9470 | 31.20, 0.9535 | 32.40, 0.9692 | 32.35, 0.9680 | 32.70, 0.9670 | 32.10, 0.9745 | 34.31, 0.9832 | 34.49, 0.9833 |
| Hall | 41.12, 0.9960 | 35.68, 0.9947 | 40.30, 0.9953 | 42.02, 0.9959 | 42.21, 0.9946 | 41.19, 0.9951 | 41.05, 0.9959 | 41.78, 0.9959 |
| Mobile | 21.39, 0.9059 | 23.96, 0.9610 | 27.13, 0.9897 | 28.33, 0.9900 | 28.75, 0.9896 | 27.33, 0.9890 | 27.82, 0.9903 | 29.45, 0.9929 |
| News | 38.39, 0.9967 | 34.11, 0.9965 | 39.88, 0.9978 | 40.91, 0.9974 | 42.18, 0.9973 | 41.11, 0.9978 | 40.77, 0.9977 | 41.39, 0.9978 |
| Average | 32.26, 0.9693 | 31.42, 0.9809 | 34.82, 0.9903 | 35.22, 0.9892 | 36.14, 0.9885 | 35.10, 0.9904 | 35.12, 0.9921 | 36.24, 0.9929 |
IV. Simulation results
The simulations in this section were executed on a server equipped with an Intel Xeon CPU E5160 v3 clocked at 3.50 GHz, with 32 GB of RAM, running MATLAB version 2019a on Linux 18.04. IDA requires the D-AMP toolbox from [2017DAMP]. The DnCNN code and models were downloaded from [2017DnCNN] and require the MatConvNet [2015vedaldi] package from [2017MatConvNet]. The DAIN code was downloaded from the author's site [2020DAIN]. We compiled it with PyTorch version 1.1 and it runs under Python 3.7.9. The VAL-IDA code exchanges key and non-key frames with DAIN at run time, using a file-based interface. The source code for this paper is available at: https://github.com/jzamm/val. Note that the algorithms presented in this paper extend to higher-resolution images; the included code can be used without any modifications. The image test sets below were chosen to allow comparison with published work claiming state-of-the-art performance for which the source code was not available.

The VAL-VFI and VAL-IDA-VFI algorithms were compared with three algorithms in the literature: ME/MC [2011Chen], RRS [2016zhao] and SGS-OF [2020chen]. Six popular CIF video sequences were used for testing, as in [2020chen]; we refer to this set as VidSet6. The first seventeen frames of VidSet6 were used, with the seventeenth frame used by VFI to compute key frames in the second GOP, but PSNR and MS-SSIM [2003wang] results are only reported for the luminance (Y) plane of the first sixteen frames. The simulation uses blocks of size B = 16. The key compression ratio $r_K$, non-key compression ratio $r_{NK}$ and GOP size $G$ are related by the following equation:
$\bar{r} = \dfrac{r_K + (G-1)\, r_{NK}}{G}$   (8)
This ensures a constant average number of CS measurements per GOP. $G$, $r_K$ and $r_{NK}$ were initially set to 8, 0.7 and 0.1 respectively, and then varied to optimize the PSNR and MS-SSIM results as indicated below. Note that the higher the compression ratio, the more measurements are collected by the compressive sensing.
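The constant-average-rate relation can be rearranged to find the non-key rate for any key rate and GOP size. In this sketch, equation (8) is assumed to be r_avg = (r_K + (G - 1) * r_NK) / G, consistent with the settings quoted in the text (r_K = 0.7, r_NK = 0.1, G = 8 giving an average of 0.175).

```python
# Solve the assumed constant-average-rate relation
# r_avg = (r_key + (gop - 1) * r_nonkey) / gop for r_nonkey.
def nonkey_ratio(r_avg, r_key, gop):
    return (r_avg * gop - r_key) / (gop - 1)

r_nk_g8 = nonkey_ratio(0.175, 0.7, 8)   # initial settings from the text
r_nk_g4 = nonkey_ratio(0.175, 0.5, 4)   # reduced GOP with a lower key rate
```

Reducing the GOP size while keeping the average rate fixed forces a lower key rate or non-key rate, which is the trade-off explored in the optimization below.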
The original ME/MC code from [2012fowlercode] and RRS code from [2018Zhao] were modified to set the measurement matrix to the same DCT measurement matrix used by our algorithms. The use of the lowpass DCT measurement matrix improves the results of both ME/MC and RRS over their respective published results. SGS-OF code was not available; therefore, the SGS-OF results in table I are from the original paper [2020chen].
IV-A. PSNR and MS-SSIM
The average PSNR and MS-SSIM results of VAL-VFI and VAL-IDA-VFI are compared with ME/MC, RRS and SGS-OF in table I. When the GOP size is 8, and the key and non-key frame compression ratios are 0.7 and 0.1 respectively, VAL-VFI exceeds the PSNR of ME/MC, RRS and SGS-OF by 2.96 dB, 3.80 dB and 0.40 dB respectively. It also exceeds the MS-SSIM of ME/MC and RRS by 0.0199 and 0.0083, but is 0.0011 short of SGS-OF. IDA reconstruction improves the VAL-VFI PSNR by 0.92 dB but decreases the MS-SSIM marginally, by 0.0007.
The GOP, $r_K$ and $r_{NK}$ values were then varied to optimise the VAL-VFI and VAL-IDA-VFI performance. When $G = 8$, $r_K = 0.6$ and $r_{NK}$ is computed by equation (8), VAL-IDA-VFI exceeds the state-of-the-art PSNR and MS-SSIM performance of SGS-OF by 0.28 dB and 0.0001. If the GOP size is reduced to 4, with $r_K = 0.5$, the VAL-VFI PSNR and MS-SSIM exceed those of SGS-OF by 0.30 dB and 0.0018. VAL-IDA-VFI with $G = 4$ and $r_K = 0.5$ achieves our best results and improves on the VAL-VFI PSNR and MS-SSIM by 1.12 dB and 0.0008 respectively.
Figure 6 compares the PSNR of the Y component of the first 16 frames of the VidSet6 sequences. The GOP size is 8 in all cases except for VAL-IDA-VFI*, for which the GOP size is 4. The compression ratio of the key frame is 0.7 for ME/MC, RRS and SGS-OF, whereas it is 0.6 for VAL-IDA-VFI and 0.5 for VAL-IDA-VFI*. The average compression ratio over the whole GOP is 0.175 in all cases. As can be seen from the figure, the non-key PSNR of VAL-IDA-VFI* is superior to the other algorithms in the Paris, Coastguard and Mobile sequences. VAL-IDA-VFI has the best non-key PSNR in the Hall and News sequences, whereas SGS-OF prevails in the Foreman sequence. The variability of the PSNR is least for VAL-IDA-VFI*, with the key compression ratio equal to 0.5.
IV-B. Visual quality and execution time
The visual quality of the fourteenth frame of the Paris sequence produced by the VAL-VFI and VAL-IDA-VFI algorithms is compared with the original frame and the output of the ME/MC and RRS algorithms in figure 5. VAL-VFI and VAL-IDA-VFI ($G = 4$, $r_K = 0.5$) both render the frame more sharply than ME/MC and RRS, and without the artefacts visible in the ME/MC output. VAL-IDA-VFI improves on the VAL-VFI quality. The non-optimized VAL-VFI reconstructs a frame in around 190 ms, VAL-IDA-VFI in 2.1 s (20 iterations of the DnCNN algorithm, PSNR = 36.15 dB, MS-SSIM = 0.9928), ME/MC in 15 s and RRS in 557 s. SGS-OF is reported in [2020chen] to reconstruct a frame in 15 s on a server with equivalent processing power.
V. Conclusion
We have proposed the VAL-VFI and VAL-IDA-VFI algorithms, which exploit adaptive, block-based compressive sensing of video frames in the spatial domain using deterministic DCT matrices. The reconstruction quality of the compressively sensed frames can be enhanced using an iterative denoising algorithm.
Our algorithms exploit temporal correlation at the encoder, using the MDD adaptivity estimation algorithm to match the compressive sensing ratio to the underlying block sparsity. At the decoder, we exploit temporal correlation using a GOP structure and a video frame interpolation CNN to predict non-key frames from higher-quality key frames. The quality of the VFI frames is then enhanced by performing temporal DPCM at the decoder. Finally, a best pixel discriminator selects the best pixel from the DPCM output or the reconstructed non-key frame, depending on the pixel error between the VFI prediction and the DPCM reconstruction. Simulation results show that our algorithms achieve state-of-the-art performance. The improvement in performance is shown in figure I.
Future work can study VAL-VFI in an end-to-end transmission system, digitally encoding the adaptive measurements prior to transmission.