Because image sensor chips have a finite bandwidth with which to read out pixels, recording video typically requires a trade-off between frame rate and pixel count. Compressed sensing techniques can circumvent this trade-off by assuming that the image is compressible. Here, we propose using multiplexing optics to spatially compress the scene, enabling information about the whole scene to be sampled from a row of sensor pixels, which can be read off quickly via a rolling shutter CMOS sensor. Conveniently, such multiplexing can be achieved with a simple lensless, diffuser-based imaging system. Using sparse recovery methods, we are able to recover 140 video frames at over 4,500 frames per second, all from a single captured image with a rolling shutter sensor. Our proof-of-concept system uses easily-fabricated diffusers paired with an off-the-shelf sensor. The resulting prototype enables compressive encoding of high frame rate video into a single rolling shutter exposure, and exceeds the sampling-limited performance of an equivalent global shutter system for sufficiently sparse objects.
All digital imaging sensors have a finite bit rate for exporting the digital measurement. This limited bit rate restricts the space-time bandwidth of the system, forcing a trade-off between temporal and spatial resolution. Traditionally, increasing the frame rate while maintaining pixel count requires increasing the chip bandwidth, which is expensive. Compressive video approaches seek to break this trade-off by spatio-temporally compressing the video data prior to exporting the bits, effectively encoding more information into the limited bandwidth. While most work in compressive video has focused on redesigning the readout architecture of CMOS chips, we instead propose a compressive video scheme based on optical multiplexing using a diffuser. We demonstrate the concept using a simple lensless camera with an off-the-shelf rolling shutter sensor. Our system effectively encodes 140 frames into a single still image.
Increasing the frame rate of a sensor with fixed bandwidth can be achieved by reading a subset of pixels at each frame. However, when using one-to-one imaging optics (i.e. lenses) that map each scene point to a point on the sensor, information is lost from parts of the sensor that are not sampled. Figure 1(a) illustrates a sensor with a narrow band of pixels actively recording, placed at the image plane of a lens, with a simple scene consisting of two point sources. The cyan source falls outside of the active exposure band and is therefore not measured. To solve this problem, we propose using spatial-multiplexing optics such that even a small subset of sensor pixels (e.g. one row of a 2D array) contains information from most scene points. Our approach consists of replacing the lens with a pseudorandom phase diffuser placed near the sensor, which maps each point to a distributed, high-contrast pattern of caustics on the sensor. As shown in Fig. 1(b), the information from every scene point falls on nearly all sensor pixels, and is therefore present in the band of rows being read. Recovering a video from a sequence of row measurements then requires solving an underdetermined inverse problem. Because the diffuser produces pseudorandom noise-like measurements, we interpret this as a compressive sensing system, reconstructing the video using sparsity-constrained nonlinear optimization.
To implement this idea, we leverage the ubiquity of rolling shutter CMOS sensors. During capture of a single image, rolling shutter sensors expose each row of pixels over a unique time window. This encodes temporal information into the 2D measurement. By randomly multiplexing the scene onto such a sensor, we can recover a video of a dynamic scene wherein each frame corresponds to a row of the rolling shutter capture.
Our experimental prototype recovers 140 frames of video at 4,545 frames per second (fps) from a single 2D rolling shutter capture. The system is built using a dual-shutter sCMOS sensor (Fig. 2). We analyze the spatial and temporal resolution of the system and show that, for sparse scenes, the spatial resolution significantly surpasses that of much more expensive global shutter approaches at comparable frame rates.
To capture high-speed dynamics with conventional sensors, one must overcome the bandwidth limit of digital imaging chips. Compressive video works by spatio-temporally coding the video data prior to capture. Rather than capturing a video, then compressing it to exploit redundancies, compressive video does the compression step in hardware and captures only relevant data. For example, Hitomi et al. proposed a compressive video acquisition scheme that reconstructs a high-speed video from a single image (temporal upsampling at 1000 fps). The approach relied on pixel-wise programmable exposure timing to modulate the recorded image temporally during the acquisition. Reconstruction was performed through a dictionary of space-time signal patches that is learned offline. Experimentally, the approach used a spatial light modulator (SLM) and global shutter sensor, but could theoretically be implemented on-chip in a CMOS architecture. Using strobed exposure with unique sequences, Veeraraghavan et al. reconstructed a high-speed video of periodic events at 2000 fps from a video captured by a camera operating at 25 fps. Another technique, proposed by Llull and Yuan et al., achieved high-speed video reconstruction (22 frames at 660 fps) from a single-shot coded-aperture image that is obtained by translating binary amplitude masks within the focal plane of a global shutter sensor [3, 4]. Koller et al. later improved the mask design, and Liu et al. proposed a reconstruction that exploits the low-rank structure of the underlying scene. The commonality between these setups is that each pixel is temporally modulated during the exposure, and all require bulky and expensive hardware. Our technique, in contrast, uses simple optics and spatial multiplexing rather than temporal modulation.
Rolling shutter can induce undesirable artifacts when imaging dynamic scenes, and removal of such artifacts is an active field of study. Liang et al. characterized and corrected the geometric distortions. Saurer et al. considered extensions for stereo imaging and registration with rolling shutter cameras. For moving cameras, Su and Heidrich proposed an approach to reconstruct a sharp image by simultaneously removing the motion blur and rolling shutter distortions.
Rather than undoing the effects of rolling shutter sensors, we seek to leverage them for performance. Gu et al. have proposed controlling the readout timing and exposure length for each row such that the exposure time discrepancy in subsequent rows enables one to flexibly sample the 3D space-time volume of the dynamic scene. In simulations, their architecture-level proposal was beneficial for computational photography applications such as high dynamic range (HDR) imaging and auto-exposure, but did not successfully resolve video using sparse recovery methods. Oike and El Gamal proposed another architecture that used spatial multiplexing at the chip level, which allowed them to reach a 1920 fps data rate at 256×256 pixel count. Another method uses digital micro-mirror devices (DMDs) for aperture coding and streak cameras with femtosecond speeds to reconstruct ultrafast videos (10 trillion fps) from a single image [11, 12]. Liu et al. considered similar ideas and used a galvanometer to perform streaking (i.e. temporal shearing of the scene). While this concept is similar to ours in spirit, they do not consider spatial multiplexing and they rely on complex, costly hardware. Finally, Sheinin et al. recently used rolling shutter and spatial multiplexing to detect and de-mix the contributions from flickering light bulbs in a scene, providing useful information about the power grid. The authors observed that spatial multiplexing via a diffuser enabled observation of spatio-temporal information, but they do not consider high-speed imaging directly.
Spatially-multiplexed image capture has been a key ingredient for compressive imaging. Using amplitude masks, Asif et al. realized such ideas on a lensless and compact system. Diffuser (i.e. phase mask)-based lensless cameras have been shown to be capable of 2D imaging, and single-shot 3D imaging. Here, we show that diffusers are useful optical elements for compressive video systems, allowing each frame of video to be sampled from a small subset of sensor pixels. Our system can be calibrated from a single image, fabricated using simple lab equipment, and reconstructed using computationally-efficient convolution-based algorithms.
In this section, we outline a forward model for the optics and the rolling shutter exposure, as well as the inverse problem approach. We will use this model to analyze the temporal resolution of the system in Section V.
In general, the exposure at each point on the sensor, $b(x,y)$, can be modeled as a temporal integral,

$$b(x,y) = \int s(x,y,t)\,v(x,y,t)\,\mathrm{d}t,$$
where $v(x,y,t)$ represents the time-varying optical intensity on the sensor, and $s(x,y,t)$ is a 3D indicator, the shutter function, that encodes the temporal exposure window at each position. While our approach could be generalized to different exposure patterns, we focus on rolling shutter due to its ubiquity. Rolling shutter is a column-parallel approach in which each row of pixels exposes for $t_e$ seconds, beginning at a delay, $t_d$, after the previous row began (typically tens of microseconds). Because rolling shutter records row-by-row, we drop the $x$-dependence of the shutter function, denoting it as $s(y,t)$ for the remainder of the paper. At any given instant, a small band of rows is actively recording photons. For a sensor with pixel size $\Delta$, this is depicted in Fig. 3, with red indicating where $s(y,t) = 1$. Our goal is to spatially multiplex scene information into the exposure band at each time point, which enables each band to produce a frame of the final video, achieving frame rates equal to $1/t_d$ fps.
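The row-by-row exposure window described above can be sketched as a discrete shutter mask. The following is a minimal sketch with illustrative parameters (`t_exp`, `t_line`, and the grid sizes are placeholders, not the prototype's actual timing):

```python
import numpy as np

def shutter_mask(n_rows, n_frames, t_exp, t_line):
    """Discrete rolling shutter indicator s[row, frame].

    Row r exposes during the window [r * t_line, r * t_line + t_exp);
    frame k corresponds to the time instant k * t_line.
    """
    start = np.arange(n_rows)[:, None] * t_line   # per-row exposure start times
    t = np.arange(n_frames)[None, :] * t_line     # frame time instants
    return ((t >= start) & (t < start + t_exp)).astype(float)

# Each column of s marks the band of rows active at that instant.
s = shutter_mask(n_rows=8, n_frames=10, t_exp=3.0, t_line=1.0)
```

With an exposure three times the line delay, each column of the mask contains a band of three active rows sweeping down the sensor, matching the red band depicted in Fig. 3.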
In order to achieve the desired multiplexing, we use a simple lensless architecture (see Fig. 4) that employs a diffuser – a pseudorandom phase optic – as a computational imaging element [18, 19]. The system comprises a diffuser placed a short distance from the rolling shutter sensor, with the scene at a larger distance from the diffuser. An aperture placed on the diffuser ensures that the resulting Point Spread Function (PSF) is shift-invariant, and enables simple calibration [18, 19]. Accounting for magnification, the sensor plane intensity can be modeled by convolving the magnified scene intensity, $\bar{v}(x,y,t)$, with $h(x,y)$, the on-axis PSF:

$$v(x,y,t) = \bar{v}(x,y,t) * h(x,y),$$
where $*$ denotes linear convolution over $(x,y)$. The diffuser’s PSF fills nearly the entire sensor with a pseudorandom caustic intensity pattern that is unique for each shift. This high degree of spatial multiplexing is key to how our system works, enabling any horizontal slice of $v(x,y,t)$ to contain information about nearly all positions in the scene.
To solve for the video, we need a discrete forward model. We treat the measurement as a vector of samples taken from the continuous exposure: $b[m,n]$, where $m$ and $n$ index the sensor rows and columns, respectively. This leads to a discretized (magnified) scene, denoted $v[m,n,k]$, on a 3D spatio-temporal grid with lateral spacing $\Delta$. The temporal spacing is $t_d$, as discussed in Section V. This leads to the linear discrete forward model:

$$b[m,n] = \sum_{k=0}^{K-1} s[m,k]\,\big(h * v[\,\cdot\,,\,\cdot\,,k]\big)[m,n],$$

where $*$ represents discrete linear 2D convolution over the spatial dimensions, $s[m,k]$ is the discrete shutter function, and $K$ is the number of recovered frames. Note that for global shutter, this would be a cropped convolution identical to [18, 17], but here we absorb the crop into the definition of $s$. This linear forward model, denoted $b = Av$ in matrix form, is depicted in Fig. 4.
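A minimal sketch of this discrete forward model follows. For brevity it uses circular (FFT) convolution rather than the paper's zero-padded linear convolution with a crop; all array sizes are illustrative:

```python
import numpy as np

def forward(v, h, s):
    """Rolling shutter measurement: b = sum_k s[:, k] * conv2(h, v[:, :, k]).

    v : (M, N, K) space-time scene, h : (M, N) PSF, s : (M, K) shutter mask.
    Circular convolution is used for brevity; the full model uses a
    zero-padded linear convolution followed by a crop.
    """
    H = np.fft.rfft2(h)
    b = np.zeros(h.shape)
    for k in range(v.shape[2]):
        # Convolve frame k with the PSF, then keep only the rows active
        # (per the shutter) while that frame's time window is exposing.
        frame = np.fft.irfft2(H * np.fft.rfft2(v[:, :, k]), s=h.shape)
        b += s[:, k][:, None] * frame
    return b
```

With a delta-function PSF and an all-ones shutter, the model reduces to summing the frames, which gives a quick sanity check on the implementation.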
To recover a video from a single rolling shutter measurement, we must solve an underdetermined linear inverse problem. For a dual-shutter camera such as ours, each symmetric pair of rows in the measurement corresponds to a frame in the reconstruction, so we recover roughly one frame per row pair from a single capture. The diffuser produces pseudorandom noise-like measurements, so our system fits within the framework of compressed sensing (as demonstrated in prior work). Hence we can solve the underdetermined problem for sparsely-represented scenes using $\ell_1$ minimization. We impose a weighted 3D total variation (3DTV) prior on the scene, so the reconstructed video, $\hat{v}$, can be written as the solution to:

$$\hat{v} = \operatorname*{arg\,min}_{v \geq 0}\; \tfrac{1}{2}\,\|b - Av\|_2^2 + \tau\,\|\Psi v\|_1,$$

where $\Psi$ is the matrix of forward finite differences in the $x$, $y$, and $t$ directions. We include an additional tuning parameter that weights the temporal gradient sparsity penalty relative to the spatial dimensions (typically set between 5 and 30). We use FISTA with the weighted anisotropic 3DTV proximal operator, implemented using parallel proximal methods. For computational efficiency, we never instantiate the matrix $A$ explicitly, but instead compute the matrix-vector products $Av$ and $A^{\mathrm{H}}b$
using a combination of zero-padding, FFT-based convolutions, and cropping. Each color channel of the video is processed separately, using the corresponding color from the calibrated PSF. This inherently compensates for much of the chromatic aberration in the system.
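Matched forward/adjoint operator pairs of this kind can be validated with an adjoint (dot-product) test, which checks that $\langle Av, b\rangle = \langle v, A^{\mathrm{H}}b\rangle$. A sketch under a circular-convolution assumption (sizes and random data are illustrative):

```python
import numpy as np

def A(v, h, s):
    # Forward: per-frame 2D circular convolution with PSF h, row-masked by s.
    H = np.fft.fft2(h)
    return sum(s[:, k][:, None] * np.real(np.fft.ifft2(H * np.fft.fft2(v[:, :, k])))
               for k in range(v.shape[2]))

def AT(b, h, s):
    # Adjoint: row-mask each copy of b, then correlate with the PSF
    # (conjugate multiplication in the Fourier domain).
    Hc = np.conj(np.fft.fft2(h))
    return np.stack([np.real(np.fft.ifft2(Hc * np.fft.fft2(s[:, k][:, None] * b)))
                     for k in range(s.shape[1])], axis=2)

rng = np.random.default_rng(0)
M, N, K = 16, 16, 5
h, s = rng.random((M, N)), rng.random((M, K))
v, b = rng.random((M, N, K)), rng.random((M, N))
# <A v, b> should equal <v, A^H b> up to numerical precision
assert np.isclose(np.vdot(A(v, h, s), b), np.vdot(v, AT(b, h, s)))
```

Passing this test for random inputs gives confidence that a gradient-based solver such as FISTA will converge correctly with these operators.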
We built our prototype around a PCO Edge 5.5 sCMOS sensor, set to slow-scan rolling shutter mode. The dual shutter reads simultaneously from the top and bottom of the sensor.
Our homemade diffuser consists of randomly spaced lenslets. Because the lenslets concentrate light into sharp points, random lenslets have been shown to perform well in low-light situations , as is typical with high-speed imaging. Additionally, the uniformly random lateral placement of the lenslets ensures that each scene point produces a unique pattern on the sensor, and contributes a similar amount of light to each exposure band. This is not true near the edge of the sensor, as discussed in Section V-D.
We fabricate our random lenslet diffusers using the molding process outlined in Section V-C. Each lenslet comprising the diffuser has a focal length of approximately 12.5 mm, yielding a rectangular half field-of-view (FoV) that is reasonable for photographic scenes. The system is calibrated using a single image of a white point source placed in the scene. Figure 5 shows a 16-bit color image of the PSF along with its 2D autocorrelation.
To test our system, we captured a variety of dynamic scenes. The raw data is downsampled by a factor of either 4 or 8 to match the expected temporal bandwidth (see Section V-A). Videos are reconstructed on a 3D voxel grid sized to match the chosen downsampling factor. In both cases the video spans roughly 31 milliseconds (140 frames at 220 µs per frame). Two example reconstructions are shown in Fig. 6. The first is a tennis ball dropping into a hand. The second is a green foam dart ricocheting off of an apple placed on a textbook. In both cases, motion is clearly visible with good temporal detail present (see Supplementary Videos). Due to the system geometry, the outer sensor rows are relatively insensitive to the center of the object, degrading the quality of the first 30-40 frames. This is not a fundamental limit of our approach, but rather a consequence of our implementation (see Sec. V-D for more discussion).
In this section, we analyze the temporal behavior of the system, showing that the temporal frequencies are band-limited by the exposure time. This motivates the design choices of our prototype, including the diffuser, exposure time, and use of binning (downsampling).
Next, we analyze the temporal frequency content of the measurements to validate temporal resolution. Intuitively, short exposure times are required to achieve high temporal resolution. We will show that, because our system is only compressive in space, its temporal resolution is Nyquist limited, with an inherent band-limit set by the exposure time, $t_e$, and the sampling rate determined by the line time, $t_d$. To show this, we begin by writing an expression for $s(y,t)$. As depicted in Fig. 3, $s(y,t)$ is a 1D temporal rectangular window of width $t_e$ seconds, offset by $t_d$ seconds per row:

$$s(y,t) = \operatorname{rect}\!\left(\frac{t - n_y t_d}{t_e}\right),$$

where $n_y = y/\Delta$ represents the row index. Substituting this into the continuous model for rolling shutter acquisition, Eq. 1:

$$b(x,y) = \int v(x,y,t)\,\operatorname{rect}_{t_e}\!\big(t - n_y t_d\big)\,\mathrm{d}t,$$

where we define $\operatorname{rect}_{t_e}(t) = \operatorname{rect}(t/t_e)$ for compactness. Upon inspection, we see that this is a 1D convolution in the time dimension between the time-varying intensity at the sensor, $v(x,y,t)$, and a rectangular window of width $t_e$. The result of the convolution is evaluated along the slice of 3D space-time defined by $t = n_y t_d$:

$$b(x,y) = \Big[v(x,y,t) \underset{t}{*} \operatorname{rect}_{t_e}(t)\Big]\bigg|_{t = n_y t_d}.$$
This captures both the temporal band-limiting inherent in the exposure process as well as the mapping from time to row. Next we substitute Eq. 2, the expression for the spatially-multiplexed video, into Eq. 8:

$$b(x,y) = \Big[\big(\bar{v}(x,y,t) \underset{t}{*} \operatorname{rect}_{t_e}(t)\big) \underset{(x,y)}{*} h(x,y)\Big]\bigg|_{t = n_y t_d},$$

where $\underset{(x,y)}{*}$ denotes 2D spatial convolution and the convolutions have been reordered, associating the temporal low-pass filter with the input signal. This shows that, while we are multiplexing in space, the temporal information in the system is band-limited by the pixel exposure time.
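The band-limiting effect of the rectangular exposure window can be checked numerically: its frequency response is sinc-shaped with a first null at $1/t_e$. A sketch with an illustrative 1 ms exposure (not the prototype's value):

```python
import numpy as np

n, dt = 10_000, 1e-6                       # 10 ms of time at 1 us resolution
t_e = 1e-3                                 # illustrative exposure time: 1 ms
rect = (np.arange(n) < round(t_e / dt)).astype(float)  # exposure window

freqs = np.fft.rfftfreq(n, dt)             # frequency axis in Hz
mag = np.abs(np.fft.rfft(rect))            # |sinc|-shaped frequency response

# The first null of the response sits at f = 1/t_e = 1 kHz; temporal
# frequencies near and beyond it are strongly attenuated by the exposure.
first_null = freqs[np.argmax(mag < 1e-6)]
```

Shortening the exposure pushes the null (and hence the usable temporal bandwidth) proportionally higher, which is why short exposures are essential for high temporal resolution.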
Finally, we introduce sampling. As shown in Section III-C, the measured image is generated by sampling on a grid of spacing $\Delta$. Applying this sampling to the arguments of Eq. 9, the time slice becomes $t = m\,t_d$, where $m$ is the row index. In other words, due to the implicit mapping of time to space, the rolling shutter effectively samples time at a rate of $1/t_d$ Hz. Hence we expect to avoid temporal aliasing when $1/t_d$ is at least twice the effective temporal band-limit set by the exposure time, even if the scene contains faster dynamics. This is also why, as discussed in Section III-C, we discretize the video on a temporal grid of spacing $t_d$.
For our sensor, the minimum exposure time is significantly longer than the maximum line time. This would result in significant temporal oversampling, which is computationally wasteful. Thus, in practice, we use a combination of lateral downsampling of the raw data and temporal binning of the reconstruction to maintain inter-frame times of 220 µs (4,545 fps), which better matches the minimum exposure time. Hence we expect to observe dynamics up to the low-kHz range at best. Note that our reconstruction is highly nonlinear, relying heavily on nonnegativity and 3DTV denoising. As a result, this analysis represents only an upper bound on the frequencies we can hope to recover. In practice, measurement noise, calibration error, and regularization reduce performance (see Fig. 7).
As experimental validation of spatial and temporal resolution, we use a linear array of 4 LEDs flashing in unison with variable frequency square waves. We space the LEDs at the minimum separation resolvable by our system, which we establish empirically by varying the spacing until the LEDs are barely resolved in the reconstructions (6 mm separation at the working distance from the diffuser). We use an exposure time of $t_e = 660\ \mu\mathrm{s}$, so 3 rows are exposing in each band. This should result in a maximum resolvable frequency of $1/t_e \approx 1{,}515$ Hz.
This dynamic scene can be expressed as $\bar{v}(x,y,t) = \ell(x,y)\,g(t)$, where $\ell(x,y)$ represents the 2D distribution of LEDs, and $g(t)$ is the modulating waveform. For such an object, the intensity inside the camera body will be $\big[\ell(x,y) * h(x,y)\big]\,g(t)$. Plugging this into Eq. 8, we see that the continuous exposure at the sensor will be

$$b(x,y) = \big[\ell(x,y) * h(x,y)\big]\,\tilde{g}(n_y t_d),$$

where $\tilde{g}(t) = g(t) \underset{t}{*} \operatorname{rect}_{t_e}(t)$. Therefore we expect the measurement to look like the 2D scene convolved with the PSF and modulated in the $y$-direction by the low-pass filtered waveform. Figure 7 shows raw data from our experimental system. Because the 2D scene is 4 point sources in a line, this appears as 4 laterally shifted copies of the PSF, periodically modulated in the $y$-direction, as expected.
While our analysis provides a bound, experimental errors and nonlinear reconstruction can further deteriorate performance. To test how close we get to the limit, we recorded measurements with LED pulse rates varied from 378.78 Hz up to 1,515.15 Hz, the highest frequency predicted by the theory. The results are shown in Fig. 7. On the left is a raw measurement with a 505 Hz pulse rate. A strong envelope is clearly visible, modulating the measurement periodically in the $y$-direction. In the reconstructions, we can clearly resolve all 4 LEDs spatially in all cases. At lower frequencies, the pulses are well resolved in time, with the harmonic structure of the square waves visible in the power spectra. As the period decreases, the temporal contrast reduces, with the shortest period being totally unresolved.
For comparison, recording the same dynamic scene with LEDs pulsing at the highest rate using global shutter would require 30 frames at well over 3,000 fps. Within our system’s fixed sample budget, each frame from the corresponding global shutter system would contain only a small fraction of the sensor’s pixels, a severe degradation in lateral resolution compared to what our compressive scheme achieves experimentally. Hence, at least for sparse scenes, the compressive approach surpasses a direct sampling scheme.
Based on simulations, we found that a diffuser consisting of randomly-spaced lenslets performed better than off-the-shelf diffusers. To fabricate, we repeatedly indent a copper block with a ball bearing of radius 7 mm. The indentations are made at random spacing (by hand) over an area larger than the 14.04 × 16.64 mm size of the PCO Edge 5.5 sensor. The result is a mold that is piecewise spherical with curvature matching the ball bearing. We use this block as a mold for UV-cured epoxy (Norland 61), with a microscope slide on the top surface to ensure flatness. We then cure the epoxy and separate it from the mold. The epoxy has refractive index 1.56, yielding a diffuser with random lenslets of approximate focal length 12.5 mm. We mask the diffuser with a 13 mm rectangular aperture, then mount the diffuser approximately 12.4 mm from the sensor. This results in a small magnification for objects placed at typical photographic distances.
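The lenslet focal length implied by the mold follows from the thin-lens lensmaker's equation for a plano-convex surface, assuming the curved surface matches the ball-bearing radius $R = 7$ mm and the epoxy index $n = 1.56$:

```latex
f \;=\; \frac{R}{n - 1} \;=\; \frac{7\ \text{mm}}{1.56 - 1} \;\approx\; 12.5\ \text{mm}
```

This back-of-the-envelope value is consistent with the target focal length quoted for the fabricated lenslets.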
Given the structured sampling pattern of a rolling shutter sensor, we can reason about the system FoV geometrically. The set of scene points visible to each sensor pixel is determined by projecting rays from the pixel through the aperture. From this simple picture, we see that each pixel has a unique FoV. Because the rolling shutter pattern reads a band of rows simultaneously, this effectively means the FoV is varying with time: early in the exposure, the outer sensor rows are active, and cannot see the center of the FoV, while the inner rows (later frames) can. Because the sensor is blind to the on-axis points early in the exposure, these frames are determined via the regularizer. This explains the wiping artifact present in our videos in the early frames. If we were to use a single-shutter sensor, the effect would be more pronounced, as the FoV would sweep across the scene. This issue could be alleviated by distributing the active pixels more evenly across the sensor plane or by removing the aperture. In the current system, we simply discard the early frames of the video. In future builds, we could remove or enlarge the aperture, though this will preclude single-image calibration, and will lead to our shift-invariant lensless model breaking down at high angles. Such artifacts are correctable , but lead to much slower processing times, and so we leave this for future work.
For our prototype, there are two main limiting factors: the CMOS sensor dynamics and the quality of the optics. Because the sensor’s minimum exposure time limits the maximum usable frame rate, sensors with shorter exposure will perform better. Additionally, to match the line time to the exposure time, we would like to freely adjust the sensor’s line timing; however, our sensor does not allow this. This leads us to use spatial downsampling as a workaround to effectively increase the line time to better match the band-limit.
The second limiting factor is the quality of the diffuser. While our homemade diffusers are sufficient for proof-of-concept work, the resulting optics are fairly low quality, and the process is not well controlled. We can achieve the target focal length, but the focal spots (see Fig. 5) are extremely aberrated. This works well with the downsampling approach, as the caustics are not sharp enough to warrant using the full resolution sensor. However, to push our approach to the limit, we would need optics that can produce multiplexed PSFs with very sharp features. Coupled with a sensor capable of short exposures (on the order of the line time), our proposed architecture could achieve extremely high spatio-temporal resolution. For example, our current sensor can operate with line times as fast as roughly 10 µs, corresponding to over 100,000 fps.
Another limiting factor is the reduced measurement signal-to-noise ratio caused by the multiplexing. Pushing this system to 100,000 fps would require exposure times shorter than 10 µs. Because the light from each point is distributed across the sensor with only a few pixels being recorded in each frame, this would require extremely bright scenes. Additionally, the combination of multiplexing and regularized reconstruction generally reduces the dynamic range of the recovered image, further limiting the method to high contrast scenes.
As with most compressed sensing systems, it is difficult to validate the performance in general, since it is object dependent. We know from prior work that the performance degrades with scene complexity, and we observe this effect. While the method does work for dense scenes, it requires heavier regularization, effectively limiting its usefulness for scenes that do not fit a gradient sparsity prior well. Introducing more sophisticated priors could mitigate this issue.
Our reconstructions are computationally expensive relative to a direct sampling approach. Achieving extremely short exposures and the fastest line time possible would require not downsampling the measurement, leading to a computationally expensive 3D inverse problem at gigavoxel scale.
While we chose a dual-shutter camera for the experimental validation in this work, exploring the use of different programmable exposures could be extremely fruitful. Demonstrating the system with the more commonplace single-shutter CMOS architecture would make it widely accessible, as the only other required equipment is a diffuser. Our current sensor also has a delay far longer than the line time between sequential frames, preventing us from stringing together sequential frames into longer gap-free videos (see Supplementary Video 4). A sensor that streamed continuously could alleviate this. It could also be useful to couple multiplexing optics with randomized sensor read patterns, as this would likely lead to better video recovery.
In conclusion, we have demonstrated that a spatially-multiplexing lensless camera can turn rolling shutter from a detriment into an advantage. We built a proof-of-concept system that resolves kHz-scale dynamics at a frame rate of 4,545 frames per second. We derived a theoretical temporal resolution bound based on our forward model, and confirmed our theoretical predictions with experiment. Our system relies on compressed sensing to solve an extremely underdetermined problem, recovering videos whose space-time bandwidth product far exceeds what could be observed with a direct sampling approach. Finally, we demonstrated our approach on a variety of fast-moving scenes, reliably recovering high speed videos from single rolling shutter images.
The authors would like to thank The Moore Foundation, DARPA, and Bakar Fellows. This material is based upon work supported by the National Science Foundation under Grant No. 1617794. This work has also been supported by an Alfred P. Sloan Foundation fellowship. Emrah Bostan’s research is supported by the Swiss National Science Foundation (SNSF) under grant P2ELP2 172278.