DiffuserCam: Lensless Single-exposure 3D Imaging

10/05/2017 ∙ by Nick Antipa, et al. ∙ 0

We demonstrate a compact and easy-to-build computational camera for single-shot 3D imaging. Our lensless system consists solely of a diffuser placed in front of a standard image sensor. Every point within the volumetric field-of-view projects a unique pseudorandom pattern of caustics on the sensor. By using a physical approximation and simple calibration scheme, we solve the large-scale inverse problem in a computationally efficient way. The caustic patterns enable compressed sensing, which exploits sparsity in the sample to solve for more 3D voxels than pixels on the 2D sensor. Our 3D voxel grid is chosen to match the experimentally measured two-point optical resolution across the field-of-view, resulting in 100 million voxels being reconstructed from a single 1.3 megapixel image. However, the effective resolution varies significantly with scene content. Because this effect is common to a wide range of computational cameras, we provide new theory for analyzing resolution in such systems.



There are no comments yet.


page 1

page 2

page 3

page 4

page 5

page 7

page 8

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Figure 1: DiffuserCam setup and reconstruction pipeline. Our lensless system consists of a diffuser placed in front of a sensor (bumps on the diffuser are exaggerated for illustration). The system encodes a 3D scene into a 2D image on the sensor. A one-time calibration consists of scanning a point source axially while capturing images. Images are reconstructed computationally by solving a nonlinear inverse problem with a sparsity prior. The result is a 3D image reconstructed from a single 2D measurement.

Because optical sensors are 2D, capturing 3D information requires projection onto a 2D sensor in such a way that the 3D data can be recovered. Scanning and multi-shot methods can achieve high spatial resolution, but sacrifice capture speed and require complex hardware. In contrast, existing single-shot methods are fast but have low resolution. Most 3D imagers require bulky hardware, such as large bench-top microscopes. In this work, we introduce a compact and inexpensive single-shot lensless system capable of volumetric imaging, and show that it improves upon the sampling limit of existing single-shot systems by leveraging compressed sensing.

We present DiffuserCam, a lensless imager that uses a diffuser to encode the 3D intensity of volumetric objects into a single 2D image. The diffuser, a thin phase mask, is placed a few millimeters in front of a traditional 2D image sensor. Each point source in 3D space creates a unique pseudorandom caustic pattern that covers a large portion of the sensor. Due to these properties, compressed sensing algorithms can be used to reconstruct more voxels than pixels captured, provided that the 3D sample is sparse in some domain. We recover this 3D information by solving a sparsity-constrained optimization problem, using a physical model and simple calibration scheme to make the computation and calibration scalable. This allows us to reconstruct on a grid of 100 million voxels, several orders of magnitude more than previous work.

We demonstrate a prototype DiffuserCam system built entirely from commodity hardware. It is simple to calibrate, does not require precise alignment during construction and is light efficient (as compared to amplitude masks). We reconstruct 3D objects on a grid of 100 million voxels (non-uniformly spaced) from a single 1.3 megapixel image, which provides high resolution across a large volume. Our reconstructions show true depth sectioning, allowing us to generate 3D renderings of the sample.

Our system uses a nonlinear reconstruction algorithm, which results in object-dependent performance. We quantify this by experimentally measuring the resolution of our prototype using a variety of test targets, and we show that the standard two-point resolution criterion is misleading and should be considered a best-case scenario. Additionally, we propose a new local condition number analysis that explains the variable resolving power and show that the theory is consistent with our experiments.

DiffuserCam uses concepts from lensless camera technology and imaging through complex media, integrated together via computational imaging design principles. This new device could enable high-resolution lensless 3D imaging of large and dynamic 3D samples in an extremely compact package. Such cameras will open up new applications in remote diagnostics, mobile photography and in vivo microscopy.

To review related previous work, we start with lensless cameras for 2D photography, which have shown great promise because of their small form factors. Unlike traditional cameras, in which a point in the scene maps to a pixel on the sensor, lensless cameras map a point in the scene to many points on the sensor, requiring computational reconstruction. A typical lensless architecture replaces the lens with an encoding element placed directly in front of the sensor. Incoherent 2D lensless cameras have been demonstrated using amplitude masks [1], diffractive masks [2, 3], random reflective surfaces [4, 5], and modified microlens arrays [6] as the encoding element. 2D lensless imaging with coherent illumination has been demonstrated in [7, 8], and extended to 3D [9, 10, 11, 12], but these methods require active or coherent illumination. Our system uses a similar lensless architecture but extends both the design and image reconstruction to enable 3D capture without the need for coherent lighting.

Light field cameras, also called integral imagers, passively capture 4D space-angle information in a single-shot [13], which can be used for 3D reconstructions. This concept can be built into a thin form factor with microlens arrays [14] or Fresnel zone plates [15]. Lenslet array-based 3D capture schemes have also been used in microscopy [16], where improved results are obtained when wave-optical effects [17, 18] or in vivo tissue scattering [19, 18] are accounted for. All of these systems must trade resolution or field-of-view for single-shot capture. DiffuserCam, in contrast, uses compressed sensing to capture large 3D volumes at high resolution in a single exposure.

Diffusers are often used as a proxy for general scattering media in the context of developing methods for imaging through scattering [20, 21, 22]. These works have similar mathematical models to our system, but instead of trying to mitigate the effects of unwanted scattering, here we use the diffuser as an optical element in our system design. Therefore, we choose a thin, optically smooth diffuser that refracts pseudorandomly (as opposed to true random scattering). Such diffusers have been shown to produce high contrast patterns under incoherent illumination, enabling light field imaging  [23], and have also been used to record coherent holograms  [10, 24]. Multiple scattering with coherent illumination has been demonstrated as an encoding mechanism for 2D compressed sensing [25], but necessitates an inefficient transmission matrix calibration approach, limiting reconstructions to a few thousand pixels. We achieve similar benefits without needing coherent illumination, and, unlike previous work, we use compressed sensing to add depth information. Finally, our system is designed to enable simpler calibration and more efficient computation, allowing for 3D reconstruction at megavoxel scales with superior image quality.

1.1 System Overview

DiffuserCam is part of the class of mask-based lensless imagers in which a phase or amplitude mask is placed a small distance in front of a sensor, with no main lens. Our mask (the diffuser) is a thin transparent phase object with smoothly varying thickness (see Fig. 1). When illuminated by an incoherent source sufficiently far from the sensor, the convex bumps concentrate light into high-frequency pseudorandom caustic patterns which are captured by the sensor. The caustic patterns, termed Point Spread Functions (PSFs), vary with the 3D position of the source, thereby capturing 3D information.

To illustrate how the caustics encode 3D information, Fig. 2 shows simulations of caustic PSFs as a function of point source location in object space. A lateral shift of the point source causes a lateral translation of the PSF. An axial shift of the point source causes (approximately) a scaling of the PSF. Hence, each 3D position in the volume generates a unique PSF. The resolution of our camera depends on the structure and spatial frequencies present in the caustic patterns. Because the caustics retain high spatial frequencies over a large range of depths, DiffuserCam attains good lateral resolution for objects at any depth within the volumetric field-of-view (FoV).

By assuming that all points in the scene are incoherent with each other, the measurement can be modeled as a linear combination of PSFs from different 3D positions. We represent this as matrix-vector multiplication:


where is a vector representing the 2D sensor measurement and is a vector representing the intensity of the object at every point in the 3D FoV, sampled on a user-chosen grid (discussed in Section 3). is the forward model matrix whose columns consist of each of the caustic patterns created by the corresponding 3D points on the object grid. The number of entries in and the number of rows of are equal to the number of pixels on the image sensor, but the number of columns in is set by the choice of reconstruction grid. Note that this model does not account for partial occlusion of sources.

In order to reconstruct the 3D image, , from the measured 2D image, , we must solve (1) for . However, if we use the full optical resolution of our system (see Sec. 33.2), will exist on a grid that contains more voxels than there are pixels on the sensor. In this case, has many more columns than rows, so the problem is underdetermined and we cannot uniquely recover simply by inverting (1). To remedy this, we rely on sparsity-based computation principles [26]. We exploit the fact that the object can be represented in a domain in which it is sparse (few non-zero elements) and solve the regularized inverse problem:



is a linear transform that sparsifies the 3D object,

is a nonnegativity constraint, and is a tuning parameter. For objects that are sparse in native 3D space, such as fluorescent samples, we choose

to be the identity matrix. For objects that are not natively sparse, but whose gradient is sparse, we choose

to be the finite difference operator, so is the 3D Total Variation (TV) semi-norm [27].

Equation (2) is the basis pursuit problem in compressed sensing [26]. For this optimization procedure to succeed, must have distributed, uncorrelated columns. The diffuser creates caustics that spread across many pixels in a pseudorandom fashion and contain high spatial frequencies. Therefore, any shift or magnification of the caustics leads to a new pattern that is uncorrelated with the original one. As discussed in Sections 22.2 and 22.3, these two properties lead to a matrix that allows us to reconstruct 3D images from the measurement .

Figure 2: Caustic patterns shift with lateral shifts of a point source in the scene and scale with axial shifts. (a) Ray-traced renderings of caustics as the point source moves laterally. For large shifts (far right), part of the pattern is clipped by the sensor. (b) The caustics approximately magnify as the source is brought closer.

2 Methods

2.1 System Architecture

The hardware setup for DiffuserCam consists of a diffuser placed at a fixed distance in front a sensor (see Fig. 3a). The convex bumps on the diffuser surface can be thought of as a set of randomly-spaced microlenses that have statistically varying focal lengths and f-numbers. The f-number determines the minimum feature size of the caustics, which sets our optical resolution. The average focal length determines the distance at which the caustics have highest contrast (the caustic plane), which is where we place the sensor [23].

Our prototype is built using a PCO.edge 5.5 Color camera ( pixels). The diffuser is an off-the-shelf engineered diffuser (Luminit

) with a flat input surface and an output surface that is described statistically as Gaussian lowpass filtered white noise with an average spatial feature size of 140

m and average slope magnitude of (details in Supplementary Fig. S1). This corresponds to an average focal length of approximately mm and average f-number of . Due to the high f-number, the caustics maintain high contrast over a large range of propagation distances. Therefore, the diffuser need not be placed precisely at the caustic plane; in our prototype, mm from our sensor. Additionally, we affix a 5.5 7.5 mm aperture directly on the textured side of the diffuser to limit the support of the caustics.

Similar to a traditional camera, the pixel pitch should Nyquist sample the minimum features of the caustics. The smallest features generated by the caustic patterns are roughly twice the pixel pitch on our sensor, so we perform 2x2 binning on the data, yielding 1.3 megapixel images, before applying our 3D reconstruction algorithm.

2.2 Convolutional Forward Model

Figure 3: Experimentally determined field-of-view (FoV) and resolution. (a) System architecture with design parameters. (b) Angular pixel response of our sensor. We define the angular cutoff () as the angle at which the response falls to 20%. (c) Reconstructed images of two points (captured separately) at varying separations laterally and axially, near the z = 20 mm depth plane. Points are considered resolved if they are separated by a dip of at least 20%. (d) To-scale non-uniform voxel grid for 3D reconstruction, viewed from above. The voxel grid is based on the system geometry and Nyquist-sampled two-point resolution over the entire FoV. For visualization purposes, each box represents 2020 voxels, as shown in red.

Recovering a 3D image requires knowing the system transmission matrix, , which is extremely large. Measuring or storing the full would be impractical, requiring millions of calibration images and operating on multi-Terabyte matrices. The convolution model outlined below drastically reduces complexity of both calibration and computation.

We describe the object, , as a set of point sources located at on a non-Cartesian grid and represent the relative radiant power collected by the aperture from each source as . The caustic pattern at pixel on the sensor due to a unit-powered point source at is the PSF, . Thus, represents the 2D sensor measurement recorded after the light from every point in propagates through the diffuser and onto the sensor. This lets us explicitly write the matrix-vector multiplication by summing over all voxels in the FoV:


The convolution model amounts to a shift invariance (or infinite memory effect [20, 21]) assumption, which greatly simplifies the evaluation of (3). Consider the caustics created by point sources at a fixed distance, , from the diffuser. Because the diffuser surface is slowly varying and smooth, the paraxial approximation holds. This implies that a lateral translation of the source by leads to a lateral shift of the caustics on the sensor by , where is the paraxial magnification. We validate this behavior in both simulations (see Fig. 2) and experiments (see Section 33.4). For notational convenience, we define the on-axis caustic pattern at depth as . Thus, the off-axis caustic pattern is given by . Plugging into (3), the sensor measurement is then given by:


Here, represents 2D discrete convolution over , which returns arrays that are larger than the originals. Hence, we crop to the original sensor size, denoted by the linear operator . For an object discretized into depth slices, the number of columns of is times larger than the number of elements in (i.e. the number of sensor pixels), so our system is underdetermined.

The cropped convolution model provides three benefits. First, it allows us to compute as a linear operator in terms of images, rather than instantiating explicitly (which would require petabytes of memory to store). In practice, we evaluate the sum of 2D cropped convolutions using a single circular 3D convolution, enabling the use of 3D FFTs, which scale well to large array sizes (for details, see Supplementary Sec. 2). Second, it provides us with a theoretical justification of our system’s capability for sparse reconstruction. Derivations in [28] show that translated copies of a random pattern provide close-to-optimal compressed sensing performance.

The third benefit of our convolution model is that it enables simple calibration. Rather than measuring the system response from every voxel (millions of images per depth), we only need to capture a single calibration image of the caustic pattern from an on-axis point source at each depth plane.111While the scaling effect described in Sec. 11.1 suggests that we could use only one image for calibration and scale it to predict PSFs at different depths, we find that there are subtle changes in the caustic structure with depth. Hence, we obtain better results when we calibrate with PSFs measured at each depth. A typical calibration thus consists of capturing images as a single point source is moved axially. This takes minutes, but need only be performed once. The added aperture on DiffuserCam ensures that a point source at the minimum distance generates caustics that just fill the sensor, so that the entire PSF is captured in each image (see Supplementary Fig. S2).

2.3 Inverse Algorithm

Our inverse problem is extremely large in scale, with millions of inputs and outputs. Even with the convolution model described above, using projected gradient techniques is extremely slow due to the time required to compute the proximal operator of 3D total variation [29]. To alleviate this, we use the Alternating Direction Method of Multipliers (ADMM) [30]. We derive a variable splitting that leverages the specific structure of our problem.

Our algorithm uses the fact that can be written as a circular convolution for both the 3D TV and native sparsity cases. Additionally, we factor the forward model in (4) into a diagonal component, , and a 3D convolution matrix, , such that (details in Sec. ). Thus, both the forward operator and the regularizer can be computed in 3D Fouier space. This enables us to use variable-splitting [31, 32, 33] to formulate the constrained counterpart of (2):


where , and are auxiliary variables. We solve (5) by following the commonly-used augmented Lagrangian arguments [34]. Using ADMM, this results in the following scheme at iteration :


Note that is a vectorial soft-thresholding operator with a threshold value of [35], and , and are the Lagrange multipliers associated with , , and , respectively. The scalars , and are penalty parameters which we compute automatically using the tuning strategy in [30].

Although our algorithm involves two large-scale matrix inversions, both can be computed efficiently and in closed form. Since is diagonal, is itself diagonal, requiring complexity to invert using point-wise multiplication. Additionally, all three matrices in

are diagonalized by the 3D discrete Fourier transform (DFT) matrix, so inversion of the entire term can be done using point-wise division in 3D frequency space. Therefore, its inversion has good computational complexity,

, since it is dominated by two 3D FFTs being applied to total voxels. We parallelize our algorithm on the CPU using C++ and Halide [36], a high performance programming language specialized for image processing (A comparison of regularizers and runtimes is shown in Supplementary Sec. 2c).

A typical reconstruction requires at least 200 iterations. Solving for million voxels takes 26 minutes (8 seconds per iteration) on a 144-core workstation and requires 85 Gigabytes of RAM. A smaller reconstruction ( million voxels) takes 3 minutes (1 second per iteration) on a 4-core laptop with 16 Gigabytes of RAM.

3 System Analysis

Unlike traditional cameras, the performance of computational cameras depends on properties of the scene being imaged (e.g. the number of sources). As a consequence, standard two-point resolution metrics may be misleading, as they do not predict resolving power for more complex objects. To address this, we propose a new local condition number metric that better predicts performance. We analyze resolution, FoV and the validity of the convolution model, then combine these analyses to determine the appropriate sampling grid for faithfully reconstructing real-world objects.

3.1 Field-of-View

Figure 4: Our computational camera has object-dependent performance, such that the resolution depends on the number of points. (a) To illustrate, we show here a situation with two points successfully resolved at the two-point resolution limit at a depth of approximately 20 mm. (c) However, when the object consists of more points (16 points in a 44 grid in the plane) at the same spacing, the reconstruction fails. (b,d) Increasing the separation to

gives successful reconstructions. (e,f) A close-up of the raw data shows noticeable splitting of the caustic lines for the 16 point case, making the points distinguishable. Heuristically, the 16 point resolution cutoff is a good indicator of resolution for real-world objects.

At every depth in the volume, the angular half-FoV is determined by the most extreme lateral position that contributes to the measurement. There are two possible limiting factors. The first is the geometric angular cutoff, , set by the aperture size, , the sensor size, , and the distance from the diffuser to the sensor, (see Fig. 3a). Since the diffuser bends light, we also take into account the diffuser’s maximum deflection angle, . This gives a geometric angular half-FoV at every depth of . The second limiting factor is the angular response of the sensor pixels. Real-world sensor pixels may not accept light at the high angles of incidence that our lensless camera accepts, so the sensor angular response (shown in Fig. 3b) may limit the FoV. Defining the angular cutoff of the sensor, , as the angle at which the camera response falls to 20% of its on-axis value, we can write the overall FoV equation as:


Since we image in 3D, we must also consider the axial FoV. In practice, the axial FoV is limited by the range of calibrated depths. However, the system geometry creates bounds on possible calibration locations. Calibration cannot be arbitrarily close to the sensor since the caustics would exceed the sensor size. To account for this, we impose a minimum object distance such that an on-axis point source creates caustics that fill the sensor. Theoretically, our system can capture sources infinitely far away, but the axial resolution degrades with depth. The hyperfocal plane represents the axial distance after which no more axial resolution is available, establishing an upper bound. Objects beyond the hyperfocal focal plane have no depth information, but can be reconstructed to create 2D images for photographic applications [37], without any hardware modifications.

In our prototype, the axial FoV ranges from the minimum calibration distance (7.3 mm) to the hyperfocal plane (2.3 m). The angular FoV is limited by the pixel angular acceptance ( in , in ). Combined with our diffuser’s maximum deflection angle () this yields an angular FoV of in and in . We validate the FoV experimentally by capturing a scene at optical infinity and measuring the angular extent of the result (see Supplementary Fig. S3).

3.2 Resolution

Investigating optical resolution is critical for both quantifying system performance and choosing our reconstruction grid. Although the raw data is collected on a fixed sensor grid, we can choose the non-uniform 3D reconstruction grid arbitrarily. The choice of reconstruction grid is important. When the grid is chosen with voxels that are too large, resolution is lost, and when they are too small, extra computation is performed without resolution gain. In this section we explain how to choose the grid of voxels for our reconstructions, with the aim of Nyquist sampling the two-point optical resolution limit.

3.2.1 Two-point resolution

A common metric for resolution analysis in traditional cameras is two-point distinguishablity. We measure our system’s two-point resolution by imaging scenes containing two point sources at different separation distances, built by summing together images of a single point source (1m pinhole, wavelength ) at two different locations.

We reconstruct the scene using our algorithm, with to remove the influence of the regularizer. To ensure best-case resolution, we use the full 5 MP sensor data (no binning). The point sources are considered distinguishable if the reconstruction has a dip of at least between the sources, as in the Rayleigh criterion. Figure 3c shows reconstructions with point sources separated both laterally and axially.

Our system has highly non-isotropic resolution (Fig. 3d), but we can use our model to predict the two-point distinguishability over the entire volume from localized experiments. Due to the shift invariance assumption, the lateral resolution is constant within a single depth plane and the paraxial magnification causes the lateral resolution to vary linearly with depth. For axial resolution, the main difference between two point sources is the size of their PSF supports. We find pairs of depths such that the difference in their support widths is constant:


Here, and are neighboring depths and is a constant determined experimentally.

Based on this model, we set the voxel spacing in our grid to Nyquist sample the 3D two-point resolution. Figure 3d shows a to-scale map of the resulting voxel grid. Axial resolution degrades with distance until it reaches the hyperfocal plane (2.3 m from the camera), beyond which no depth information is recoverable. Due to the non-telecentric nature of the system, the voxel sizes are a function of depth, with the densest sampling occurring close to the camera. Objects within 5 cm of the camera can be reconstructed with somewhat isotropic resolution; this is where we place objects in practice.

3.2.2 Multi-point resolution

In a traditional camera, resolution is a function of the system and is independent of the scene. In contrast, computational cameras that use nonlinear reconstruction algorithms may incur degradation of the effective resolution as the scene complexity increases. To demonstrate this in our system, we consider a more complex scene consisting of 16 point sources. Figure 4 shows experiments using 16 point sources arranged in a 44 grid in the plane at two different spacings. The first spacing is set to match the measured two-point resolution limit (=45m, =336m). Despite being able to separate two points at this spacing, we cannot resolve all 16 sources. However, if we increase the source separation to (=75m, =448m), all 16 points are distinguishable (Fig. 4d). In this example, the usable lateral resolution of the system degrades by approximately due to the increased scene complexity. As we show in Section 33.3, the resolution loss does not become arbitrarily worse as the scene complexity increases.

This experiment demonstrates that existing resolution metrics cannot be blindly used to determine performance of computational cameras like ours. How can we then analyze resolution if it depends on object properties? In the next section, we introduce a general theoretical framework for assessing resolution in computational cameras.

3.3 Local condition number theory

Our goal is to provide new theory that describes how the effective reconstruction resolution of computational cameras changes with object complexity. To do so, we introduce a numerical analysis of our forward model to determine how well it can be inverted.

First, note that recovering the image from the measurement

comprises simultaneous estimation of the locations of all nonzeros within our image reconstruction,

, as well as the values at each nonzero location. To simplify the problem, suppose an oracle tells us the exact locations of every source within the 3D scene. This corresponds to knowing a priori the support of , so we then need only determine the values of the nonzero elements in . This can be accomplished by solving a least squares problem using a sub-matrix consisting of only the columns of that correspond to the indices of the nonzero voxels. If this problem fails, then the more difficult problem of simultaneously determining the nonzero locations and their values will certainly fail.

In practice, the measurement is corrupted by noise. The maximal effect this noise can have on the least-squares estimate of the nonzero values is determined by the condition number of the sub-matrix described above. We therefore say that the reconstruction problem is ill-posed if any sub-matrices of are very ill-conditioned. In practice, ill-conditioned matrices result in increased noise sensitivity and longer reconstruction times, as more iterations are needed to converge to a solution.

The worst-case scenario in our system is when multiple sources are in a contiguous block, since nearby measurements are always most similar. Therefore, we compute the condition number of sub-matrices of corresponding to a group of point sources with separation varying by integer numbers of voxels. We repeat this calculation for different numbers of sources. The results are shown in Fig. 5. As expected, the conditioning is worse when sources are closer together. In this case, increased noise sensitivity means that even small amounts of noise could prevent us from resolving the sources. This trend matches what we saw experimentally in Figs. 3 and 4.

Figure 5 shows that the local condition number increases with the number of sources in the scene, as expected. This means that resolution will degrade as more and more sources are added. We can see in Fig. 5 that, as the number of sources increases linearly, the conditioning approaches a limiting case. Hence, the resolution does not become arbitrarily worse with increased number of sources. Therefore we can estimate the system resolution for complex objects from distiguishability measurements of a limited number of point sources. This is experimentally validated in Section 4, where we find that the experimental 16-point resolution is a good predictor of the resolution for a USAF target.

Unlike traditional resolution metrics, our new local condition number theory explains the resolution loss we observe experimentally. We believe that it is sufficiently general to be applicable to other computational cameras, which likely exhibit similar performance loss.

Figure 5: Our local condition number theory shows how resolution varies with object complexity. (a) Virtual point sources are simulated on a fixed grid and moved by integer numbers of voxels to change the separation distance. (b) Local condition numbers are plotted for sub-matrices corresponding to grids of neighboring point sources with varying separation (at depth 20 mm from the sensor). As the number of sources increases, the condition number approaches a limit, indicating that resolution for complex objects can be approximated by a limited number (but more than two) sources.

3.4 Validity of the Convolution Model

Figure 6: Experimental validation of the convolution model. (a)-(c) Close-ups of registered experimental PSFs for sources at , and . The PSF at is visually similar to that on-axis, while the PSF at has subtle differences. (d) Inner product between the on-axis PSF and registered off-axis PSFs as a function of source position. (e) Resulting spot size (normalized by on-axis spot). The convolution model holds well up to , beyond which resolution degrades (solid). Exhaustive calibration would improve the resolution (dashed), at the expense of complexity in computation and calibration.

In Section 22.2, we modeled the caustic pattern as shift invariant at every depth, leading to simple calibration and efficient computation. Since our convolution model is an approximation, we wish to quantify its applicability. Figure 6a-c shows registered close-ups of experimentally measured PSFs from plane waves incident at , and . The convolution model assumes that these registered PSFs are all exactly the same, though, in reality, they have subtle differences. To quantify the similarity across the FoV, we plot the inner product between each off-axis PSF and the on-axis PSF (see Fig. 6d). The inner product is greater than 75% across the entire FoV and particularly good within of the optical axis, indicating that the convolution model holds relatively well.

To investigate how the spatial variance of the PSF impacts system performance, we use the peak width of the cross-correlation between the on-axis and off-axis PSFs to approximate the spot size off-axis. Figure 

6e (solid) shows that we retain the on-axis resolution up to . Beyond that, the resolution gradually degrades. To avoid model mismatch, one could replace the convolution model with exhaustive calibration over all positions in the FoV. This procedure would yield higher resolution at the edges of the FoV, as shown by the dashed line in Fig. 6e. The gap between these lines is what we sacrifice in resolution by using the convolution model. However, in return, we gain simplified calibration and efficient computation, which makes the large-scale problem feasible.

4 Experimental Results

Figure 7: Experimental 3D reconstructions. (a) Tilted resolution target, which was reconstructed on a 4.2 MP lateral grid with 128 -planes and cropped to 64064050 voxels. The large panel shows the max projection over . Note that the spatial scale is not isotropic. Inset is a magnification of group 2 with an intensity cutline, showing that we resolve element 5 at a distance of 24 mm, which corresponds to a feature size of 79 (approximately twice the lateral voxel size of at this depth). The degraded resolution matches our 16-point distinguishability (75 at 20 mm depth). Lower panels show depth slices from the recovered volume. (b) Reconstruction of a small plant, cropped to 480320128 voxels, rendered from multiple angles.

Images of two objects are presented in Fig. 7. Both were illuminated using white LEDs and reconstructed with a 3D TV regularizer. We choose a reconstruction grid that approximately Nyquist samples the two-point resolution (by binning the sensor pixels to yield a 1.3 megapixel measurement). Calibration images are taken at 128 different -planes, ranging from to (from the diffuser), with spacing set according to conditions outlined in Sec. 33.2. The 3D images are reconstructed on a 20482048128 grid, but the angular FoV restricts the usable portion of this grid to the center 100 million voxels. Note that the resolvable feature size on this reconstruction grid can still vary based on object complexity.

The first object is a negative USAF 1951 fluorescence test target, tilted about the -axis (Fig. 7a). Slices of the reconstructed volume at different planes are shown to highlight the system’s depth sectioning capabilities. As described in Sec. 33.2, the spatial scale changes with depth. Analyzing the resolution in the vertical direction (Fig. 7a inset), we can easily resolve group 2 element 4 and barely resolve group 2 element 5 at mm. This corresponds to resolving features m apart on the resolution target. This resolution is significantly worse than the two-point resolution at this depth (m), but similar to the 16-point resolution (m). Hence, we reinforce our claim that two-point resolution is a misleading metric for computational cameras, but multi-point distinguishability can be extended to more complex objects.

Finally, we demonstrate the ability of DiffuserCam to image natural objects by reconstructing a small plant (Fig. 7b). Multiple angles are rendered to demonstrate the ability to capture the 3D structure of the leaves.

5 Conclusion

We demonstrated a simple optical system, with only a diffuser in front of a sensor, that is capable of single-shot 3D imaging. The diffuser encodes the 3D location of point sources in caustic patterns, which allow us to apply compressed sensing to reconstruct more voxels than we have measurements. By using a convolution model that assumes that the caustic pattern is shift invariant at every depth, we developed an efficient ADMM algorithm for image recovery and simple calibration scheme. We characterized the FoV and two-point resolution of our system, and showed how resolution varies with object complexity. This motivated the introduction of a new condition number analysis, which we used to analyze how inverse problem conditioning changes with object complexity.


This work was funded by the Gordon and Betty Moore Foundation Grant GBMF4562 and a NSF CAREER grant to L.W. B.M. acknowledges funding from the Hertz Foundation and G. K. is a National Defense Science and Engineering Graduate Fellow. R.H. and E.B. acknowledge funding from the Swiss NSF (grants P2EZP2 159065 and P2ELP2 172278, respectively). This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA), Contract No. N66001-17-C-4015. The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. The authors thank Dr. Eric Jonas and the Rice FlatCam team for helpful discussions.


  • [1] M. S. Asif, A. Ayremlou, A. C. Sankaranarayanan, A. Veeraraghavan, and R. G. Baraniuk, “Flatcam: Thin, bare-sensor cameras using coded aperture and computation,” CoRR abs/1509.00116 (2015).
  • [2] D. G. Stork and P. R. Gill, “Optical, mathematical, and computational foundations of lensless ultra-miniature diffractive imagers and sensors,” International Journal on Advances in Systems and Measurements 7, 4 (2014).
  • [3] P. R. Gill, J. Tringali, A. Schneider, S. Kabir, D. G. Stork, E. Erickson, and M. Kellam, “Thermal escher sensors: Pixel-efficient lensless imagers based on tiled optics,” in “Computational Optical Sensing and Imaging,” (Optical Society of America, 2017), pp. CTu3B–3.
  • [4] R. Fergus, A. Torralba, and W. T. Freeman, “Random Lens Imaging,” Tech. rep., Massachusetts Institute of Technology (2006).
  • [5]

    A. Stylianou and R. Pless, “Sparklegeometry: Glitter imaging for 3d point tracking,” in “Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops,” (2016), pp. 10–17.

  • [6] J. Tanida, T. Kumagai, K. Yamada, S. Miyatake, K. Ishida, T. Morimoto, N. Kondou, D. Miyazaki, and Y. Ichioka, “Thin observation module by bound optics (tombo): concept and experimental verification,” Applied Optics 40, 1806–1813 (2001).
  • [7] W. Harm, C. Roider, A. Jesacher, S. Bernet, and M. Ritsch-Marte, “Lensless imaging through thin diffusive media,” Optics Express 22, 22146–22156 (2014).
  • [8] W. Chi and N. George, “Optical imaging with phase-coded aperture,” Optics express 19, 4294–4300 (2011).
  • [9] D. J. Brady, K. Choi, D. L. Marks, R. Horisaki, and S. Lim, “Compressive holography,” Opt. Express 17, 13040–13049 (2009).
  • [10] K. Lee and Y. Park, “Exploiting the speckle-correlation scattering matrix for a compact reference-free holographic image sensor,” Nature Communications 7 (2016).
  • [11]

    W. Bishara, T.-W. Su, A. F. Coskun, and A. Ozcan, “Lensfree on-chip microscopy over a wide field-of-view using pixel super-resolution,” Optics Express

    18, 11181–11191 (2010).
  • [12] H. Faulkner and J. Rodenburg, “Movable aperture lensless transmission microscopy: a novel phase retrieval algorithm,” Physical Review Letters 93, 023903 (2004).
  • [13] R. Ng, M. Levoy, M. Bredif, G. Duval†, M. Horowitz, and P. Hanrahan, “Light field photography with a hand-held plenoptic camera,” Stanford University Computer Science Tech Report pp. 3418–3421 (2005).
  • [14] R. Horisaki, S. Irie, Y. Ogura, and J. Tanida, “Three-dimensional information acquisition using a compound imaging system,” Optical Review 14, 347–350 (2007).
  • [15] K.Tajima, T. Shimano, Y. Nakamura, M. Sao, and T. Hoshizawa, “Lensless light-field imaging with multi-phased fresnel zone aperture,” in “2017 IEEE International Conference on Computational Photography (ICCP),” (2017), pp. 76–82.
  • [16] M. Levoy, R. Ng, A. Adams, M. Footer, and M. Horowitz, “Light Field Microscopy,” ACM Trans. Graph. (Proc. SIGGRAPH) 25 (2006).
  • [17] M. Broxton, L. Grosenick, S. Yang, N. Cohen, A. Andalman, K. Deisseroth, and M. Levoy, “Wave optics theory and 3-D deconvolution for the light field microscope,” Optics Express 21, 25418–25439 (2013).
  • [18] H.-Y. Liu, E. Jonas, L. Tian, J. Zhong, B. Recht, and L. Waller, “3d imaging in volumetric scattering media using phase-space measurements,” Opt. Express 23, 14461–14471 (2015).
  • [19] N. C. Pégard, H.-Y. Liu, N. Antipa, M. Gerlock, H. Adesnik, and L. Waller, “Compressive light-field microscopy for 3d neural activity recording,” Optica 3, 517–524 (2016).
  • [20] O. Katz, P. Heidmann, M. Fink, and S. Gigan, “Non-invasive single-shot imaging through scattering layers and around corners via speckle correlations,” Nature Photonics 8, 784–790 (2014).
  • [21] E. Edrei and G. Scarcelli, “Memory-effect based deconvolution microscopy for super-resolution imaging through scattering media,” Scientific Reports 6 (2016).
  • [22] A. K. Singh, D. N. Naik, G. Pedrini, M. Takeda, and W. Osten, “Looking through a diffuser and around an opaque surface: A holographic approach,” Optics Express 22, 7694–7701 (2014).
  • [23] N. Antipa, S. Necula, R. Ng, and L. Waller, “Single-shot diffuser-encoded light field imaging,” in “2016 IEEE International Conference on Computational Photography (ICCP),” (2016), pp. 1–11.
  • [24] Y. Kashter, A. Vijayakumar, and J. Rosen, “Resolving images by blurring: superresolution method with a scattering mask between the observed objects and the hologram recorder,” Optica 4, 932–939 (2017).
  • [25] A. Liutkus, D. Martina, S. Popoff, G. Chardon, O. Katz, G. Lerosey, S. Gigan, L. Daudet, and I. Carron, “Imaging with nature: Compressive imaging using a multiply scattering medium,” Scientific reports 4 (2014).
  • [26] E. J. Candès and M. B. Wakin, “An introduction to compressive sampling,” IEEE Signal Processing Magazine 25, 21–30 (2008).
  • [27] L. I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Physica D: Nonlinear Phenomena 60, 259–268 (1992).
  • [28] F. Krahmer, S. Mendelson, and H. Rauhut, “Suprema of chaos processes and the restricted isometry property,” Commun. Pur. Appl. Math. 67, 1877–1904 (2014).
  • [29] A. Beck and M. Teboulle, “Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems,” IEEE Transactions on Image Processing 18, 2419–2434 (2009).
  • [30]

    S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends in Machine Learning

    3, 1–122 (2011).
  • [31] M. S. C. Almeida and M. Figueiredo, “Deconvolving images with unknown boundaries using the alternating direction method of multipliers,” IEEE Transactions on Image processing 22, 3074–3086 (2013).
  • [32] A. Matakos, S. Ramani, and J. A. Fessler, “Accelerated edge-preserving image restoration without boundary artifacts,” IEEE Transactions on Image Processing 22, 2019–2029 (2013).
  • [33] M. V. Afonso, J. M. Bioucas-Dias, and M. A. T. Figueiredo, “Fast image recovery using variable splitting and constrained optimization,” IEEE Transactions on Image Processing 19, 2345–2356 (2010).
  • [34] J. Nocedal and S. J. Wright, Numerical Optimization (Springer, 2006).
  • [35] Y. Wang, J. Yang, W. Yin, and Y. Zhang, “A new alternating minimization algorithm for total variation image reconstruction,” SIAM Journal on Imaging Sciences 1, 248–272 (2008).
  • [36] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe, “Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines,” ACM SIGPLAN Notices 48, 519–530 (2013).
  • [37] G. Kuo, N. Antipa, R. Ng, and L. Waller, “Diffusercam: Diffuser-based lensless cameras,” in “Computational Optical Sensing and Imaging,” (Optical Society of America, 2017), pp. CTu3B–2.