The human eye automatically changes the focus of its lens to provide sharp, in-focus images of objects at different depths. While convenient in the real world, for virtual or augmented reality (VR/AR) applications, this focusing capability of the eye often causes a problem that is called the vergence-accommodation conflict (VAC) [Kramida, 2016; Hua, 2017]. Vergence refers to the simultaneous movement of the two eyes so that a scene point comes into the center of the field of view, and accommodation refers to the changing of the focus of the ocular lenses to bring the object into focus. In the real world, these two cues act in synchrony. However, most commercial VR/AR displays render scenes by only satisfying the vergence cue, i.e., they manipulate the disparity of the images shown to each eye. But given that the display is at a fixed distance from the eyes, the corresponding accommodation cues are invariably incorrect, leading to a conflict between vergence and accommodation that can cause discomfort, fatigue, and distorted 3D perception, especially after long durations of usage [Hoffman et al., 2008; Watt et al., 2005; Vishwanath and Blaser, 2010; Zannoli et al., 2016]. While many approaches have been proposed to mitigate the VAC, it remains one of the important challenges for VR and AR displays.
In this paper, we provide the design for a VR display that is capable of addressing the VAC by displaying content on a dense collection of depth or focal planes. The proposed display falls under the category of multifocal displays, i.e., displays that generate content at different focal planes using a focus-tunable lens [Liu et al., 2008; Liu and Hua, 2009; Love et al., 2009; Llull et al., 2015; Johnson et al., 2016; Konrad et al., 2016]. This change in focal length can be implemented in one of many ways; for example, by changing the curvature of a liquid lens [Optotune, 2017; Varioptic, 2017], the state of a liquid-crystal lens [Jamali et al., 2018b, a], the polarization of a waveplate lens [Tabiryan et al., 2015], or the relative orientation between two carefully designed phase plates [Bernet and Ritsch-Marte, 2008]. The key distinguishing factor is that the proposed device displays a stack of focal planes that are an order of magnitude greater in number as compared to prior work, without any loss in the frame rate of the display. Specifically, our prototype system is capable of displaying 1600 focal planes per second, which can be used to display scenes with 40 focal planes per frame at 40 frames per second. As a consequence, we are able to render virtual worlds at a realism that is hard to achieve with current multifocal display designs.
To understand how our system can display thousands of focal planes per second, it is worth pointing out that the key factor that limits the depth resolution of a multifocal display is the operational speed of its focus-tunable lens. Focus-tunable liquid lenses change their focal length based on an input driving voltage; they typically require around s to settle onto a particular focal length. Hence, in order to wait for the lens to settle so that the displayed image is rendered at the desired depth, we can output at most focal planes per second. For a display operating with 30-60 frames per second (fps), this would imply anywhere between three and six focal planes per frame, which is woefully inadequate.
The proposed display relies on the observation that, while focus-tunable lenses have long settling times, their frequency response is rather broad and has a cut-off upwards of Hz [Optotune, 2017]. This suggests that we can drive the lens with excitations that are radically different from a simple step edge (i.e., a change in voltage). For example, we could make the lens sweep through its entire gamut of focal lengths at a high frequency simply by exciting it with a sinusoid or a triangular voltage of the desired frequency. If we can subsequently track the focal length of the lens in real-time, we can accurately display focal planes at any depth without waiting for the lens to settle. In other words, by driving the focus-tunable lens to periodically sweep the desired range of focal lengths and tracking the focal length at high-speed and in real-time, we can display numerous focal planes.
This paper proposes the design of a novel multifocal display that produces three-dimensional scenes by displaying dense focal stacks. In this context, we make the following contributions:
High-speed focal-length tracking. The core contribution of this paper is a system for real-time tracking of the focal length of a focus-tunable lens at microsecond-scale resolutions. We achieve this by measuring the deflection of a laser incident on the lens.
Design space analysis. Displaying a dense set of focal planes is also necessary for mitigating the loss of spatial resolution due to the defocus blur caused by the ocular lens. To show this, we analytically derive the spatial resolution of the image formed on the retina when there is a mismatch between the focus of the eye and the depth at which the content is virtually rendered. This analysis justifies the need for AR/VR displays capable of a high focal-plane density.
Prototype. Finally, we build a proof-of-concept prototype that is able to produce 40 8-bit focal planes per frame at 40 fps. This corresponds to 1600 focal planes per second, a capability that is an order of magnitude greater than competing approaches.
In addition to limitations endemic to multifocal displays, the proposed approach has the following limitations:
Need for additional optics. The proposed focal-length tracking device requires additional optics that increase its bulk.
Peak brightness. Displaying a large number of focal planes per frame leads to a commensurate decrease in peak brightness of the display since each depth plane is illuminated for a smaller fraction of time. This is largely not a concern for VR displays, and can potentially be alleviated with techniques that redistribute light [Damberg et al., 2016].
Limitations of our prototype. Our current proof-of-concept prototype uses a digital micromirror device (DMD) and, as a consequence, has low energy efficiency. This could be addressed by switching to energy-efficient displays such as OLED panels, laser-scanning projectors, or displays that redistribute light to achieve higher peak brightness and contrast.
2. Related Work
A typical VR display is composed of a convex eyepiece and a display unit. As shown in Figure 2a, the display is placed within the focal length of the convex lens in order to create a magnified virtual image. The distance of the virtual image can be calculated by the thin lens formula:
where is the distance between the display and the lens, and is the focal length. We can see that is an affine function of the optical power () of the lens and the term . By choosing and , the designer can put the virtual image of the display at the desired depth. However, for many applications, most scenes need to be rendered across a wide range of depths. Due to the fixed focal plane, these displays do not provide natural accommodation cues.
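The relationship between display placement, lens power, and virtual-image depth can be sketched numerically. The snippet below is a minimal illustration that assumes the magnifier form of the thin-lens formula, 1/v = 1/d_o - 1/f, with the display inside the focal length; the function and parameter names are hypothetical and are not the notation used in this paper.

```python
import math


def virtual_image_distance(display_dist_m: float, lens_power_d: float) -> float:
    """Distance (in meters) of the magnified virtual image from a thin lens.

    Assumption: magnifier form of the thin-lens formula,
    1/v = 1/d_o - 1/f, with the display inside the focal length.
    Returns math.inf when the display sits at the focal plane.
    """
    vergence = 1.0 / display_dist_m - lens_power_d  # in diopters
    if vergence < -1e-12:
        raise ValueError("display lies outside the focal length; the image is real")
    if abs(vergence) <= 1e-12:
        return math.inf
    return 1.0 / vergence


# Hypothetical numbers: a display 5 cm from the lens.
far_plane = virtual_image_distance(0.05, 20.0)   # display at the focal plane
near_plane = virtual_image_distance(0.05, 16.0)  # 4 diopters weaker lens
```

Under these assumed values, lowering the lens power by 4 diopters moves the virtual image from infinity to 25 cm, which illustrates why the image vergence is affine in the optical power.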
2.1. Accommodation-Supporting Displays
There have been many designs proposed to provide accommodation support. We concentrate on techniques most relevant to the proposed method, deferring a detailed description to [Kramida, 2016] and [Hua, 2017]; in particular, see Table 1 of [Matsuda et al., 2017].
2.1.1. Multifocal and Varifocal Displays
Multifocal and varifocal displays control the depths of the focal planes by dynamically adjusting or in (1). Multifocal displays aim to produce multiple focal planes at different depths for each frame (Figure 2b), whereas varifocal displays support only one focal plane per frame whose depth is dynamically adjusted based on the gaze of the user’s eyes (Figure 2c). Multifocal and varifocal displays can be designed in many ways, including the use of multiple (transparent) displays placed at different depths [Rolland et al., 1999; Akeley et al., 2004; Love et al., 2009], a translation stage to physically move a display or optics [Shiwa et al., 1996; Sugihara and Miyasato, 1998; Akşit et al., 2017], deformable mirrors [Hu and Hua, 2014], as well as a focus-tunable lens to optically reposition a fixed display [Liu et al., 2008; Padmanaban et al., 2017; Johnson et al., 2016; Konrad et al., 2016; Lee et al., 2018]. Varifocal displays show a single focal plane at any point in time, but they require precise, low-latency eye/gaze tracking. Multifocal displays, on the other hand, have largely been limited to displaying a few focal planes per frame due to the limited switching speed of translation stages and focus-tunable lenses. Concurrent with our work, Lee et al. [2018] propose a multifocal display that can also display dense focal stacks with a focus-tunable lens. However, their method can only display any given pixel at a single depth. This prohibits the use of rendering techniques [Akeley et al., 2004; Narain et al., 2015; Mercier et al., 2017] that require a pixel to be potentially displayed at many depths with different contents.
2.1.2. Light Field Displays
While multifocal and varifocal displays produce a collection of focal planes, light field displays aim to synthesize the light field of a 3D scene. Lanman and Luebke  introduce angular information by replacing the eyepiece with a microlens array; Huang et al.  utilize multiple spatial light modulators to modulate the intensity of light rays. While these displays fully support accommodation cues and produce natural defocus blur and parallax, they usually suffer from poor spatial resolution due to the space-angle resolution trade-off.
2.1.3. Other Types of Virtual Reality Displays
Other types of VR/AR displays have been proposed to solve the VAC. Matsuda et al.  use a phase-only spatial light modulator to create spatially-varying lensing based on the virtual content and the gaze of the user. Maimone et al.  utilize a phase-only spatial light modulator to create a 3D scene using holography. Similar to our work, Konrad et al.  operate a focus-tunable lens in an oscillatory mode. Here, they use the focus-tunable lens to create a depth-invariant blur by using a concept proposed for extended depth of field imaging [Miau et al., 2013]. Intuitively, since the content is displayed at all focal planes, the VAC is significantly resolved. However, there is a loss of spatial resolution due to the intentionally introduced defocus blur.
2.2. Depth-Filtering Methods
When virtual scenes are rendered with few focal planes, there are associated aliasing artifacts as well as a reduction of spatial resolution on content that is to be rendered in between focal planes. Akeley et al. [2004] show that such artifacts can be alleviated using linear depth filtering, a method that is known to be quite effective [MacKenzie et al., 2010; Ravikumar et al., 2011]. However, linear depth filtering produces artifacts near object boundaries due to the inability of multifocal displays to occlude light. To produce proper occlusion cues with multifocal displays, Narain et al. [2015] propose a method that jointly optimizes the contents shown on all focal planes. By modeling the defocus blur of focal planes when an eye is focused at certain depths, they formulate a non-negative least-squares problem that minimizes the mean-squared error between perceived images and target images at multiple depths. While this algorithm demonstrates promising results, the computational costs of the optimization are often too high for real-time applications. Mercier et al. [2017] simplify the forward model of Narain et al. [2015] and significantly speed up the solution of the optimization problem. These filtering approaches are largely complementary to the proposed work, in that they can be incorporated into the dense focal stacks produced by our proposed display.
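As a rough illustration of linear depth filtering, the sketch below splits the intensity of a pixel between the two focal planes that bracket its depth, with weights that vary linearly in diopters. This is a simplified reading of the idea rather than the exact formulation of the cited work; the function name and the clamping behavior outside the focal stack are assumptions.

```python
def linear_depth_weights(target_d, plane_ds):
    """Blend weights for one pixel across a stack of focal planes.

    target_d: depth of the pixel in diopters.
    plane_ds: focal-plane depths in diopters, sorted in ascending order.
    Returns one weight per plane; the weights sum to 1.
    Assumption: content outside the stack is clamped to the nearest plane.
    """
    weights = [0.0] * len(plane_ds)
    if target_d <= plane_ds[0]:
        weights[0] = 1.0
        return weights
    if target_d >= plane_ds[-1]:
        weights[-1] = 1.0
        return weights
    for i in range(len(plane_ds) - 1):
        lo, hi = plane_ds[i], plane_ds[i + 1]
        if lo <= target_d <= hi:
            w_hi = (target_d - lo) / (hi - lo)  # linear in diopters
            weights[i] = 1.0 - w_hi
            weights[i + 1] = w_hi
            return weights
    return weights
```

With a dense focal stack the bracketing planes are close together, so the blend spans only a small dioptric interval.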
3. How Many Focal Planes Do We Need?
A key factor underlying the design of multifocal displays is the number of focal planes required to support a target accommodation range. In order to be indistinguishable from the real world, a virtual world should enable human eyes to accommodate freely on arbitrary depths. In addition, the virtual world should have high spatial resolution anywhere within the target accommodation range. Simultaneously satisfying these two criteria for a large accommodation range is very challenging, since it requires generating light fields of high spatial and angular resolution. In the following, we will show that displaying a dense focal stack is a promising step toward the ultimate goal of generating virtual worlds that can handle the accommodation cues of the human eye.
To understand the capability of a multifocal display, we can analyze its generated light field in the frequency domain. Our analysis, following the derivation in Wetzstein et al.  and Narain et al. , provides an upper-bound on the performance of a multifocal display, regardless of the depth filtering algorithm applied. It is also similar to that of Sun et al.  with the key difference that we focus on the minimum number of focal planes required to retain spatial resolution within an accommodation range, as opposed to efficient rendering of foveated light fields.
3.1. Light-Field Parameterization and Assumptions
For simplicity, our analysis considers a flatland with two-dimensional light fields. In the flatland, the direction of a light ray is parameterized by its intercepts with two parallel axes, and , which are separated by unit, and the origin of the -axis is relative to each individual value of such that measures the tangent angle of a ray passing through , as shown in Figure 3a. We model the human eye with a camera composed of a finite-aperture lens and a sensor plane away from the lens, following the assumptions made in Mercier et al.  and Sun et al. . We assume that the pupil of the eye is located at the center of the focus-tunable lens and is smaller than the aperture of the tunable lens. We assume that the display emits light isotropically and that the sensor receives light isotropically. In other words, each pixel on the display uniformly emits light rays toward every direction and vice versa for the sensor. We also assume small-angle (paraxial) scenarios, since the distance and the focal length of the tunable lens (or essentially, the depths of focal planes) are large compared to the diameter of the pupil. This assumption simplifies our analysis by allowing us to consider each pixel in isolation.
3.2. Light Field Generated by the Display
Since the display is assumed to emit light isotropically in angle, the light field created by a display pixel can be modeled as , where is the radiance emitted by the pixel, represents two-dimensional convolution, and is the pitch of the display pixel. The Fourier transform of is , which lies on the axis, as shown in Figure 3b. We only plot the central lobe of corresponding to , since this is sufficient for calculation of the half-maximum bandwidth of retinal images. In the following, we omit the constant for brevity.
3.3. Propagation from Display to Retina
Let us decompose the optical path from the display to the retina (sensor) and examine its effects in the frequency domain. After leaving the display, the light field propagates a distance , gets refracted by the tunable lens, and by the lens of the eye where it is partially blocked by the pupil, whose diameter is , and propagates a distance to the retina where it finally gets integrated across angle. Propagation and refraction shear the spectrum of the light field along and , respectively, as shown in Figure 3(c,d,e). Before entering the pupil, the focal plane at depth forms a segment of slope within , where is due to the magnification of the lens. For brevity, we show only the final (and most important) step and defer the full derivation to the appendix.
Suppose the eye focuses at depth , and the focus-tunable lens configuration creates a focal plane at . The Fourier transform of the light field reaching the retina is
where represents two-dimensional cross correlation, is the Fourier transform of the light field from the focal plane at reaching the retina without aperture (Figure 3f), and is the Fourier transform of the aperture function propagated to the retina (Figure 3g). Depending on the virtual depth , the cross correlation creates a different extent of blur on the spectrum (Figure 3h). Finally, the Fourier transform of the image that is seen by the eye is simply the slice along on .
When the eye focuses at the focal plane (), the spectrum lies entirely on and the cross correlation with has no effect on the spectrum along . The resulting retinal image has maximum spatial resolution , which is independent of the depth of the focal plane .
When the eye is not focused on the virtual depth plane, i.e., , the cross correlation results in a segment of width
on the -axis (Figure 3h). Note that , and thereby the half-maximum bandwidth of the spatial frequency of the perceived image is upper-bounded by .
3.4. Spatial Resolution of Retinal Images
We can now characterize the spatial resolution of a multifocal display. Suppose the eye can accommodate freely on any depth within a target accommodation range, . Let be the set of depths of the focal planes created by the multifocal display. When the eye focuses at , the image formed on its retina has spatial resolution of
where the first term characterizes the inherent spatial resolution of the display unit, and the second term characterizes spatial resolution limited by accommodation, i.e., potential mismatch between the focus plane of the eye and the display. This bound on spatial resolution is a physical constraint caused by the finite display pixel pitch and the limiting aperture (i.e., the pupil), and it holds even if the retina had an infinitely high spatial sampling rate. No post-processing method, including linear depth filtering, optimization-based filtering, and nonlinear deconvolution, can surpass this limitation.
3.5. Minimum Number of Focal Planes Needed
As can be seen in (3), the maximum spacing between any two focal planes in diopter determines , the lowest perceived spatial resolution within the accommodation range. If we desire a multifocal display with spatial resolution across the accommodation range to be at least , , the best we can do with focal planes is to have a constant inter-focal separation in diopter. This results in an inequality that
Thereby, increasing the number of focal planes (and distributing them uniformly in diopter) is required for multifocal displays to support higher spatial resolution and wider accommodation range.
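The uniform-in-diopter placement described above can be sketched as follows. The helper names are hypothetical, and the example range (0 to 4 diopters with 41 planes) is chosen only for illustration; it yields a spacing of 0.1 diopter.

```python
import math


def uniform_diopter_planes(far_d, near_d, num_planes):
    """Focal-plane depths (in diopters) with constant dioptric spacing.

    far_d and near_d bound the accommodation range in diopters
    (0 diopters corresponds to optical infinity).
    """
    if num_planes < 2:
        raise ValueError("need at least two planes to span a range")
    step = (near_d - far_d) / (num_planes - 1)
    return [far_d + i * step for i in range(num_planes)]


def diopters_to_meters(depth_d):
    """Convert a depth in diopters to a metric distance."""
    return math.inf if depth_d == 0 else 1.0 / depth_d


planes = uniform_diopter_planes(0.0, 4.0, 41)
spacing = planes[1] - planes[0]  # constant inter-focal separation
```

Note that equal spacing in diopters is highly non-uniform in meters: most planes cluster near the viewer, and only a few cover the range from 1 m to infinity.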
3.6. Relationship to Prior Work
There are many prior works studying the minimum focal-plane spacing of multifocal displays. Rolland et al.  compute the depth-of-focus based on typical acuity of human eyes ( cycles per degree) and pupil diameter ( m) and conclude that focal planes equally spaced by diopter are required to accommodate from m to . Their analysis and ours share the same underlying principle, namely maintaining the minimum resolution seen by the eye within the accommodation range, and therefore yield the same number of required focal planes. By taking m, , m, and , we have , which concurs with their result. MacKenzie et al. [2010; 2012] measure accommodation responses of human eyes during usage of multifocal displays with different plane-separation configurations under linear depth filtering [Akeley et al., 2004]. Their results suggest that focal-plane separations as wide as diopter can drive accommodation with insignificant deviation from the natural accommodation. However, it is also reported that smaller plane separations provide more natural accommodation and higher retinal contrast, features that are desirable in any VR/AR display. By enabling dense focal stacks of focal-plane separation as small as diopter, our prototype can simultaneously provide proper accommodation cues and display high-resolution images onto the retina.
3.7. Maximum Number of Focal Planes Needed
At the other extreme, if we have a sufficient number of focal planes, the limiting factor becomes the pixel pitch of the display unit. In this scenario, for a focal plane at virtual depth , the retinal image of an eye focuses on will have maximal spatial resolution if
In other words, the depth-of-field of a focal plane, defined as the depth range over which a focused eye obtains the maximum resolution, is diopters. Since the maximum accommodation range of the multifocal display with a convex tunable lens is diopter, we need at least focal planes to achieve the maximum spatial resolution of the multifocal display across the maximum supported depth range, or focal planes for a depth range of . For example, our prototype, which has m, m, and pupil diameter m, would require focal planes for the maximum possible depth range of m to infinity or diopters to reach the resolution upper-bound. For a shorter working range of 25 cm to infinity, or 4 diopters, it would require 41 focal planes.
4. Generating Dense Focal Stacks
We now have a clear goal: designing a multifocal display that supports a very dense focal stack, which enables the display of high-resolution images across a wide accommodation range. The key bottleneck for building multifocal displays with dense focal stacks is the settling time of the focus-tunable lens. The concept described in this section outlines an approach to mitigate this bottleneck and provides a design template for displaying dense focal stacks.
4.1. Focal-Length Tracking
The centerpiece of our proposed work is the idea that we do not have to wait for the focus-tunable lens to settle at a particular focal length. Instead, if we constantly drive the lens so that it sweeps across a range of focal lengths, and subsequently track the focal length in real time, we can display the corresponding focal plane without waiting for the focus-tunable lens to settle. This enables us to display as many focal planes as we want, as long as the display supports the required frame rate.
While the optical power of focus-tunable lenses is controlled by an input voltage or current, simply measuring these values provides only inaccurate and biased estimates of the focal length. This is due to the time-varying transfer functions of tunable lenses, which are known to be sensitive to operating temperature and irregular motor delays. Instead, we propose to estimate the focal length by probing the tunable lens optically. This yields robust estimates that are insensitive to such unexpected factors.
In order to measure the focal length, we send a collimated infrared laser beam through the edge of the focus-tunable lens. Since the direction of the outgoing beam depends on the focal length, the laser beam changes direction as the focal length changes. There are many approaches to measure this change in direction, including using a one-dimensional pixel array or an encoder system. In our prototype, we use a one-dimensional position sensing detector (PSD) to enable fast and accurate measurement of the location. The schematic is shown in Figure 4a.
The focal length of the lens is estimated as follows. We first align the laser so that it is parallel to the optical axis of the focus-tunable lens. After deflection by the lens, the beam is incident on a spot on the PSD whose position, as shown in Figure 4b, is given as
where is the focal length of the lens, is the distance measured along the optical axis between the lens and the PSD, and is the distance between the optical center of the lens and the spot the laser is incident on. Note that the displacement is an affine function of the optical power of the focus-tunable lens.
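The affine dependence of the spot position on optical power can be checked with a paraxial thin-lens model. The sketch below assumes a beam parallel to the optical axis that is bent toward the focal point, so that its height at a distance d behind the lens is h(1 - dP); the names and the sign convention are assumptions and may differ from the equation used in this paper.

```python
def spot_position(beam_offset_m, lens_to_psd_m, lens_power_d):
    """Height of the laser spot on the PSD plane, measured from the optical axis.

    Assumption: paraxial thin lens. A ray entering parallel to the axis at
    height h leaves with slope -h * P, so after a distance d its height is
    h - d * h * P, which is affine in the optical power P.
    """
    return beam_offset_m * (1.0 - lens_to_psd_m * lens_power_d)


# Hypothetical geometry: beam 4 mm off axis, PSD 5 cm behind the lens.
positions = [spot_position(0.004, 0.05, p) for p in (0.0, 10.0, 20.0)]
```

In this assumed geometry the spot moves linearly from 4 mm to the axis as the power increases from 0 to 20 diopters, so equal steps in optical power map to equal steps on the sensor.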
We next discuss how the location of the spot is estimated from the PSD outputs. A PSD is composed of a photodiode and a resistor distributed throughout the active area. The photodiode has two connectors at its anode and a common cathode. Suppose the total length of the active area of the PSD is . When a light ray reaches a point at on the PSD, the generated photocurrent will flow from each anode connector to the cathode with an amount inversely proportional to the resistance in between. Since resistance is proportional to length, we have the ratio of the currents at the two anode connectors as
As can be seen, the optical power of the tunable lens is an affine function of . With simple calibration (to get the two coefficients), we can easily estimate the value.
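The two-step estimate, spot position from the PSD currents followed by an affine calibration to optical power, can be sketched as follows. The position formula is the standard relation for a one-dimensional lateral-effect PSD and is used here as an assumption, since it may differ in sign or normalization from the expression in this paper. All names and calibration values are hypothetical.

```python
def psd_position(i1, i2, active_length):
    """Spot position relative to the center of a 1-D lateral-effect PSD.

    i1, i2: photocurrents at the two anode connectors.
    Assumed standard relation: x = (L / 2) * (i2 - i1) / (i1 + i2).
    The sign depends on which connector is labeled 1.
    """
    total = i1 + i2
    if total <= 0:
        raise ValueError("no photocurrent; the beam is not on the sensor")
    return 0.5 * active_length * (i2 - i1) / total


def fit_affine(x0, y0, x1, y1):
    """Slope and intercept of the line through two calibration points."""
    slope = (y1 - y0) / (x1 - x0)
    return slope, y0 - slope * x0


def power_from_position(x, slope, intercept):
    """Optical power estimated from a spot position via the affine map."""
    return slope * x + intercept


# Hypothetical calibration: two known lens powers and their measured positions.
slope, intercept = fit_affine(-0.002, 10.0, 0.002, 20.0)
estimate = power_from_position(psd_position(1.0, 1.0, 0.015), slope, intercept)
```

Because the position depends only on the ratio of currents, the estimate is insensitive to fluctuations in laser intensity, which is one reason a PSD suits this kind of tracking.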
4.2. The Need for Fast Displays
In order to display multiple focal planes within one frame, we also require a display that has a frame rate greater than or equal to the focal-plane display rate. To achieve this, we use a digital micromirror device (DMD)-based projector as our display. Commercially available DMDs can easily achieve upwards of bitplanes per second. Following the design in [Chang et al., 2016], we modulate the intensity of the projector’s light source to display 8-bit images; this enables us to display each focal plane with 8 bits of intensity and generate as many as focal planes per second.
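To make the bitplane arithmetic concrete, the sketch below decomposes an 8-bit pixel value into eight binary bitplanes and recombines them under the assumption that the light source intensity is scaled by a factor of two per bit while every bitplane is shown for an equal time. This weighting is an assumption about how source modulation could work, not a description of the cited design, and the bitplane rate used in the example is a placeholder.

```python
def to_bitplanes(value):
    """Binary states of one DMD mirror for an 8-bit pixel, least significant bit first."""
    if not 0 <= value <= 255:
        raise ValueError("expected an 8-bit value")
    return [(value >> k) & 1 for k in range(8)]


def perceived_intensity(bitplanes):
    """Time-integrated intensity, assuming the source is 2**k times brighter for bit k."""
    return sum(bit * (1 << k) for k, bit in enumerate(bitplanes))


def focal_planes_per_second(bitplane_rate_hz, bits_per_plane=8):
    """Number of full-bit-depth focal planes that fit into one second of bitplanes."""
    return bitplane_rate_hz // bits_per_plane


# Placeholder rate, chosen only so the arithmetic is easy to follow.
rate = focal_planes_per_second(12800)
```

Equal-duration bitplanes keep the timing of each focal plane constant, which simplifies synchronizing the projector with a continuously sweeping lens.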
4.3. Design Criteria and Analysis
We now analyze the system in terms of various desiderata and the system configurations required to achieve them.
4.3.1. Achieving a Full Accommodation Range
A first requirement is that the system be capable of supporting the full accommodation range of typical human eyes, i.e., generate focal planes from m to infinity. Suppose the optical power of the focus-tunable lens ranges from to diopter. From (1), we have
where is the distance between the display unit and the tunable lens, is the distance of the virtual image of the display unit from the lens, is the focal length of the lens at time , and is the optical power of the lens in diopter. Since we want to range from cm to infinity, ranges from to . Thereby, we need
An immediate implication of this is that , i.e., to support the full accommodation range of a human eye, we need a focus-tunable lens whose optical power spans at least diopters. We have more choice over the actual range of focal lengths taken by the lens. A simple choice is to set ; this ensures that we can render focal planes at infinity; subsequently, we choose sufficiently large to cover diopters. By choosing a small value of , we can have a small and thereby achieve a compact display.
4.3.2. Field-of-View
The proposed display shares the same field-of-view and eye box characteristics with other multifocal displays. The field-of-view will be maximized when the eye is located right near the lens. This will result in a field-of-view of , where is the height (or width) of the physical display (or its magnified image via lensing). When the eye is further away from the lens, the numerical aperture will limit the extent of the field-of-view. Since the apertures of most tunable lenses are small (around 1 cm in diameter), we would prefer to put the eye as close to the lens as possible. This can be achieved by embedding the dichroic mirror (the right one in Figure 4a) onto the rim of the lens. For our prototype that will be described in Section 5, we use a 4f system to relay the eye to the aperture of the focus-tunable lens. Our choice of the 4f system enables a -degree field-of-view, limited by the numerical aperture of the lens in the 4f system.
There are alternate implementations of focus-tunable lenses that have the potential for providing larger apertures and hence, displays with larger fields of view. Bernet and Ritsch-Marte [2008] design two phase plates that produce the phase function of a lens whose focal length is determined by the relative orientation of the plates; hence, we could obtain a large-aperture focus-tunable lens by rotating one of the phase plates. Other promising solutions to enable large-aperture tunable lensing include the Fresnel and Pancharatnam-Berry liquid crystal lenses [Jamali et al., 2018a, b] and tunable metasurface doublets [Arbabi et al., 2018]. In all of these cases, our tracking method could be used to provide precise estimates of the focal length.
4.3.3. Eye Box
The eye box of multifocal displays is often small, and the proposed display is no exception. Due to the depth difference of focal planes, as the eye shifts, contents on each focal plane shift by different amounts, with the closer ones traversing more than the farther ones. This leaves uncovered as well as overlapping regions at depth discontinuities. Further, the severity of the artifacts depends largely on the specific content being displayed. In practice, we observe that these artifacts are not distracting for small eye movements on the order of a few millimeters. This problem can be addressed by incorporating an eye tracker, as in Mercier et al. [2017].
4.4. Reduced Maximum Brightness and Energy Efficiency
Key limitations of our proposed design are the reduction in maximum brightness and, depending on the implementation, the energy efficiency of the device. Suppose we are displaying focal planes per frame and frames per second. Each focal plane is displayed for second, which is -times smaller compared to typical VR displays with one focal plane. For our prototype, we use a high-power LED to compensate for the reduction in brightness. Further, brightness of the display is not a primary concern since there are no competing ambient light sources for VR displays.
Energy efficiency of the proposed method also depends on the type of display used. For our prototype, since we use a DMD to spatially modulate the intensity at each pixel, we waste of the energy. This can be avoided by using OLED displays, where a pixel can be completely turned off. An alternate solution is to use a phase spatial light modulator (SLM) [Damberg et al., 2016] to spatially redistribute a light source so that each focal plane only gets illuminated at pixels that need to be displayed; a challenge here is the slow refresh rate of the current crop of phase SLMs. Another option is to use a laser along with a 2D galvo to selectively illuminate the content at each depth plane; however, 2D galvos are often slow when operated in non-resonant modes.
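The brightness trade-off is easy to quantify. The sketch below computes the time available to each focal plane and the resulting duty cycle relative to a single-plane display, assuming that the frame time is divided equally among the planes and ignoring any blanking interval; the function names are hypothetical.

```python
def plane_exposure_seconds(planes_per_frame, frames_per_second):
    """Upper bound on the display time of one focal plane, in seconds."""
    return 1.0 / (planes_per_frame * frames_per_second)


def relative_peak_brightness(planes_per_frame):
    """Duty cycle of one plane relative to a display with a single focal plane."""
    return 1.0 / planes_per_frame


# Example: 40 focal planes per frame at 40 frames per second.
exposure = plane_exposure_seconds(40, 40)   # 625 microseconds per plane
duty = relative_peak_brightness(40)         # 2.5% of a single-plane display
```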
5. Proof-of-Concept Prototype
In this section, we present a lab prototype that generates a dense focal stack using high-speed tracking of the focal length of a tunable lens and a high-speed display.
5.1. Implementation Details
The prototype is composed of three functional blocks: the focus-tunable lens, the focal-length tracking device, and a DMD-based projector. All three components are controlled by an FPGA (Altera DE0-nano-SOC). The FPGA drives the tunable lens with a digital-to-analog converter (DAC), following Algorithm 1. Simultaneously, the FPGA reads the focal-length tracking output with an analog-to-digital converter (ADC) and uses the value to trigger the projector to display the next focal plane. Every time a focal plane has been displayed, the projector is immediately turned off to avoid blur caused by the continuously changing focal-length configurations. A photo of the prototype is shown in Figure 5. In the following, we will introduce each component in detail.
Thereby, we can estimate the current depth if we know and , which only requires two measurements to estimate. With a camera focused at m and , we get the two corresponding ADC readings and . The two points can be accurately measured, since the depth-of-field of the camera at m is very small, and infinity can be approximated as long as the image is far away. Since (10) has an affine relationship, we only need to divide evenly into the desired number of focal planes.
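Given the two calibration readings, the trigger levels for the focal planes follow from an even division of the ADC range. The sketch below illustrates this step; the function name and the ADC values are hypothetical, and rounding to integer ADC codes is an assumption.

```python
def trigger_levels(adc_near, adc_far, num_planes):
    """ADC codes at which each focal plane should be displayed.

    adc_near, adc_far: readings recorded at the nearest and farthest
    calibration depths. Because the reading is affine in optical power,
    an even split in ADC codes gives an even split in diopters.
    """
    if num_planes < 2:
        raise ValueError("need at least two planes to span the range")
    step = (adc_far - adc_near) / (num_planes - 1)
    return [round(adc_near + i * step) for i in range(num_planes)]


# Placeholder readings for illustration only.
levels = trigger_levels(1000, 3000, 5)
```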
5.1.2. Control Algorithm.
The FPGA follows Algorithm 1 to coordinate the tunable lens and the projector. On a high level, we drive the tunable lens with a triangular wave by continuously increasing/decreasing the DAC levels. We simultaneously detect the PSD’s DAC reading to trigger the projection of focal planes. When the last/first focal plane is displayed, we switch the direction of the waveform. Note that while Algorithm 1 is written in serial form, every module in the FPGA runs in parallel.
The control algorithm is simple yet robust. It is known that the transfer function of the tunable lens is sensitive to many factors, including device temperature and unexpected motor delay and errors [Optotune, 2017]. In our experience, even with the same input waveform, we observe different offsets and peak-to-peak values on the PSD output waveform for each period. Since the algorithm does not drive the tunable lens with fixed DAC values and instead directly detects the PSD output (i.e., the focal length of the tunable lens), it is robust to these unexpected factors. However, the robustness comes at a price. Due to the motor delay, the peak-to-peak value is often a lot larger than . This causes the frame rate of the prototype (1600 focal planes per second, or 40 focal planes per frame at 40 fps) to be lower than the highest display frame rate ( focal planes per second).
Note that since 40 fps is close to the persistence of vision, our prototype sometimes leads to flickering. However, the core capability of the proposed device is a higher number of focal planes per second, so we can obtain a higher frame rate by reducing the number of focal planes per frame. For example, we can achieve 60 fps by operating at 26 focal planes per frame.
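The trade-off between frame rate and focal planes per frame is simple integer arithmetic on the prototype's throughput of 1600 focal planes per second. The sketch below illustrates it; the function name is hypothetical.

```python
PLANES_PER_SECOND = 1600  # prototype throughput reported in the paper


def planes_per_frame(fps, planes_per_second=PLANES_PER_SECOND):
    """Largest whole number of focal planes that fits into one frame."""
    return planes_per_second // fps


for fps in (40, 60):
    print(fps, planes_per_frame(fps))
# 40 40
# 60 26
```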
5.1.3. Focus-Tunable Lens and its Driver
We use the focus-tunable lens EL-10-30 from Optotune [Optotune, 2017]. The optical power of the lens ranges from approximately to diopters and is an affine function of the driving current input from to mA. We use a 12-bit DAC (MCP4725) with a current buffer (BUF634) to drive the lens. The DAC provides thousand samples per second, and the current buffer has a bandwidth of MHz. This allows us to faithfully create a triangular input voltage up to several hundred Hertz. The circuit is drawn in Figure 6b.
5.1.4. Focal-Length Tracking and Processing
The focal-length tracking device is composed of a one-dimensional PSD (SL15 from OSI Optoelectronics), two 800 nm dichroic short-pass mirrors (Edmund Optics #69-220), and a 980 nm collimated infrared laser (Thorlabs CPS980S). We drive the PSD with a reverse bias voltage of V. This enables us to have m precision on the PSD surface and rise time of s. Across the designed accommodation range, the laser spot traverses within m on the PSD surface, which has a total length of m. This allows us to accurately differentiate up to focal-length configurations.
The analog processing circuit has three stages: an amplifier, an analog calculation stage, and an ADC, as shown in Figure 6a. We use two operational amplifiers (TI OPA-37) to amplify the two output currents of the PSD. The gain-bandwidth of the amplifiers is MHz, which can fully support our desired operating speeds. We also add a low-pass filter with a cut-off frequency of kHz at the amplifier, as a denoising filter. The computation of is conducted with two operational amplifiers (TI OPA-37) and an analog divider (TI MPY634). We use a 12-bit ADC (LTC2308) with a rate of thousand samples per second to port the analog voltage to the FPGA.
Overall, the latency of the focal-length tracking circuit is s. The bottleneck is the low-pass filter and the ADC; the rest of the components have time responses in nanoseconds. Note that in s the focal length of the tunable lens changes by diopters, which is well below the detection capabilities of the eye [Campbell, 1957]. Also, the stability of the acquired focal stack (which took a few hours to capture) indicates that the latency was either minimal or at least predictable and can be dealt with by calibration.
5.1.5. DMD-based Projector
The projector is composed of a DLP-7000 DMD from Texas Instruments, projection optics from Vialux, and a high-power LED XHP35A from Cree. We control the DMD with a development module Vialux V-7000. We update the configuration of micro-mirrors every s. Following Chang et al. [2016], we use pulse-width modulation, performed through an LED driver (TI LM3409HV), to change the intensity of the LED concurrently with the update of micro-mirrors. This enables us to display at most 8-bit images per second.
For simplicity, we preload each of the 40 focal planes onto the development module. Each focal stack requires bitplanes, and thereby, we can store up to focal stacks on the module. The lack of video-streaming capability needs further investigation to make the system practical; it could potentially be resolved by using the customized display controller in [Lincoln et al., 2016, 2017] that is capable of displaying bitplanes with 80 μs latency. This would enable us to display 8-bit focal planes per second. We also note that whether we use depth filtering or not, the transmitted bitplanes are sparse since each pixel has content, at best, at a few depth planes. Thereby, we do not need to transmit the entire bitplanes.
Note that we divide the 8 bitplanes of each focal plane into two groups of 4 bitplanes, and we display the first group when the triangular waveform is increasing and the second group when it is decreasing. From the results that will be presented in Section 6, we can see that the images of the two groups align nicely. This demonstrates the high accuracy of the focal-length tracking.
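The split of an 8-bit value into two groups of 4 bitplanes can be illustrated per pixel as follows. The text does not state which bits go into which group, so assigning the upper four bits to the rising ramp is an assumption, and the function names are hypothetical.

```python
def bitplanes(value):
    """Return the 8 binary bitplane values of one 8-bit pixel, LSB first."""
    if not 0 <= value <= 255:
        raise ValueError("expected an 8-bit value")
    return [(value >> b) & 1 for b in range(8)]


def split_groups(value):
    """Split the bitplanes into (rising-ramp group, falling-ramp group).

    Assumption: the upper 4 bits are shown on the rising ramp and the
    lower 4 bits on the falling ramp.
    """
    bits = bitplanes(value)
    return bits[4:], bits[:4]


def recombine(up, down):
    """Intensity perceived when both groups land on the same focal plane."""
    bits = down + up  # restore LSB-first order
    return sum(bit << b for b, bit in enumerate(bits))


up, down = split_groups(178)
print(up, down, recombine(up, down))  # [1, 1, 0, 1] [0, 1, 0, 0] 178
```

The intended pixel value is recovered only if the two groups are displayed at the same depth, which is why the alignment of the two groups serves as a check on tracking accuracy.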
As a quick verification of the prototype, we used the burst mode on the Nikon camera to capture multiple photographs at an aperture of , ISO 12,800 and an exposure time of s. Figure 7 shows six examples of displayed focal planes. Since a single focal plane requires an exposure time of s, the captured images are composed of at most focal planes.
6. Experimental Evaluations
We showcase the performance of our prototype on a range of scenes designed carefully to highlight the important features of our system. The supplemental material has video illustrations that contain full camera focus stacks of all results in this section.
6.1. Focal-Length Tracking
To evaluate the focal-length tracking module, we measure the input signal to the focus-tunable lens and the PSD output with an Analog Discovery oscilloscope. The measurements are shown in Figure 8. As can be seen, the output waveform matches that of the input. The high bandwidth of the PSD and the analog circuit enables us to track the focal length robustly in real time. From the figure, we can also observe the delay of the focus-tunable lens ( s).
6.2. Depths of Focal Planes
As stated previously, measuring the depth of the displayed focal planes is very difficult. Thereby, we use a method similar to depth-from-defocus to measure their depths. When a camera is focusing at infinity, the defocus blur kernel size will be linearly dependent on the depth of the (virtual) object in diopter. This provides a method to measure the depths of the focal planes.
For each focal plane, we display a pixels white spot at the center, capture multiple images with various exposure times, and average the images to reduce noise. We label the diameter of the defocus blur kernels and show the results in Figure 9. As can be seen, when the blur-kernel diameters can be accurately estimated, i.e., for large defocused spots on closer focal planes, the values fit nicely to a straight line, indicating that the depths of the focal planes are uniformly separated in diopters. However, as a spot comes into focus, the estimation of blur-kernel diameters becomes inaccurate, since we cannot display an infinitesimal spot due to the finite pixel pitch of the display. Since there were no special treatments to individual planes in terms of system design or algorithm, we expect these focal planes to be placed accurately as well.
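The linearity check described above amounts to fitting a straight line to blur-kernel diameter versus focal-plane index and inspecting the residuals. The sketch below uses an ordinary least-squares fit. The data are synthetic and generated from an exact line; they are not measurements from the paper, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def max_residual(xs, ys):
    """Largest absolute deviation of the data from the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return max(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys))


# Synthetic example: plane indices and blur diameters on an exact line.
planes = list(range(20, 40))
diameters = [3.0 * k + 5.0 for k in planes]
print(fit_line(planes, diameters))      # close to (3.0, 5.0) by construction
print(max_residual(planes, diameters))  # close to 0 by construction
```

With real measurements, a small maximum residual over the planes where the diameter can be estimated reliably would indicate uniform spacing in diopters.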
6.3. Characterizing the System Point-Spread Function
To characterize our prototype, we measure its point spread function with a Nikon D3400 using a m prime lens. We display a static scene that is composed of spots with each spot at a different focal plane. Using the camera, we capture a focal stack of images ranging from to diopters away from the focus-tunable lens. For improved contrast, we remove the background and noise due to dust and scratches on the lens by capturing the same focal stack with no spot shown on the display. Figure 10 shows the point spread function of the display at four different focus settings, and a video of this focal stack is attached in the supplemental material. The result shows that the prototype is able to display the spots at depths concurrently within a frame, which verifies the functionality of the proposed method. The shape and the asymmetry of the blur kernels can be attributed to the spherical aberration of the focus-tunable lens as well as the throw of the projection lens on the DMD.
Figure 12. (a) Camera focuses at the 5th focal plane. (b, c) Camera focuses at the estimated inter-plane locations of the 40-plane display and the 30-plane display, respectively. (d) Camera focuses at the 6th focal plane, an inter-plane location of a 20-plane display. (e) Camera focuses at the 10th focal plane, an inter-plane location of a 4-plane display. The modulation transfer functions are plotted in (f).
6.4. Benefits of Dense Focal Stacks
To evaluate the benefit provided by dense focal stacks, we simulate two multifocal displays, one with 4 focal planes and the other with 40 focal planes. The 40 focal planes are distributed uniformly in diopter from 0 to 4 diopters, and the 4-plane display has focal planes at the depth of the 5th, 15th, 25th, and 35th focal planes of the 40-plane display. The scene is composed of 28 resolution charts, each at a different depth from 0 to 4 diopters (please refer to the supplemental material for figures of the entire scene). The dimension of the scene is pixels.
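The plane layout described above can be written out in a short sketch. It assumes that the 40 planes include both endpoints of the 0 to 4 diopter range, which the text does not state explicitly; the function names are hypothetical.

```python
def plane_depths(num_planes=40, d_min=0.0, d_max=4.0):
    """Depths in diopters of planes spaced uniformly over [d_min, d_max].

    Assumption: both endpoints are occupied by a focal plane.
    """
    step = (d_max - d_min) / (num_planes - 1)
    return [d_min + i * step for i in range(num_planes)]


def worst_case_defocus(depths):
    """Largest dioptric distance from any depth in range to its nearest plane."""
    return max(b - a for a, b in zip(depths, depths[1:])) / 2


depths40 = plane_depths()
# The 4-plane display reuses the 5th, 15th, 25th, and 35th planes (1-based).
depths4 = [depths40[k - 1] for k in (5, 15, 25, 35)]

print(worst_case_defocus(depths40))
print(worst_case_defocus(depths4))
```

Under this assumption the worst-case defocus of the 4-plane display is about ten times that of the 40-plane display, since its planes are ten times farther apart in diopters.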
We render the scene with three methods:
No depth filtering: We directly quantize the depth channel of the images to obtain the focal planes of different depths.
Linear depth filtering: Following [Akeley et al., 2004], we apply a triangular filter on the focal planes based on their depths.
Optimization-based filtering: We follow the formulation proposed in [Mercier et al., 2017]. We first render the desired retinal images focused at 81 depths uniformly distributed across 0 to 4 diopters in the scene with a pupil diameter of 4 mm. Then we solve the optimization problem to get the content to be displayed on the focal planes. We initialize the optimization process with the results of direct quantization and perform gradient descent with iterations to ensure convergence.
The perceived images of the resolution chart at diopters are shown in Figure 11; a plane at diopters is on a focal plane of the 40-plane display and is at the furthest inter-focal plane of the 4-plane display. Note that we simulate the results with a pupil diameter of 4 mm, which is a typical value used to simulate retinal images of human eyes.
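The first two rendering methods listed above can be sketched per scene point as follows. This is a minimal illustration, assuming plane depths sorted in ascending order and expressed in diopters; the function names and the example plane depths are hypothetical. The optimization-based method is not sketched here.

```python
def quantize(depth, plane_depths):
    """No depth filtering: index of the focal plane nearest to the depth."""
    return min(range(len(plane_depths)),
               key=lambda i: abs(plane_depths[i] - depth))


def linear_weights(depth, plane_depths):
    """Linear depth filtering: triangular blending weights per focal plane.

    The intensity of a scene point is shared between the two focal planes
    that bracket it, in proportion to dioptric proximity.
    """
    weights = [0.0] * len(plane_depths)
    if depth <= plane_depths[0]:
        weights[0] = 1.0
        return weights
    if depth >= plane_depths[-1]:
        weights[-1] = 1.0
        return weights
    for i in range(len(plane_depths) - 1):
        lo, hi = plane_depths[i], plane_depths[i + 1]
        if lo <= depth <= hi:
            t = (depth - lo) / (hi - lo)
            weights[i] = 1.0 - t
            weights[i + 1] = t
            break
    return weights


planes = [0.0, 1.0, 2.0, 3.0, 4.0]  # illustrative plane depths in diopters
print(quantize(1.25, planes))        # 1
print(linear_weights(1.25, planes))  # [0.0, 0.75, 0.25, 0.0, 0.0]
```

The weights always sum to one, so linear filtering preserves the total intensity of each scene point while spreading it over at most two planes.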
As can be seen from the results, the perceived images of the 40-plane display closely follow those of the ground truth — with high spatial resolution if the camera is focused on the plane (Figure 11a) and natural retinal blur when the camera is not focused (Figure 11b). In comparison, at its inter-plane location (Figure 11a), the 4-plane display has much lower spatial resolution than the other display, regardless of the depth filtering methods applied. These results verify our analysis in Section 3.
To evaluate the benefit provided by dense focal stacks in providing higher spatial resolution when the eye is focused at an inter-plane location, we implement four multifocal displays with 4, 20, 30 and 40 focal planes, respectively, on our prototype. The 4-plane display has its focal planes on the th focal planes of the 40-plane display, and the 20-plane display has its focal planes on all the odd-numbered focal planes. We display a resolution chart on the fifth focal plane of the 40-plane display; this corresponds to a depth plane that all three displays can render.
To compare the worst-case scenario where an eye focuses on an inter-plane location, we focus the camera at the middle of two consecutive focal planes of each of the displays. In essence, we are reproducing the effect of VAC where the vergence cue forces the ocular lens to focus on an inter-focal plane. For the 40-plane display, this is between focal planes five and six. For the 20-plane display, this is on the sixth focal plane of the 40-plane display. And for the 4-plane display, this is on the tenth focal plane of the 40-plane display. We also focus the camera on the estimated inter-plane location of a 30-plane display. The results captured by a camera with a m lens are shown in Figure 12. As can be seen, the higher number of focal planes (smaller focal-plane separation) results in higher spatial resolution at inter-plane locations.
Next, we compare our prototype with a 4-plane multifocal display on a real scene. Note that we implement the 4-plane multifocal display with our 40-plane prototype by showing contents on the th focal planes. The images captured by the camera are shown in Figure 13. For the 4-plane multifocal display, when used without linear depth filtering, virtual objects at multiple depths are focused/defocused as groups; when used with linear depth filtering, the same object appearing in two focal planes reduces the visibility and thereby lowers the resolution of the display. In comparison, the proposed method produces smooth focus/defocus cues across the range of depths, and the perceived images at inter-plane locations (e.g. m) have higher spatial resolution than the 4-plane display.
Finally, we render a more complex scene [eMirage, 2017] using Blender. From the rendered all-in-focus image and its depth map, we perform linear filtering and display the results with the prototype. Focus stack images captured using a camera are shown in Figure 14. We observe realistic focus and defocus cues in the captured images.
This paper provides a simple but effective technique for displaying virtual scenes that are made of a dense collection of focal planes. Despite the bulk of our current prototype, the proposed tracking technique is fairly straightforward and extremely amenable to miniaturization. We believe that the system proposed in the paper for high-speed tracking could spur innovation in not just virtual and augmented reality systems but also in traditional light field displays.
Acknowledgements. The authors acknowledge support via the NSF CAREER grant CCF-1652569 and a gift from Adobe Research.
- Akşit et al.  Kaan Akşit, Ward Lopes, Jonghyun Kim, Peter Shirley, and David Luebke. 2017. Near-eye Varifocal Augmented Reality Display Using See-through Screens. ACM Transactions on Graphics (TOG) 36, 6 (2017), 189:1–189:13.
- Akeley et al.  Kurt Akeley, Simon J Watt, Ahna Reza Girshick, and Martin S Banks. 2004. A Stereo Display Prototype with Multiple Focal Distances. ACM Transactions on Graphics (TOG) 23, 3 (2004), 804–813.
- Arbabi et al.  Ehsan Arbabi, Amir Arbabi, Seyedeh Mahsa Kamali, Yu Horie, MohammadSadegh Faraji-Dana, and Andrei Faraon. 2018. MEMS-tunable Dielectric Metasurface Lens. Nature Communications 9, 1 (2018), 812.
- Bernet and Ritsch-Marte  Stefan Bernet and Monika Ritsch-Marte. 2008. Adjustable Refractive Power From Diffractive Moiré Elements. Applied Optics 47, 21 (2008), 3722–3730.
- Campbell  Fergus W Campbell. 1957. The Depth of Field of the Human Eye. Optica Acta: International Journal of Optics 4, 4 (1957), 157–164.
- Chang et al.  Jen-Hao Rick Chang, BVK Vijaya Kumar, and Aswin C Sankaranarayanan. 2016. Shades of Gray: High Bit-depth Projection using Light Intensity Control. Optics Express 24, 24 (2016), 27937–27950.
- Damberg et al.  Gerwin Damberg, James Gregson, and Wolfgang Heidrich. 2016. High brightness HDR projection using dynamic freeform lensing. ACM Transactions on Graphics (TOG) 35, 3 (2016), 24:1–24:11.
- eMirage  eMirage. 2017. Barcelona Pavillion. https://download.blender.org/demo/test/pabellon_barcelona_v1.scene_.zip.
- Hecht  Eugene Hecht. 2002. Optics. Addison-Wesley.
- Hoffman et al.  David M Hoffman, Ahna R Girshick, Kurt Akeley, and Martin S Banks. 2008. Vergence-accommodation Conflicts Hinder Visual Performance and Cause Visual Fatigue. Journal of Vision 8, 3 (2008), 33.
- Hu and Hua  Xinda Hu and Hong Hua. 2014. High-Resolution Optical See-Through Multi-focal-plane Head-mounted Display Using Freeform Optics. Optics Express 22, 11 (2014), 13896–13903.
- Hua  Hong Hua. 2017. Enabling Focus Cues in Head-mounted Displays. Proc. IEEE 105, 5 (2017), 805–824.
- Huang et al.  Fu-Chung Huang, Kevin Chen, and Gordon Wetzstein. 2015. The Light Field Stereoscope: Immersive Computer Graphics via Factored Near-eye Light Field Displays with Focus Cues. ACM Transactions on Graphics (TOG) 34, 4 (2015), 60:1–60:12.
- Jamali et al. [2018a] Afsoon Jamali, Douglas Bryant, Yanli Zhang, Anders Grunnet-Jepsen, Achintya Bhowmik, and Philip J Bos. 2018a. Design of a Large Aperture Tunable Refractive Fresnel Liquid Crystal Lens. Applied Optics 57, 7 (2018), B10–B19.
- Jamali et al. [2018b] Afsoon Jamali, Comrun Yousefzadeh, Colin McGinty, Douglas Bryant, and Philip Bos. 2018b. A Continuous Variable Lens System to Address the Accommodation Problem in VR and 3D Displays. In Imaging and Applied Optics. 3Tu2G.5.
- Rolland et al.  Jannick P. Rolland, Myron W. Krueger, and Alexei A. Goon. 1999. Dynamic Focusing in Head-mounted Displays. Proceedings of SPIE 3639 (1999), 3639–3639–8.
- Johnson et al.  Paul V Johnson, Jared AQ Parnell, Joohwan Kim, Christopher D Saunter, Gordon D Love, and Martin S Banks. 2016. Dynamic Lens and Monovision 3D Displays to Improve Viewer Comfort. Optics Express 24, 11 (2016), 11808–11827.
- Konrad et al.  Robert Konrad, Emily A Cooper, and Gordon Wetzstein. 2016. Novel Optical Configurations for Virtual Reality: Evaluating User Preference and Performance with Focus-tunable and Monovision Near-eye Displays. In Conference on Human Factors in Computing Systems (CHI). 1211–1220.
- Konrad et al.  Robert Konrad, Nitish Padmanaban, Keenan Molner, Emily A Cooper, and Gordon Wetzstein. 2017. Accommodation-invariant Computational Near-eye Displays. ACM Transactions on Graphics (TOG) 36, 4 (2017), 88:1–88:12.
- Kramida  Gregory Kramida. 2016. Resolving the Vergence-accommodation Conflict in Head-mounted Displays. IEEE Transactions on Visualization and Computer Graphics 22, 7 (2016), 1912–1931.
- Lanman and Luebke  Douglas Lanman and David Luebke. 2013. Near-eye Light Field Displays. ACM Transactions on Graphics (TOG) 32, 6 (2013), 220:1–220:10.
- Lee et al.  Seungjae Lee, Youngjin Jo, Dongheon Yoo, Jaebum Cho, Dukho Lee, and Byoungho Lee. 2018. TomoReal: Tomographic Displays. arXiv:1804.04619 (2018).
- Lincoln et al.  Peter Lincoln, Alex Blate, Montek Singh, Andrei State, Mary C. Whitton, Turner Whitted, and Henry Fuchs. 2017. Scene-adaptive High Dynamic Range Display for Low Latency Augmented Reality. In Proceedings of the 21st ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games.
- Lincoln et al.  Peter Lincoln, Alex Blate, Montek Singh, Turner Whitted, Andrei State, Anselmo Lastra, and Henry Fuchs. 2016. From Motion to Photons in 80 Microseconds: Towards Minimal Latency for Virtual and Augmented Reality. Transactions on Visualization and Computer Graphics 22, 4 (2016), 1367–1376.
- Liu et al.  Sheng Liu, Dewen Cheng, and Hong Hua. 2008. An Optical See-through Head Mounted Display with Addressable Focal Planes. In IEEE/ACM International Symposium on Mixed and Augmented Reality. 33–42.
- Liu and Hua  Sheng Liu and Hong Hua. 2009. Time-multiplexed Dual-focal Plane Head-mounted Display with a Liquid Lens. Optics Letters 34, 11 (2009), 1642–1644.
- Llull et al.  Patrick Llull, Noah Bedard, Wanmin Wu, Ivana Tosic, Kathrin Berkner, and Nikhil Balram. 2015. Design and Optimization of a Near-eye Multifocal Display System for Augmented Reality. In Imaging and Applied Optics. JTH3A.5.
- Love et al.  Gordon D Love, David M Hoffman, Philip JW Hands, James Gao, Andrew K Kirby, and Martin S Banks. 2009. High-speed Switchable Lens Enables the Development of a Volumetric Stereoscopic Display. Optics Express 17, 18 (2009), 15716–15725.
- MacKenzie et al.  Kevin J MacKenzie, Ruth A Dickson, and Simon J Watt. 2012. Vergence and Accommodation to Multiple-image-plane Stereoscopic Displays: ”Real World” Responses with Practical Image-plane Separations? Journal of Electronic Imaging 21 (2012), 21–21–9.
- MacKenzie et al.  Kevin J MacKenzie, David M Hoffman, and Simon J Watt. 2010. Accommodation to Multiple-focal-plane Displays: Implications for Improving Stereoscopic Displays and for Accommodation Control. Journal of Vision 10, 8 (2010), 22.
- Maimone et al.  Andrew Maimone, Andreas Georgiou, and Joel S Kollin. 2017. Holographic Near-eye Displays for Virtual and Augmented Reality. ACM Transactions on Graphics (TOG) 36, 4 (2017), 85:1–85:16.
- Matsuda et al.  Nathan Matsuda, Alexander Fix, and Douglas Lanman. 2017. Focal Surface Displays. ACM Transactions on Graphics (TOG) 36, 4 (2017), 86:1–86:14.
- Mercier et al.  Olivier Mercier, Yusufu Sulai, Kevin Mackenzie, Marina Zannoli, James Hillis, Derek Nowrouzezahrai, and Douglas Lanman. 2017. Fast Gaze-contingent Optimal Decompositions for Multifocal Displays. ACM Transactions on Graphics (TOG) 36, 6 (2017), 237:1–237:15.
- Miau et al.  Daniel Miau, Oliver Cossairt, and Shree K Nayar. 2013. Focal Sweep Videography with Deformable Optics. In IEEE Conference on Computational Photography (ICCP).
- Narain et al.  Rahul Narain, Rachel A Albert, Abdullah Bulbul, Gregory J Ward, Martin S Banks, and James F O’Brien. 2015. Optimal Presentation of Imagery with Focus Cues on Multi-plane Displays. ACM Transactions on Graphics (TOG) 34, 4 (2015), 59:1–59:12.
- Optotune  Optotune. 2017. Optotune Electrically Tunable Lens EL-10-30. http://www.optotune.com/images/products/Optotune.
- Padmanaban et al.  Nitish Padmanaban, Robert Konrad, Tal Stramer, Emily A Cooper, and Gordon Wetzstein. 2017. Optimizing Virtual Reality for All Users Through Gaze-contingent and Adaptive Focus Displays. Proceedings of the National Academy of Sciences 114, 9 (2017), 2183–2188.
- Ravikumar et al.  Sowmya Ravikumar, Kurt Akeley, and Martin S Banks. 2011. Creating Effective Focus Cues in Multi-plane 3D Displays. Optics Express 19, 21 (2011), 20940–20952.
- Shiwa et al.  Shinichi Shiwa, Katsuyuki Omura, and Fumio Kishino. 1996. Proposal for a 3-D Display with Accommodative Compensation: 3DDAC. Journal of the Society for Information Display 4, 4 (1996), 255–261.
- Sugihara and Miyasato  Toshiaki Sugihara and Tsutomu Miyasato. 1998. System Development of Fatigue-less HMD System 3DDAC (3D Display with Accommodative Compensation): System Implementation of Mk. 4 in Light-weight HMD. In ITE Technical Report 22.1. 33–36.
- Sun et al.  Qi Sun, Fu-Chung Huang, Joohwan Kim, Li-Yi Wei, David Luebke, and Arie Kaufman. 2017. Perceptually-guided Foveation for Light Field Displays. ACM Transactions on Graphics (TOG) 36, 6 (2017), 192:1–192:13.
- Tabiryan et al.  Nelson V Tabiryan, Svetlana V Serak, David E Roberts, Diane M Steeves, and Brian R Kimball. 2015. Thin Waveplate Lenses of Switchable Focal Length–New Generation in Optics. Optics Express 23, 20 (2015), 25783–25794.
- Varioptic  Varioptic. 2017. Varioptic Variable Focus Liquid Lens ARCTIC 25H. http://varioptic.com/media/cms_page_media/45/MADS_-_160429_-_Arctic_25H_family.pdf.
- Vishwanath and Blaser  Dhanraj Vishwanath and Erik Blaser. 2010. Retinal Blur and the Perception of Egocentric Distance. Journal of Vision 10, 10 (2010), 26.
- Watt et al.  Simon J Watt, Kurt Akeley, Marc O Ernst, and Martin S Banks. 2005. Focus Cues Affect Perceived Depth. Journal of Vision 5, 10 (2005), 7.
- Wetzstein et al.  Gordon Wetzstein, Douglas Lanman, Wolfgang Heidrich, and Ramesh Raskar. 2011. Layered 3D: Tomographic Image Synthesis for Attenuation-based Light Field and High Dynamic Range Displays. ACM Transactions on Graphics (TOG) 30, 4 (2011), 95:1–95:12.
- Zannoli et al.  Marina Zannoli, Gordon D Love, Rahul Narain, and Martin S Banks. 2016. Blur and the Perception of Depth at Occlusions. Journal of Vision 16, 6 (2016), 17.
Appendix A Light Field Analysis
This section provides a detailed derivation of the analysis discussed in Section 3 of the main paper. The analysis closely follows the one in [Narain et al., 2015]. A notable difference, however, is that we provide analytical expressions for the perceived spatial resolution (Equation (3) in the main paper) and the minimum number of focal planes required (Equation (5)), whereas they only provide numerical results. For simplicity, we consider a flatland where a light field is two-dimensional and is parameterized by intercepts with two parallel axes, and . The two axes are separated by unit, and for each , we align the origin of -axis to . We model the human eye with a camera model that is composed of a finite-aperture lens and a sensor plane away from the lens, as used by Mercier et al. [2017] and Sun et al. [2017]. We assume that the display and the sensor emit and receive light isotropically, so that each pixel on the display uniformly emits light rays toward every direction, and vice versa for the sensor.
Light Field Generated by a Display
Let us decompose the optical path from the display to the retina (sensor) and examine the effect in the frequency domain due to each component. Due to the finite pixel pitch, the light field created by the display can be modeled as
where represents two-dimensional convolution, is the pitch of the display pixel, and is the target light field. The Fourier transform of is
The finite pixel pitch acts as an anti-aliasing filter and thus we consider only the central spectrum replica (). Also, we assume for all to avoid aliasing. Since the light field is nonnegative, or , we have . Therefore, we have
Therefore, in the ensuing derivation, we will focus on the upper-bound
The light field spectrum forms a line segment parallel to , as plotted in Figure 15a.
Propagation to the eye
After leaving the display, the light field propagates and gets refracted by the focus-tunable lens before reaching the eye. Under first-order optics, these operations can be modeled by coordinate transformations of the light field [Hecht, 2002]. Let . After propagating a distance , the output light field is a reparameterization of the input light field and can be represented as
After being refracted by a thin lens with focal length , the output light field right after the lens is
Since and are invertible, we can use the stretch theorem of -dimensional Fourier transform to analyze their effect in the frequency domain. The general stretch theorem states that: Let , be the Fourier transform operator, and
be any invertible matrix. We have
where is the Fourier transform of , is the variable in frequency domain, represents determinant of , and . By applying the stretch theorem to and , we can see that propagation and refraction shear the Fourier transform of the light field along and , respectively, as shown in Figure 15c-d.
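For reference, one standard way to write the stretch theorem is given below, together with a generic shear example. The symbols $g$, $G$, $A$, $\boldsymbol{\omega}$, and $d$ are notation assumed here for illustration and may differ from the symbols used in the paper; which of the two transformations takes this particular matrix form depends on the light-field parameterization.

```latex
\mathcal{F}\{ g(A\mathbf{x}) \}(\boldsymbol{\omega})
  = \frac{1}{|\det A|}\, G\!\left(A^{-\top}\boldsymbol{\omega}\right),
\qquad
A = \begin{bmatrix} 1 & d \\ 0 & 1 \end{bmatrix}
\;\Rightarrow\;
A^{-\top} = \begin{bmatrix} 1 & 0 \\ -d & 1 \end{bmatrix}.
```

Since $\det A = 1$ for a shear, the spectrum is not rescaled; a shear along one axis in the primal domain simply becomes a shear along the other axis in the frequency domain.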
Light Field Incident on the Retina
After reaching the eye, the light field is partially blocked by the pupil, refracted by the lens of the eye, propagated to the retina, and finally integrated over all directions to form an image. The light field reaching the retina can be represented as
and is the diameter of the pupil. To understand the effect of the aperture, we analyze a more general situation where the light field is multiplied with a general function and transformed by an invertible with unit determinant. By multiplication theorem, we have
where we use a change of variable by setting , and the last equation holds because . Equation (13) relates the effect of the aperture directly to the output light field at the retina: The spectrum of the output light field is the cross correlation between the transformed (refracted and propagated) input spectrum with full aperture and the transformed spectrum of the aperture function. The result is important since it significantly simplifies our analysis, and as a result, we are able to derive an analytical expression of spatial resolution and number of focal planes needed.
In our scenario, we have . For a virtual display at , is a line segment of slope within , where is the magnified pixel pitch. According to Equation (13), is simply the cross correlation of and . After transformation, is a line segment of slope , where . Similarly, is a line segment with slope within . Note that we only consider because the cross-correlation result at the boundary has value . Since function is monotonically decreasing for , the half-maximum spectral bandwidth () must be within the region. Let the depth the eye is focusing at be . We have . When , we can see from the above expression that is a flat segment within , where is the overall magnification caused by the focus-tunable lens and the lens of the eye. From the Fourier slice theorem, we know that the spectrum of the image is simply the slice along . In this case, the aperture has no effect on the final image, since the cross correlation does not extend or reduce the spectrum along , and the final image has the highest spatial resolution .
Suppose the eye does not focus on the virtual display, or . In the case of a full aperture (), the resulting image will be a constant DC term (completely blurred) because the slice along is a delta function at . In the case of finite aperture diameter , we can show by simple geometry (see Figure 15h) that the bandwidth of the -slice of , or equivalently, the region , is bounded by . And we have
Thereby, based on the Fourier slice theorem, the bandwidth of the retinal images is bounded by .
Appendix B Other Discussions
Color display can be implemented by using a three-color LED and cycling through the colors using time-division multiplexing. This would lead to a loss in time resolution or focal-stack resolution by a factor of 3. This loss in resolution can be completely avoided with OLED-based high-speed displays, since each group of pixels automatically generates the desired image at each focal stack.
Stereo virtual display
The proposed method can be extended to support stereo virtual reality displays. The most straightforward method is to use two sets of the prototype, one for each eye. Since all focal planes are shown in each frame, there is no need to synchronize the two focus-tunable lenses. It is also possible to create a stereo display with a single focus-tunable lens and a single tracking module; the design for this is shown in Figure 16. This design trades half of the focal planes to support stereo, and thereby only requires one set of the prototype and additional optics. Polarization is used to ensure that each eye only sees the scene that it is meant to see.
Appendix C Simulated Scene
Figure 17 shows the simulated images of Figure 11 in the paper with full field-of-view. There are 28 resolution charts located at various depths from 0 to 4 diopters (as indicated by beneath each of them). In the figure, we plot the ground-truth rendered images and simulated retinal images when focused on 0.02 diopters and 0.9 diopters. The rest of the focus stack can be seen in the supplemental video.