Virtual and augmented reality (VR and AR) are poised to transform computer-mediated communication, spanning an exciting range of applications in science, engineering, medicine, the arts, entertainment, commerce, and several other areas. While impressive, current-generation VR and AR systems cannot yet match the visual fidelity of our real-world experiences along several dimensions, including resolution, field of view, and latency. Delivering such a high-quality visual experience in real time requires enormous computational resources, beyond what recent improvements in rendering hardware and software provide.
Only a very small fraction of the scene, the part projected onto the fovea at the center of the human retina, is perceived by the human visual system (HVS) in its finest detail. Visual acuity is the ability to resolve detail and is measured as grating resolution in cycles per degree. Studies show that although the HVS has a wide field of view (FOV), the region with the highest visual acuity, known as the foveal region, covers only the central ° of the visual field (Levin et al., 2011). Foveation refers to the decrease in acuity with angular distance from the fovea in the human visual system (Guenter et al., 2012). Foveated rendering leverages this property of the human visual system to render only a small fraction of each graphics frame in fine detail. It allows us to reconcile the mutually conflicting goals of high visual realism, interactive frame rates, and low power consumption on modern VR and AR devices. As foveated rendering becomes an integral part of future AR/VR devices, it is important to understand the current state and the observed trends in the field.
The concept of foveated displays is not new. Early gaze-contingent displays were used in flight simulators (Murphy and Duchowski, 2001). However, these early methods were developed mainly for desktop monitors, whose FOV was approximately ° ° (Murphy and Duchowski, 2001), a fraction of the FOV of modern wide-FOV displays. Patney et al. (2016) compare the percentage of pixels that lie in the peripheral region across devices ranging from a small-FOV smartphone screen to a wide-FOV VR HMD, as shown in Figure 1. In a ° wide-FOV HMD, only 4% of the screen pixels lie in the foveal region, while the rest lie in the peripheral region (Patney et al., 2016). Foveated rendering therefore becomes even more important for VR and AR headsets than for traditional displays. When such a large percentage of pixels lies in the low-acuity peripheral region, efficient rendering systems can obtain significant computational gains by allocating fewer resources to render the peripheral region.
Until recently, real-time eye tracking was not commonly available in VR and AR headsets. With the commoditization of eye trackers in these headsets, however, foveated rendering techniques can use gaze information to dynamically render the foveal region at a higher quality and the rest at a lower quality without a drop in the overall perceived quality of the image. As the FOV of future HMDs increases to match that of human vision, foveated rendering will emerge as an essential component of the real-time 3D graphics rendering pipeline for large-scale, wide-FOV, high-resolution VR and AR displays. 4D light fields are another area that can benefit from foveated rendering techniques. A 4D light field represents a scene from multiple camera positions and therefore incurs a huge rendering cost. Sun et al. (2017) and Meng et al. (2020a) present methods that use foveated rendering techniques to accelerate light field rendering, enabling it in real time.
Recently, Spjut et al. (2020) presented a classification of foveated displays. Their classification categorizes foveated rendering systems based on the resolution distribution function and gaze contingency. The resolution distribution function measures how well the system’s acuity distribution matches the pre-measured visual acuity of the user (as measured by the Snellen eye chart). Gaze contingency measures how the system adapts to changes in the gaze direction. This paper focuses on foveated rendering methods for VR/AR systems. We consider different attributes that characterize foveated rendering, including several visual factors, resolution distribution, foveation space, gaze information, and anti-aliasing response, and we classify several influential research papers along these dimensions. A similar review is provided in concurrent work by Mohanto et al. (2022), with a different taxonomy of foveated rendering methods.
We organize our paper as follows. Section 2 provides a brief overview of the human visual system and its limitations that enable foveated rendering. A taxonomy of various foveated rendering methods based on their salient characteristics is presented in Section 3. Section 4 discusses the methods used to evaluate the perceived quality of the foveated image. Finally, some concluding remarks along with the challenges and possible future research directions are presented in Section 5.
2. A Brief Overview of the Human Visual System
Understanding human perception and its limitations helps improve rendering quality and efficiency. The human visual system consists of the eye and the brain: the eye serves as a camera, while the brain processes the information. When a light ray enters the eye, it first passes through the cornea and undergoes refraction. The refracted ray then passes through the aqueous humor, the pupil (the opening in the iris), the lens, and the vitreous humor, and finally reaches the retina, the image sensor of the human eye. Figure 2 shows the anatomy of the human eye. Photoreceptors in the retina detect the incoming photons and convert them to electrical signals. Ganglion cells then transmit the information from the photoreceptors to the central nervous system for further processing.
It is well known that not all the information that enters the eye reaches the brain; a significant amount of compression occurs at the retinal level, and several components are responsible for this pre-processing. First, the distribution of photoreceptors and ganglion cells is not uniform across the retina. There are two categories of photoreceptors, rods and cones, which differ in characteristics such as shape, distribution across the retina, and patterns of synaptic connections, among many others (Purves et al., 2001). We have hundreds of millions of rods, while the number of cones is around six to seven million (Taylor and Batey, 2012). Moreover, a large portion of the cones is concentrated near the central part of the retina, known as the fovea, and their density decreases sharply as the distance from the center increases. Rods, on the other hand, are spread across the entire retina but are absent at the fovea. Figure 3 shows the density distribution of rods and cones across the visual field. In addition, rods are highly sensitive to light, whereas cones provide excellent spatial discrimination and are responsible for color vision. The differences between the rod and cone systems give rise to three modes of vision, depending on the amount of light that reaches the eye. The first is scotopic, or night, vision, which facilitates perception in dim lighting. Only rods are active in this mode, and because rods cannot discriminate colors, scotopic vision is grayscale. The second is photopic, or daylight, vision, which provides color perception mediated by the cone cells. The third, mesopic vision, combines photopic and scotopic vision and occurs in low but not fully dark lighting conditions, with both rods and cones active. Most indoor and outdoor lighting conditions produce photopic vision, in which cones are the most active. However, when rendering dim scenes, the properties of rods must be taken into account. The perceived level of detail at any point in time therefore also depends on the active mode of vision.
The ganglion cells in the inner part of the retina transmit the information from the photoreceptors to the neural processing system. While we have hundreds of millions of photoreceptors, the retina contains only around 1.2 million ganglion cells (Kim et al., 2021). This many-to-one connection further compresses the information reaching the brain. Wässle et al. (1990) observed that ganglion cell density is also non-uniform across the retina. The number of photoreceptors connected to a single ganglion cell increases with distance from the fovea.
The spatially varying photoreceptor and ganglion cell densities, which are highest at the center, lead to foveated vision in the human visual system. As a result, visual sensitivity is maximal at the fovea and decreases towards the periphery. The eccentricity of a point on the retina is defined as the angular distance of that point from the fovea. The exact transition point from the foveal to the peripheral region is not defined uniformly across disciplines. The fovea is surrounded by the parafovea, which is in turn surrounded by the perifovea. Based on these regions, the human visual field can be broadly divided into central and peripheral vision. Neuropsychologists typically define central vision as the region from ° to ° - ° eccentricity, with everything beyond that considered peripheral vision (Loschky et al., 2019). Visual cognition researchers generally consider the region from 0° to 5° eccentricity as central vision, comprising the foveal region from ° to approximately ° - ° eccentricity and the parafovea from approximately ° - ° to °, and anything beyond 5° as peripheral vision. Vision science researchers consider central vision to extend from 0° to 10°, with peripheral vision beyond that (Loschky et al., 2019). Whatever the exact definition, the vast majority of the human visual field falls in peripheral vision.
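To make eccentricity concrete for a display, the sketch below computes the eccentricity of a pixel as the angle between the gaze ray and the ray through that pixel. It is a minimal sketch under a hypothetical geometry: a flat screen viewed head-on, with the eye on the axis through the screen center. The function names and the parameters `pitch_mm` and `distance_mm` are illustrative assumptions, and an HMD would additionally require its lens distortion model.

```python
import math


def pixel_ray(px, py, width, height, pitch_mm, distance_mm):
    """Direction from the eye to the centre of pixel (px, py).

    Hypothetical geometry: the eye lies on the axis through the screen
    centre, distance_mm in front of a flat screen with square pixels of
    side pitch_mm.
    """
    x = (px + 0.5 - width / 2) * pitch_mm
    y = (py + 0.5 - height / 2) * pitch_mm
    return (x, y, distance_mm)


def eccentricity_deg(pixel, gaze, width, height, pitch_mm, distance_mm):
    """Angle in degrees between the gaze ray and the ray through `pixel`."""
    a = pixel_ray(*pixel, width, height, pitch_mm, distance_mm)
    b = pixel_ray(*gaze, width, height, pitch_mm, distance_mm)
    cos_angle = sum(u * v for u, v in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))
    # Clamp to [-1, 1] to guard against floating-point round-off.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


# Example: a 1920x1080 panel with 0.25 mm pixels viewed from 500 mm,
# gaze at the screen centre, pixel near the right edge.
print(eccentricity_deg((1900, 540), (960, 540), 1920, 1080, 0.25, 500.0))
```

Using the angle between rays, rather than a pixel distance divided by a constant pixels-per-degree, keeps the result correct far from the screen center, where a flat screen packs more pixels into each degree.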
Foveated rendering exploits the foveation in the human visual system and renders the peripheral region at a relatively lower quality. For a more comprehensive survey on the physiological and perceptual aspects of human vision and limitations of the human visual perception, readers are referred to the work by Weier et al. (2017).
3. A Taxonomy of Foveated Rendering Systems
The non-uniform distribution of rods and cones across the retina, along with their different responses to incoming light, leads to several interesting properties that change from foveal to peripheral vision. For instance, the perception of detail decreases with eccentricity, and the time needed to perceive an object of a given size increases with eccentricity (Sun and Varshney, 2018). Rods have limited spatial discrimination, while cones have excellent spatial localization and color discrimination (Purves et al., 2001). Interestingly, however, the ability to detect motion in the peripheral region is higher than would be expected. Moreover, the ability to differentiate an object’s relative velocity and flicker sensitivity remain uniform across the visual field. In addition, Patney et al. (2016) show that preserving contrast in the peripheral region improves the perception of filtered images. Studies also show that although color perception decreases with eccentricity, the decrease is gradual and color discrimination survives at high eccentricities (Hansen et al., 2009a). Rosenholtz (2016) argues that peripheral vision is not merely a low-resolution version of foveal vision but instead produces a texture-like representation that preserves the summary statistics of a scene. Which properties need to be preserved in the peripheral region depends on the task at hand. Foveated rendering can leverage these characteristics of human vision that vary with eccentricity.
While most real-time rendering uses the rasterization pipeline, ray tracing produces more realistic images with accurate reflections, refractions, and shadows, and is attracting growing interest (Ludvigsen and Elster, 2010). Foveation can be applied to both rasterization and ray-tracing pipelines. The cost of real-time rasterization is often dominated by shading operations. Shading costs can be reduced by simplifying the operation, pre-computing parts of it, sharing results across a set of fragments, and/or reducing the number of operations. Foveated rendering reduces the average number of operations per pixel by leveraging the non-uniform visual sensitivity across the visual field. We next categorize and compare foveated rendering techniques based on their characteristics and discuss their use in efficient real-time rendering. Figure 4 gives an overview of the presented taxonomy.
Section 3.1 surveys the visual factors that foveated rendering methods consider, Section 3.2 discusses various models representing the spatial variance of visual sensitivity, and Section 3.3 considers the stage at which foveation is implemented in foveated rendering systems. Section 3.4 gives an overview of the techniques based on their relation to gaze tracking. Finally, Section 3.5 discusses anti-aliasing techniques for foveated rendering.
3.1. Visual Factors
Visual perception depends on various factors including, but not limited to, eccentricity, contrast, luminance, brightness, color, shape, motion, fixations, and saliency. In this section, we survey the foveated rendering methods based on the different visual factors considered.
3.1.1. Content-independent factors
One widely used visual measure is visual acuity, defined as the reciprocal of the angle (in arcminutes) subtended by the smallest resolvable detail at a given eccentricity. Visual acuity varies across the visual field: early work (Anstis, 1974) shows that acuity drops as eccentricity increases. Reducing the resolution as a function of angular distance from the gaze point has therefore always characterized foveated rendering, and one of the distinguishing features among the various approaches is how this reduction is accomplished. One common way is to render multiple discrete layers independently, each sampled at a different rate (2012; 2016). These layers, called eccentricity layers, are later upsampled to the screen resolution and composited together. As shading is the most time-consuming operation in the rendering pipeline, certain works (2014; 2014; ) vary the shading rate across the visual field, shading only once per several pixels in the peripheral region. Some recent works (Meng et al., 2018; Koskela et al., 2019; Tursun et al., 2019) instead vary the resolution in a more continuous manner. These methods map the visual field to a non-linear space in which uniform rendering matches the visual acuity of human vision. While the above methods are mainly developed for the rasterization pipeline, ray-tracing methods naturally allow smooth non-uniform sampling in screen space. Weier et al. (2016) and Fujita and Harada (2014) incorporate foveated rendering into ray tracing and achieve sparse sampling by decreasing the sampling probability with distance from the gaze point. For more discussion on the trade-offs between discrete and continuous distributions, we refer readers to recent work on foveated displays (Spjut and Boudaoud, 2019). The idea behind all these methods is to mimic a distribution that closely aligns with the visual acuity of the HVS. Different methods approximate the non-uniform sampling distribution function differently to best match acuity across the visual field; we discuss this in detail in Section 3.2.
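The idea of shading only once per several pixels in the periphery can be illustrated with a tile-based scheme. The sketch below assigns each screen tile a coarse shading rate from the eccentricity of its center and counts the resulting shading invocations. It is a generic illustration rather than any of the cited methods: the eccentricity thresholds, the tile size, and the constant pixels-per-degree `ppd` (a small-angle approximation) are hypothetical.

```python
import math


def shading_rate(ecc_deg, thresholds=(5.0, 15.0)):
    """Side length of the pixel block that shares one shading sample.

    The eccentricity thresholds (degrees) are hypothetical.
    """
    if ecc_deg < thresholds[0]:
        return 1  # full rate: one shade per pixel
    if ecc_deg < thresholds[1]:
        return 2  # one shade per 2x2 block
    return 4      # one shade per 4x4 block


def shading_invocations(width, height, gaze, ppd, tile=16, thresholds=(5.0, 15.0)):
    """Count shading invocations when each tile uses the rate at its centre."""
    total = 0
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            w = min(tile, width - tx)
            h = min(tile, height - ty)
            # Small-angle approximation: pixel distance / ppd ~ degrees.
            ecc = math.hypot(tx + w / 2 - gaze[0], ty + h / 2 - gaze[1]) / ppd
            rate = shading_rate(ecc, thresholds)
            total += math.ceil(w / rate) * math.ceil(h / rate)
    return total


full = 1920 * 1080
foveated = shading_invocations(1920, 1080, gaze=(960, 540), ppd=20.0)
print(f"{foveated} shading invocations vs {full} ({foveated / full:.1%})")
```

Choosing the rate per tile rather than per pixel mirrors how coarse shading is typically exposed in hardware, at the cost of a blocky transition between rate zones.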
Furthermore, recent studies (Barbot et al., 2021; Abrams et al., 2012) show that visual acuity is higher along the horizontal axis than along the vertical axis at the same eccentricity. Studies further show that the lower visual field has better visual acuity than the upper visual field. Most current foveated rendering techniques do not account for this asymmetry and assume uniform acuity across the horizontal and vertical directions. One exception is the recent method by Ye et al. (2022), which processes the horizontal and vertical visual fields independently and achieves superior image quality.
3.1.2. Content-dependent factors
Image contrast is another major factor impacting visual perception. The minimum amount of contrast (contrast threshold) required to detect a pattern depends on its spatial frequency, in addition to the content-independent eccentricity. The reciprocal of contrast threshold is called contrast sensitivity, and it is characterized by the contrast sensitivity function (CSF) across the human visual field. The function describes the capacity of the HVS to recognize differences in patterns as a function of spatial frequencies (Bull, 2014). Similar to spatial acuity, contrast sensitivity for a given spatial frequency is highest at the fovea and decreases with increasing eccentricity. The contrast sensitivity depends on several factors such as size, contrast, and the viewing angle of the target pattern, and several functions for modeling CSF have been proposed in the literature (Pointer and Hess, 1989; Barten, 2003).
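As a concrete example of an eccentricity-dependent contrast threshold, the sketch below swaps in the model of Geisler and Perry (1998), which the text above does not cite, purely for illustration: $CT(f, e) = CT_0 \exp(\alpha f (e + e_2)/e_2)$, with contrast sensitivity $1/CT$. Setting the threshold equal to the maximum contrast of 1 gives the cut-off frequency at each eccentricity. The constants are the ones commonly quoted for that model and should be treated as assumptions; the model is a single detection threshold and does not separate detection from discrimination.

```python
import math

# Constants of the Geisler-Perry (1998) contrast threshold model, used
# here only as an illustrative assumption.
ALPHA = 0.106    # spatial-frequency decay constant
E2 = 2.3         # half-resolution eccentricity, degrees
CT0 = 1.0 / 64   # minimum contrast threshold


def contrast_threshold(f_cpd, ecc_deg):
    """Contrast needed to detect frequency f_cpd (cycles/degree) at ecc_deg."""
    return CT0 * math.exp(ALPHA * f_cpd * (ecc_deg + E2) / E2)


def contrast_sensitivity(f_cpd, ecc_deg):
    """Contrast sensitivity is the reciprocal of the contrast threshold."""
    return 1.0 / contrast_threshold(f_cpd, ecc_deg)


def cutoff_frequency(ecc_deg):
    """Highest frequency whose threshold does not exceed full contrast (1.0)."""
    return E2 * math.log(1.0 / CT0) / (ALPHA * (ecc_deg + E2))


for ecc in (0, 5, 10, 20, 40):
    print(f"{ecc:>3} deg: cut-off {cutoff_frequency(ecc):5.1f} cycles/deg")
```

A renderer could, for instance, use roughly twice the cut-off frequency (the Nyquist rate) as the sampling density at each eccentricity.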
The relationship between contrast sensitivity and spatial frequency varies across the visual field (Thibos et al., 1996). In general, the foveal CSF shows a gradual loss in contrast sensitivity with increasing spatial frequency, while the peripheral CSF drops abruptly. The contrast sensitivity function has a cut-off spatial frequency beyond which higher frequencies are not discernible, and this cut-off depends on the eccentricity of the image patch in the visual field. Further, the contrast sensitivity function is task-dependent: the peripheral CSFs for detection and discrimination tasks indicate different limits to performance at high spatial frequencies (Anderson et al., 2002). Peripheral contrast sensitivity for the discrimination task degrades much faster than for the detection task, as demonstrated in Figure 5. The difference between the cut-off frequencies for the detection and discrimination tasks indicates that the human visual system can perceive higher frequencies in the periphery but cannot resolve their details. The range of frequencies between the detection and discrimination cut-off frequencies is known as the aliasing zone. Foveated renderers that determine the sampling rate from the discrimination cut-off frequency are faster than those based on the detection cut-off frequency. However, rendering the peripheral region based on the discrimination cut-off frequency reduces contrast in regions whose frequencies fall in the aliasing zone. To maintain detectability, Patney et al. (2016; 2016) suggest enhancing the peripheral contrast in the aliasing zone, which filtering degrades, so that the foveated image maintains the perceptual quality of a non-foveated image. Tursun et al. (2019) exploit the influence of luminance contrast on visual perception. They observe that while a given reduction of spatial resolution in high-contrast regions reduces the perceived quality, the same reduction in low-contrast regions does not. The minimal required resolution for each region is estimated from the local luminance contrast and the angular distance from the gaze point. In addition to luminance, the sensitivity to color also decreases towards the periphery: the hue resolution, or the number of gray levels within each RGB channel that can be perceived, decreases with eccentricity (Hansen et al., 2009b). Liu et al. (2008) show that the number of bits representing each color channel can be monotonically reduced from 8 to 4 as the eccentricity increases from 0° to 30°.
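The bit-depth reduction reported by Liu et al. (2008) can be sketched as follows. The linear schedule from 8 bits at 0° to 4 bits at 30° is an assumption chosen for illustration, since the text only states that the reduction is monotonic. The helper names are hypothetical.

```python
def bits_for_eccentricity(ecc_deg, max_bits=8, min_bits=4, max_ecc=30.0):
    """Bits per colour channel, decreasing linearly (assumed) with eccentricity."""
    t = min(max(ecc_deg / max_ecc, 0.0), 1.0)
    return round(max_bits - t * (max_bits - min_bits))


def quantize_channel(value, bits):
    """Quantize an 8-bit channel value to `bits` bits and expand back to 0..255."""
    levels = (1 << bits) - 1
    q = round(value / 255 * levels)
    return round(q * 255 / levels)


pixel = (200, 120, 37)
for ecc in (0, 15, 30):
    b = bits_for_eccentricity(ecc)
    print(ecc, b, tuple(quantize_channel(c, b) for c in pixel))
```

Expanding back to the 0..255 range keeps the output compatible with an 8-bit framebuffer while showing which levels remain distinguishable; a storage-oriented implementation would keep the quantized codes instead.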
Visual saliency, often a good indicator of visual attention, is another factor influencing the perceived level of detail. Visual saliency is the subjective quality that makes certain regions of a scene more likely to draw the viewer’s attention than their surroundings. The concept of foveal and attentional spotlights states that attention and gaze do not necessarily coincide (Levin et al., 2011). As the user’s attention is automatically attracted to visually salient stimuli, identifying such regions lets a renderer redistribute its cost by allocating more resources to the visually important regions. Many computational models have been introduced to estimate visual saliency (Kim et al., 2010; Lee et al., 2005; Song et al., 2014; Yohanandan et al., 2018). In addition to low-level features such as local color and brightness contrast, high-level features such as the position and identity of objects and the scene context have a significant influence on visual attention. Research works do not agree on a single definition of which regions are visually salient. For example, the adaptive multi-rate shading method by He et al. (2014) considers regions near object silhouettes, shadow edges, and regions of potential specular highlights as visually salient, and renders them at a higher sampling rate so that they are perceived at relatively higher acuity. On the other hand, Stengel et al. (2016) consider regions of high spatial and temporal contrast and saturated colors as visually significant in their foveated rendering system.
3.2. Eccentricity-based analytical visual models
Foveated rendering systems vary resolution or sampling rate across the visual field based on one or more visual factors discussed above. Based on the scene content and task, these systems approximate the human visual field using analytical expressions. These approximations allow different kinds of pixel distributions. In this section, we discuss the similarities and differences among the different distributions.
3.2.1. Hyperbolic model
Visual acuity can be quantitatively represented as the reciprocal of the minimum angle of resolution (MAR) (Weymouth, 1958), which is the smallest angle at which two points are perceived as distinct. The MAR increases linearly with eccentricity. One possible mathematical characterization of visual acuity over eccentricity is therefore given by
$$\omega(e) = m\,e + \omega_0, \qquad \text{acuity}(e) = \frac{1}{\omega(e)} \tag{1}$$
where $\omega(e)$ is the MAR at eccentricity $e$, $\omega_0$ corresponds to the smallest resolvable angle, which occurs at the fovea ($e = 0$), and $m$ represents the slope. We refer to this distribution as the hyperbolic model, since acuity varies as the reciprocal of a linear function of eccentricity. The hyperbolic model serves as a good approximation for visual acuity at low eccentricities (less than ° angular radii) (Guenter et al., 2012), beyond which acuity drops more steeply. Many foveated renderers (Guenter et al., 2012; Swafford et al., 2016; Patney et al., 2016; Vaidyanathan et al., 2014; He et al., 2014; Stengel et al., 2016) base their sampling distribution on this hyperbolic fall-off of acuity.
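The hyperbolic model translates directly into code: the MAR grows as ω(e) = m·e + ω₀ and acuity is 1/ω(e). The sketch below also expresses the sampling density a renderer needs at eccentricity e, relative to the fovea, as ω₀/ω(e). The parameter values (ω₀ of one arcminute and the slope m) are hypothetical placeholders, not values fitted in any cited study.

```python
OMEGA0_DEG = 1.0 / 60.0  # hypothetical foveal MAR: one arcminute
SLOPE_M = 0.022          # hypothetical MAR slope (degrees per degree)


def mar_deg(ecc_deg, omega0=OMEGA0_DEG, m=SLOPE_M):
    """Minimum angle of resolution grows linearly with eccentricity."""
    return m * ecc_deg + omega0


def acuity(ecc_deg, omega0=OMEGA0_DEG, m=SLOPE_M):
    """Acuity is the reciprocal of the MAR, hence a hyperbolic fall-off."""
    return 1.0 / mar_deg(ecc_deg, omega0, m)


def relative_resolution(ecc_deg, omega0=OMEGA0_DEG, m=SLOPE_M):
    """Fraction of the foveal sampling density needed at ecc_deg (1 at the fovea)."""
    return omega0 / mar_deg(ecc_deg, omega0, m)


for ecc in (0, 1, 5, 10, 20):
    print(f"{ecc:>2} deg: relative resolution {relative_resolution(ecc):.3f}")
```

The ratio is a linear sampling density per axis; the number of pixels needed scales with its square, which is why even a moderate per-axis reduction yields large savings.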
Guenter et al.’s (2012) pioneering work shows that foveated rendering can significantly improve rasterization performance. Assuming that a few discrete layers are sufficient to model the acuity fall-off in the human visual system, they use three nested and overlapping rectangular layers rendered at different resolutions, as shown in Figure 6. These three layers, known as eccentricity layers, are centered at the gaze point. The innermost (foveal) layer is rendered at the highest resolution, which is the native display resolution. The middle layer is larger than the inner layer and is rendered at a relatively lower resolution. The outermost layer covers the entire screen and is rendered at a much lower resolution. All three layers are then interpolated to the display resolution and blended smoothly. The hyperbolic function in Equation 1 is used to compute the size and resolution of the eccentricity layers; its parameters depend on the slope of the minimum angle of resolution, which is estimated from user studies. The approach assumes a radially symmetric acuity fall-off, ignoring the differences between the horizontal and vertical axes of the visual field. They limit their method to three layers, although more layers could approximate the human visual system better at the cost of increased complexity. Because only the small foveal layer is rendered at full resolution, the total number of rendered pixels is reduced compared to rendering the entire frame at a uniformly high resolution.
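The pixel savings of nested eccentricity layers can be estimated with simple arithmetic. The sketch below counts the pixels rendered for three rectangular layers, each with its own per-axis downsampling factor, and compares the total with full-resolution rendering. The display size, layer sizes, and downsampling factors are hypothetical examples, not the values derived by Guenter et al. Overlapping areas are counted once per layer, because each layer is rendered independently before blending.

```python
import math

# (layer width, layer height, downsampling factor per axis); hypothetical values.
LAYERS = [
    (320, 320, 1),    # foveal layer at native resolution
    (800, 800, 2),    # middle layer at half resolution per axis
    (1920, 1080, 4),  # outer layer covering the whole screen at quarter resolution
]


def rendered_pixels(layers):
    """Pixels actually shaded when each layer is rendered at its reduced size."""
    return sum(math.ceil(w / s) * math.ceil(h / s) for w, h, s in layers)


full = 1920 * 1080
foveated = rendered_pixels(LAYERS)
print(f"{foveated} of {full} pixels ({foveated / full:.1%})")
```

With these example numbers the three layers shade 392,000 pixels instead of 2,073,600. The outer layer contributes relatively little despite covering the whole screen, which illustrates why the periphery is where most of the savings come from.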
Stengel et al. (2016) extend the above model to incorporate the effect of smooth-pursuit eye movements, which are triggered unconsciously when a moving object attracts attention. They model the region of focus as a straight line obtained by integrating gaze positions over consecutive frames, and they measure eccentricity as the distance to this line. Consequently, the highest-resolution rendering covers a larger elliptical region enveloping the line, instead of a circular region around a single point. This method performs better on higher-latency displays.
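Measuring eccentricity against a line rather than a point amounts to a point-to-segment distance. The sketch below assumes, hypothetically, that gaze samples are already expressed in degrees of visual angle in a flat 2D parameterization (a small-angle approximation), and that the segment runs between the oldest and newest gaze samples of the integration window. How Stengel et al. actually fit the line may differ.

```python
import math


def distance_to_segment(p, a, b):
    """Distance from point p to segment ab (all 2D, in degrees)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0.0:
        # Degenerate case (a fixation): fall back to the distance to a point.
        return math.hypot(px - ax, py - ay)
    # Project p onto the line and clamp to the segment's end points.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len2
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def pursuit_eccentricity(p, gaze_samples):
    """Eccentricity of p relative to the gaze window's line (hypothetical fit)."""
    return distance_to_segment(p, gaze_samples[0], gaze_samples[-1])


samples = [(0.0, 0.0), (2.5, 0.1), (5.0, 0.0)]
print(pursuit_eccentricity((2.0, 3.0), samples))
```

Thresholding this distance yields a capsule-shaped region around the line, which approximates the elongated high-resolution region described above and reduces to a circle when the gaze is stationary.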
Spjut et al. (2020) characterize the acuity distribution of the HVS using a hyperbolic model and evaluate a foveated display by how well it matches this distribution. For natural viewing of images, the hyperbolic model has proven accurate for small eccentricity values (Stengel et al., 2016; Guenter et al., 2012). However, the region beyond ° eccentricity is often rendered at a uniform lower resolution, which potentially limits the speedup that foveated rendering can achieve.
3.2.2. Linear model
Weier et al. (2016) consider a linear fall-off of acuity with increasing eccentricity, as opposed to the above hyperbolic fall-off. They incorporate this linear model into the ray-tracing pipeline so that the sampling probability decreases with distance from the gaze point. Similar to Guenter et al. (2012), the visual field is divided into three regions, characterized by three parameters $r_0$, $r_1$, and $p_{min}$, as shown in Figure 7. The inner region (within $r_0$ degrees) is the foveal area, which is rendered at full resolution and is therefore sampled with a probability of one. The pixels in the outermost peripheral region (beyond $r_1$ degrees from the gaze point) are sampled with a minimum probability of $p_{min}$. The pixels between the two regions (between $r_0$ and $r_1$) are sampled according to the linear equation
$$p(r) = 1 - (1 - p_{min})\,\frac{r - r_0}{r_1 - r_0}, \qquad r_0 < r < r_1 \tag{2}$$
where $r$ is the distance of the pixel from the center of the visual field (the gaze point) in degrees, and $r_0$, $r_1$, and $p_{min}$ are user-defined parameters. The linear approximation model is more relaxed and holds only for small eccentricities, beyond which the region is rendered at a uniform lower resolution. However, because it keeps a higher sampling rate in the periphery, the linear fall-off maintains motion perception there better than the hyperbolic fall-off.
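The piecewise-linear sampling probability can be sketched as follows: probability one within r0 degrees of the gaze point, p_min beyond r1 degrees, and a linear ramp between them. The mask generator and its constant pixels-per-degree conversion are hypothetical additions for illustration. A real renderer must also reconstruct the pixels that are not sampled, which this sketch omits.

```python
import math
import random


def sampling_probability(r_deg, r0, r1, p_min):
    """Probability of tracing a pixel at r_deg degrees from the gaze point."""
    if r_deg <= r0:
        return 1.0      # foveal region: always sampled
    if r_deg >= r1:
        return p_min    # outer periphery: minimum probability
    # Linear ramp from 1 at r0 down to p_min at r1.
    return 1.0 - (1.0 - p_min) * (r_deg - r0) / (r1 - r0)


def sample_mask(width, height, gaze, ppd, r0, r1, p_min, seed=0):
    """Boolean mask of pixels selected for ray tracing in this frame.

    ppd (pixels per degree) is a hypothetical constant, i.e. a
    small-angle approximation of the display geometry.
    """
    rng = random.Random(seed)
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            r = math.hypot(x - gaze[0], y - gaze[1]) / ppd
            row.append(rng.random() < sampling_probability(r, r0, r1, p_min))
        mask.append(row)
    return mask


mask = sample_mask(200, 200, gaze=(100, 100), ppd=10.0, r0=2.0, r1=8.0, p_min=0.1)
traced = sum(map(sum, mask))
print(f"traced {traced} of {200 * 200} pixels")
```

Using a seeded `random.Random` instance makes the mask reproducible for a given frame; varying the seed across frames spreads the sparse peripheral samples over time.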
3.2.3. Logarithmic model
In the human visual system, a log-polar mapping of the eye’s retinal image approximates the excitation process of the visual cortex (Araujo and Dias, 1996). This log-polar mapping ensures that the sensitivity to fine details is high at the center of the visual field and decreases logarithmically with distance from the fovea. In contrast to the above models, which render at multiple resolutions, a mapping from Cartesian space to a log-polar space that matches human visual acuity allows a single uniform-resolution rendering to be applied directly in the transformed space.
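A plain log-polar transform and its inverse are short enough to sketch. Screen positions are expressed relative to the gaze point: the column coordinate u is proportional to the logarithm of the radius and the row coordinate v to the polar angle, so each column of a uniform log-polar buffer covers a ring whose width grows with eccentricity. This is the textbook mapping, not the kernel variant of Meng et al. (2018). The buffer size and `max_radius` (the largest screen distance from the gaze point, assumed to exceed one pixel) are hypothetical parameters.

```python
import math


def to_log_polar(x, y, gaze, max_radius, buf_w, buf_h):
    """Map screen position (x, y) to (u, v) in a buf_w x buf_h log-polar buffer."""
    dx, dy = x - gaze[0], y - gaze[1]
    r = max(math.hypot(dx, dy), 1.0)  # clamp: log(0) is undefined
    u = math.log(r) / math.log(max_radius) * buf_w
    theta = math.atan2(dy, dx) % (2.0 * math.pi)
    v = theta / (2.0 * math.pi) * buf_h
    return u, v


def from_log_polar(u, v, gaze, max_radius, buf_w, buf_h):
    """Inverse mapping from buffer coordinates back to screen space."""
    r = math.exp(u / buf_w * math.log(max_radius))
    theta = v / buf_h * 2.0 * math.pi
    return gaze[0] + r * math.cos(theta), gaze[1] + r * math.sin(theta)


gaze = (960.0, 540.0)
max_radius = math.hypot(960.0, 540.0)  # distance to the farthest screen corner
u, v = to_log_polar(1500.0, 300.0, gaze, max_radius, 512, 512)
print((u, v), from_log_polar(u, v, gaze, max_radius, 512, 512))
```

Because doubling the radius always advances u by the same number of columns, a uniformly sampled buffer automatically devotes more samples per screen pixel to the region around the gaze point.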
Log-polar mapping has proven useful in many areas, such as computer vision, robotics, computer graphics, and image processing, because it provides sufficient visual detail with limited computational resources (Antonelli et al., 2015).
Meng et al. (2018) provide a foveated rendering method for meshes based on the log-polar mapping of the human visual system. Their system introduces a kernel log-polar mapping technique that offers the flexibility to model an acuity fall-off that matches the HVS. Figure 8 shows the mapping from Cartesian space to log-polar space; the acuity decreases logarithmically, with a fall-off shaped by the kernel. The rendering acceleration in this technique builds on deferred shading (Deering et al., 1988), a widely used technique in real-time rendering. The positions, normals, textures, and materials of each surface needed for shading are first rendered into the geometry buffer (G-buffer). The contents of the G-buffer are then transformed from Cartesian space to log-polar space, and the direct and indirect lighting at each pixel is computed in the reduced-resolution log-polar buffer. Finally, inverse kernel log-polar mapping maps the shading back to Cartesian screen space. This method can systematically and continuously vary both the sampling rate and the sampling distribution in log-polar space.
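To see how a kernel can reshape log-polar sampling, the sketch below assumes a hypothetical power kernel K(u) = u^α applied to the normalized log-radius, so that a normalized buffer column u in [0, 1] maps to the screen radius exp(K(u) · log R). With α = 1 this reduces to plain log-polar mapping; with α > 1, more buffer columns are devoted to small radii, that is, to the fovea. This parameterization is an illustrative assumption and may differ from the exact formulation of Meng et al. (2018).

```python
import math


def column_to_radius(u, max_radius, alpha):
    """Screen radius (pixels) sampled by normalized buffer column u in [0, 1]."""
    return math.exp((u ** alpha) * math.log(max_radius))


def columns_within(radius, max_radius, buf_w, alpha):
    """Number of buffer columns spent on the disc of the given radius."""
    rho = math.log(max(radius, 1.0)) / math.log(max_radius)  # normalized log-radius
    return buf_w * rho ** (1.0 / alpha)  # inverse kernel: u = rho ** (1 / alpha)


for alpha in (1.0, 2.0, 4.0):
    cols = columns_within(32.0, 1024.0, 256, alpha)
    print(f"alpha={alpha}: {cols:.0f} of 256 columns cover the inner 32 pixels")
```

A single scalar such as α is one simple way to make the fall-off tunable without changing the rest of the deferred pipeline: only the forward and inverse mappings change.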
Koskela et al. (2019) introduce a similar mapping from Cartesian space to a visual-polar space. Path tracing with one sample per pixel is performed in the visual-polar space, a polar-coordinate space modified so that the sampling distribution aligns with the human visual acuity distribution. To match the HVS, the number of samples along the angular and radial axes can be adjusted. They observe that varying the number of samples along the angular axis results in artifacts in the peripheral region, while varying the samples along the radial axis leads to artifacts in the foveal region. Based on these findings, their optimization varies the resolution along the angular axis in the fovea and rescales the radial axis in the peripheral region. Denoising is applied to the noisy path-traced visual-polar image (Koskela et al., 2019, 2019), and the reconstructed image is transformed back to Cartesian coordinates using the inverse mapping. They report that the visual-polar mapping produces fewer distracting artifacts than the log-polar mapping.
3.2.4. Other approximations
Conformal rendering (Bastani et al., 2017) models visual acuity as a non-linear function of eccentricity. The technique mimics the smooth transition from the foveal to the peripheral region using a non-linear mapping of the screen distance from the gaze point. The projected vertices of the virtual scene are warped into a non-linear space that matches retinal acuity and the HMD lens characteristics. The warped image is then rasterized at a reduced resolution and unwarped back into Cartesian space. Because the warp is applied to vertices, the cost of conformal foveated rendering depends on scene complexity, and performance decreases as the number of vertices in the scene increases. In contrast to the methods that use discrete layers to model the acuity fall-off (Guenter et al., 2012; Weier et al., 2016), the methods by Meng et al. (2018) and Bastani et al. (2017) simulate the acuity fall-off from the foveal to the peripheral region using a continuous, smooth non-linear mapping. This smooth transition helps reduce visual artifacts.
Reddy (2001) proposes a visual acuity model that is a function of both eccentricity and the angular velocity of a stimulus projected onto the retina; in this model, visual acuity varies as an inverse quadratic function of eccentricity (Reddy, 1998). Zheng et al. (2018) develop a foveated rendering method based on Reddy’s (2001) visual acuity model and adjust tessellation levels accordingly. Friston et al. (2019) use a simple radial power-falloff function that maps the distance from the gaze point to a foveated distance; pixel locations are then scaled according to this foveated distance. The overall effect magnifies the region close to the gaze point, giving more importance to the foveal region. They allow the foveation function to be arbitrary and to change every frame, but require it to be invertible so that the foveated image can be mapped back to an unfoveated one for display. Fujita and Harada (2014) assume that acuity drops off as a function of , where is the distance from the gaze point, and define a corresponding sampling distribution function. Recently, Li et al. (2021) presented a log-rectilinear mapping-based foveated rendering method that models an exponential decay in resolution with increasing eccentricity.
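The requirement that the foveation function be invertible can be illustrated with a simple radial power function. The sketch below uses a hypothetical falloff d' = D (d / D)^γ, where d is the distance from the gaze point and D is the largest distance considered. With γ < 1 the mapping magnifies the region near the gaze point, and its inverse is the same function with exponent 1/γ. This is not the exact function used by Friston et al. (2019).

```python
import math


def radial_warp(point, gaze, max_dist, gamma):
    """Move `point` radially so its gaze distance d becomes max_dist * (d / max_dist) ** gamma."""
    dx, dy = point[0] - gaze[0], point[1] - gaze[1]
    d = math.hypot(dx, dy)
    if d == 0.0:
        return point  # the gaze point is a fixed point of the mapping
    scale = max_dist * (d / max_dist) ** gamma / d
    return (gaze[0] + dx * scale, gaze[1] + dy * scale)


def foveate(point, gaze, max_dist, gamma):
    """Foveated position; gamma < 1 magnifies the region near the gaze point."""
    return radial_warp(point, gaze, max_dist, gamma)


def unfoveate(point, gaze, max_dist, gamma):
    """Inverse mapping: the same power function with exponent 1 / gamma."""
    return radial_warp(point, gaze, max_dist, 1.0 / gamma)


g = (100.0, 100.0)
p = foveate((110.0, 100.0), g, 1000.0, 0.5)
print(p, unfoveate(p, g, 1000.0, 0.5))
```

Here a point 10 pixels from the gaze point moves to 100 pixels away, so the foveal neighborhood occupies far more of the foveated image. The closed-form inverse is what allows the foveated image to be mapped back for display.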
The perceptual models described above mimic the visual sensitivity of the HVS across the visual field. They differ mainly in their motivation and complexity, and they distribute samples in subtly different ways. The most commonly used hyperbolic model mimics the earliest stage of vision, the projection onto the retina, whereas the logarithmic model represents processing in the visual cortex. Computational complexity is a major practical factor, especially when real-time rendering is required. The linear model, the least complex of them, retains a higher resolution in the peripheral region than the other models. The logarithmic model offers a more tunable fall-off distribution.
Table 1 shows how visual acuity degrades as a function of eccentricity for each model. Furthermore, Figure 9 shows the resolution-factor distributions across the visual field, using the parameters reported in the corresponding papers. The models differ mainly in the resolution they require in the near-peripheral region. The hyperbolic function gives the most drastic resolution reduction, followed by the kernel log-polar, inverse quadratic, and linear fall-offs. The kernel log-polar mapping therefore demands a higher resolution in the near periphery than the hyperbolic fall-off. In the far periphery, all the models converge to similar resolutions. A resolution distribution function can be chosen from among these approximations based on the scene complexity and the task to be performed. Moreover, as discussed earlier, several visual factors affect the perceived quality of a scene, yet most existing models consider only a few of them. Little research has so far weighed these factors and their inter-relationships to develop an overall perceptual model. Furthermore, visual sensitivity can change dynamically with the external stimuli presented to the user, which further increases the complexity of the required model. Techniques that simplify such complex models, accounting for the inter-relationships between visual factors without affecting perceived visual quality or performance, are therefore important.
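Because the exact functions and parameters summarized in Table 1 come from the individual papers, the sketch below only illustrates how such resolution-factor curves can be expressed and compared in code. The four functional forms (linear, hyperbolic, inverse quadratic, and an exponential decay of the kind produced by logarithmic mappings) and all parameter values are hypothetical choices, not the values used in the cited works; each function returns a resolution factor in (0, 1] relative to the fovea.

```python
import math

# Hypothetical fall-off shapes and parameters, chosen only for illustration.
# Each returns a resolution factor in (0, 1] at eccentricity e (degrees).


def linear(e, k=0.015, r_min=0.1):
    return max(r_min, 1.0 - k * e)  # linear decrease, clamped at r_min


def hyperbolic(e, e2=2.3):
    return 1.0 / (1.0 + e / e2)  # halves at e = e2


def inverse_quadratic(e, eq=10.0):
    return 1.0 / (1.0 + (e / eq) ** 2)


def exponential(e, el=20.0):
    return math.exp(-e / el)  # decay of the kind produced by log mappings


MODELS = {
    "linear": linear,
    "hyperbolic": hyperbolic,
    "inverse quadratic": inverse_quadratic,
    "exponential": exponential,
}

if __name__ == "__main__":
    for e in (0, 5, 10, 20, 40):
        row = ", ".join(f"{name}={f(e):.3f}" for name, f in MODELS.items())
        print(f"e={e:>2} deg: {row}")
```

Swapping one function for another changes only the sampling-density map, which is what makes it practical to select a model per task or per scene.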
Table 1: Distribution models and the functions governing the change of resolution.
3.3. Foveation Space
We classify foveated rendering techniques into screen-based, object-based, or optics-based methods based on the stage of the rendering pipeline that incorporates the concept of foveation: the screen space, the object space, or the optical space (Figure 10).
3.3.1. Screen-based Methods
Most foveated rendering techniques vary the sampling rate in screen space based on the distance from the point of focus, or gaze point. Screen-based foveated rendering approaches (Guenter et al., 2012; Patney et al., 2016; Meng et al., 2018; Vaidyanathan et al., 2014; He et al., 2014) manipulate the frame-buffer contents just before display to reduce the overall shading rate. Vaidyanathan et al. (2014) propose an architecture called coarse pixel shading that enables sparse shading of the peripheral region in screen space. Swafford et al. (2016) study the effect of foveation on ambient occlusion, which is computationally expensive in modern real-time rendering pipelines. Specifically, they examine the effect of varying the number of per-pixel depth samples in the foveal and peripheral regions for Screen-Space Ambient Occlusion, a technique that approximates ambient occlusion. They show that the banding caused by a low number of per-pixel samples is imperceptible in the peripheral region because of its reduced acuity, thus improving performance. Another screen-space foveated renderer is conformal rendering (Bastani et al., 2017), in which the projected vertices are warped into a non-linear space before rasterization. The image is then rasterized at a reduced resolution and finally unwarped into Cartesian space.
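The following Python sketch simulates the idea behind coarse pixel shading in software: pixels inside a foveal disc are shaded individually, whereas each `block`-by-`block` tile outside it is shaded once and the result is broadcast to the whole tile. It is not the hardware architecture of Vaidyanathan et al. (2014); the `shade` function, the circular fovea, and the block size are hypothetical stand-ins chosen to make the reduction in shader invocations visible.

```python
import numpy as np


def shade(x, y):
    """Hypothetical stand-in for an expensive pixel shader."""
    return np.sin(0.3 * x) * np.cos(0.2 * y)


def coarse_pixel_shade(width, height, gaze, fovea_radius, block=2):
    """Shade each foveal pixel individually and each peripheral block once.

    Returns the shaded image and the number of shader invocations.
    """
    img = np.empty((height, width))
    invocations = 0
    gx, gy = gaze
    for by in range(0, height, block):
        for bx in range(0, width, block):
            ys = slice(by, min(by + block, height))
            xs = slice(bx, min(bx + block, width))
            # Block center in integer pixel coordinates.
            cx, cy = bx + (block - 1) / 2.0, by + (block - 1) / 2.0
            if np.hypot(cx - gx, cy - gy) <= fovea_radius:
                yy, xx = np.mgrid[ys, xs]  # full-rate shading in the fovea
                img[ys, xs] = shade(xx, yy)
                invocations += xx.size
            else:
                img[ys, xs] = shade(cx, cy)  # one coarse sample per block
                invocations += 1
    return img, invocations
```

On a 64-by-64 frame with `block=2`, shading everything at full rate costs 4096 invocations, while a fully coarse frame costs 1024; a small fovea lands in between.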
3.3.2. Object-based Methods
Object-based foveated rendering methods manipulate the model geometry before rendering. One of the earliest works on foveated rendering, by Levoy and Whitaker (1990), is object-based. In their ray-traced volume renderer, the number of rays cast per unit area and the number of samples per unit length along each ray vary with the pixel's angular distance from the gaze direction. The system prefilters the 3D volume using a 3D MIP map and uses fewer samples in the peripheral region. Several early works (Murphy and Duchowski, 2001; Ohshima et al., 1996; Danforth et al., 2000; Luebke and Hallen, 2001) also adapt object models to the gaze direction: the desired level of detail for an object is determined by its distance from the gaze point. Swafford et al. (2016) developed another object-space technique that varies tessellation levels between the foveal and peripheral regions. A higher tessellation factor is used for tiles that fall within the foveal area after screen-space projection, and a lower one for tiles that fall in the peripheral region. Linear interpolation determines the tessellation level for tiles that fall between the foveal and peripheral areas. Their results show a performance gain over rendering the entire scene at uniform tessellation. Along similar lines, Zheng et al. (2018) propose tessellating according to the visual sensitivity at a given eccentricity. They first remove all imperceptible polygons and then dynamically tessellate the model. Polygons whose screen-space edge length is below the user's minimum perceptible length are not subdivided further. For perceptible polygons, the tessellation level is given by the ratio of the polygon's screen-space edge length to the minimum perceptible length.
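The tessellation rule of Zheng et al. (2018) described above can be sketched as follows. The linear minimum-angle-of-resolution model in `min_perceptible_length`, its constants, the display density `px_per_deg`, and the cap `max_level` are all hypothetical assumptions; only the structure (no subdivision below the perceptibility threshold, otherwise a level given by the ratio of edge length to threshold, rounded up here) follows the description.

```python
import math


def min_perceptible_length(ecc_deg, mar0_deg=1.0 / 60.0, slope=0.022,
                           px_per_deg=20.0):
    """Smallest screen-space edge length (pixels) assumed perceptible.

    Hypothetical model: the minimum angle of resolution grows linearly with
    eccentricity and is converted to pixels with an assumed display density.
    """
    return (mar0_deg + slope * ecc_deg) * px_per_deg


def tessellation_level(edge_len_px, ecc_deg, max_level=64):
    """Tessellation level for one polygon edge at a given eccentricity."""
    threshold = min_perceptible_length(ecc_deg)
    if edge_len_px < threshold:
        return 1  # imperceptible: do not subdivide further
    # Ratio of edge length to minimum perceptible length, rounded up, capped.
    return min(max_level, math.ceil(edge_len_px / threshold))
```

With these assumed constants, a 40-pixel edge receives the maximum level of 64 at the gaze point but only level 3 at 30° eccentricity, and a 5-pixel edge at 30° is not subdivided.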
3.3.3. Optics-based Methods
Optics-based foveated displays optically manipulate light from one or more displays (Wu and Kim, 2020; Yoo et al., 2020; Kim et al., 2019; Lee et al., 2020; Yoshida et al., 1995a; Rolland et al., 1998). Such methods add optical components to the device design, which increases the overall form factor and manufacturing cost of the device.
The foveation effect can be achieved at the optical level using either a single display (Yoo et al., 2020) or two separate displays (Kim et al., 2019; Lee et al., 2020; Yoshida et al., 1995a; Tan et al., 2018b).
Yoo et al. (2020) propose a single-display near-eye foveated system that is based on temporal polarization multiplexing and provides two operating modes, whereas Kim et al. (2019) and Lee et al. (2020) use two displays: one with a narrow FOV for the foveal region and one with a wide FOV for the peripheral region.
The concept of double displays dates back to the 1990s. Double-display systems optically combine the light from two display modules, each with a different resolution and field of view. Yoshida et al. (1995a) propose a high-resolution inset HMD, which optically superimposes a high-resolution image on a low-resolution, wide-FOV image. The part of the scene around the gaze point is generated at high resolution. The high-resolution inset is optically duplicated into a grid of non-overlapping copies that fills the entire display (Yoshida et al., 1995a; Rolland et al., 1998). An LCD array selects one element of this grid based on the gaze direction, and the selected inset is transported to the eyes through optical fibers (Yoshida et al., 1995b). The background image covers the entire FOV and is generated at a lower resolution. The foveal inset is then combined with the background image using opto-electrical components alone. However, at the time they were proposed, these devices were too heavy and expensive. With recent developments in optical elements, Kim et al. (2019) and Lee et al. (2020) present small-form-factor prototypes of foveated near-eye devices that use two displays. Kim et al. (2019) use a micro-OLED display for the foveal display system and a projector-based Maxwellian-view display for the peripheral display system. The two displays are mechanically steered towards the gaze direction using information from the built-in eye tracker. Lee et al. (2020) use a holographic near-eye display for the foveal display and polarization-optics elements for the peripheral display system. The foveal display is steered using a micro-electro-mechanical system (MEMS) mirror and a switchable Pancharatnam–Berry phase (S-PBP) grating module. The peripheral display is designed to provide a sufficiently large eye box, which avoids the need to steer it. The light from the two displays is then combined optically before reaching the eye.
Foveation is also increasingly used in holographic displays (Yaraş et al., 2010; Maimone et al., 2017), a highly promising route to true 3D displays with powerful features such as variable focal control, optical aberration correction, and non-conflicting depth cues. There, it reduces the cost of computing computer-generated holograms (CGH) (Chang et al., 2020; Ju and Park, 2019; Wei and Sakamoto, 2019). For example, Chang et al. (2020) use foveated rendering to reduce the computational cost of CGH for 3D volumes with a layered approach. The volume is divided into several parallel layers, and the image at each layer is segmented into a foveal region and a peripheral region. The foveal region is kept at high resolution for the hologram calculation, whereas the peripheral region is downsampled to a lower resolution. As a result, images reconstructed from the hologram have higher quality in the foveal region than in the peripheral region.
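A minimal sketch of the per-layer foveation step is shown below, assuming a single grayscale layer stored as a NumPy array, a circular foveal region, and block averaging for downsampling. These are our assumptions: the hologram propagation itself is omitted, and Chang et al. (2020) may segment and resample the layers differently.

```python
import numpy as np


def foveate_layer(layer, gaze, fovea_radius, factor=4):
    """Keep a circular foveal region of one layer at full resolution and
    replace the rest with a block-averaged (downsampled) version.

    Returns the composite image (periphery upsampled back for inspection)
    and the number of samples passed on to the hologram computation.
    Assumes both layer dimensions are divisible by `factor`.
    """
    h, w = layer.shape
    low = layer.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    periphery = np.repeat(np.repeat(low, factor, axis=0), factor, axis=1)
    yy, xx = np.mgrid[0:h, 0:w]
    fovea = np.hypot(xx - gaze[0], yy - gaze[1]) <= fovea_radius
    composite = np.where(fovea, layer, periphery)
    # Simplification: the whole low-resolution grid is counted, including
    # cells that overlap the fovea.
    n_samples = int(fovea.sum()) + low.size
    return composite, n_samples
```

In a layered pipeline, this function would be applied to every depth layer before the hologram computation, so the sample count per layer drops from the full pixel count to roughly the foveal area plus the coarse grid.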
3.4. Gaze Point Information
This section discusses how foveated rendering systems use gaze information from eye tracking (Figure 11) and classifies the methods according to the configuration in which they were originally proposed.
3.4.1. Static Foveated Rendering
Static foveated rendering techniques do not rely on eye-tracking devices; instead, they assume that the user focuses on specific positions in the image. The likelihood of the user glancing at a particular area determines the acuity level required in that region. Many early works used static foveation, either assuming that the user looks at the center of the screen (Funkhouser and Séquin, 1993) or using a content-based model of visual attention (Horvitz and Lengyel, 1997; Yee et al., 2001). Before eye tracking became practical for interactive rendering, the head direction was considered a good approximation of the gaze direction, and most early foveated rendering work relies on this approximation, placing the gaze point at the center of the screen.
Fixed foveated rendering (FFR) (Oculus, 2018), developed by Oculus, is one example of static foveated rendering; it assumes that most users look towards the center of the display. FFR uses a tile-based approach: the image is subdivided into tiles whose acuity levels depend on their distance from the center. The highest-resolution tiles lie at the center, while tiles towards the edges are rendered at lower resolutions. Each tile is rendered at a uniform spatial resolution. Static foveated rendering requires users to keep their focus within predefined fixed zones, mostly the center of the screen. A significant advantage of static techniques is that they do not depend on the quality of eye-tracking devices and are compatible with all existing devices. Moreover, static approaches can benefit from recent improvements in visual attention and saliency models (Itti et al., 1998; Lee et al., 2005; Kim et al., 2010), in which case the area around the salient regions is always rendered at a higher resolution. However, static techniques require a larger high-resolution region than dynamic approaches, so their average rendering cost can be higher.
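Fixed foveation can be sketched as a static map from tile to resolution level that depends only on the distance from the screen center. In the Python sketch below, the two ring radii and the three-level scheme are hypothetical and do not reproduce the actual levels used by Oculus FFR; because no gaze input is used, the map can be computed once and reused every frame.

```python
import math


def ffr_tile_levels(tiles_x, tiles_y, ring_radii=(0.35, 0.7)):
    """Fixed foveation: resolution level per tile from its distance to the
    screen center (no eye tracking). Level 0 is full resolution; higher
    levels are coarser. Distances are normalized so that the corners of the
    screen lie at 1.0. The ring radii are hypothetical.
    """
    levels = []
    for ty in range(tiles_y):
        row = []
        for tx in range(tiles_x):
            # Tile center in [-1, 1] normalized coordinates.
            x = (tx + 0.5) / tiles_x * 2.0 - 1.0
            y = (ty + 0.5) / tiles_y * 2.0 - 1.0
            d = math.hypot(x, y) / math.sqrt(2.0)
            row.append(sum(d > r for r in ring_radii))  # rings crossed
        levels.append(row)
    return levels
```

For an 8-by-8 tile grid, the central tiles get level 0, tiles midway along the edges get level 1, and corner tiles get level 2.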
3.4.2. Dynamic Foveated Rendering
Traditional foveated rendering is predicated on eye-tracking technology (Reddy, 2001). Eye-tracking devices can pinpoint the user's gaze position in real time. Research on head-mounted eye trackers dates back to the 1960s (Clay et al., 2019). However, recent advances in computational power and massively parallel image processing, together with low-cost, small-sized hardware, have made real-time eye tracking possible in VR and AR. By actively tracking the user's gaze, a system can render a small region around the gaze point at a higher resolution and the peripheral region at a lower resolution. Most foveated rendering approaches (Guenter et al., 2012; Patney et al., 2016; Swafford et al., 2016; Tursun et al., 2019; Stengel et al., 2016) are dynamic and depend on the performance of the eye-tracking device to obtain the gaze point. In addition to hardware eye trackers, there has been growing interest in predicting future gaze positions with deep-learning methods (Xu et al., 2018; Hu et al., 2020). As the accuracy and efficiency of eye-tracking solutions improve further, dynamic foveated rendering is likely to greatly enhance rendering performance and quality.
Aliasing is prominent in foveated rendering as the peripheral region is rendered at a lower resolution. The aliasing artifacts stemming from foveated rendering can be either spatial or temporal.
Spatial aliasing artifacts occur when the level of detail of the virtual world exceeds the rendered resolution (Hoffman et al., 2018a). They arise at the object level and appear regardless of motion. Aliasing during scene motion generates temporal artifacts, which are aligned to the output display's pixel grid rather than to the virtual-world coordinates (Hoffman et al., 2018a). As the user's view changes, these artifacts cause flickering and scintillation, disrupting the user's experience of the virtual world.
Studies show that the human visual system is sensitive to temporal aliasing artifacts even at higher eccentricities (Hoffman et al., 2018a, b). McKee and Nakayama (1984) show that motion acuity falls dramatically from ° to ° and then drops more subtly from ° to ° . The peripheral motion sensitivity of the human visual system poses a critical challenge to all foveated rendering techniques. Patney et al. (2016) show that minimizing temporal aliasing in foveated rendering is necessary for an effective user experience.
3.5.1. Avoid Artifacts
One approach to ensure temporal and spatial stability is to design foveated rendering techniques that prevent artifacts from occurring. These techniques generally identify the feature that might cause a detectable artifact and remove that feature or nullify its effect in the rendering pipeline (Bastani et al., 2017). The foveal and peripheral regions are constantly updated according to the gaze direction. In general, the position of the rendered pixels in the high-resolution foveal region and the low-resolution peripheral region aligns with the display-coordinate system. The low-resolution rendered pixels are later upsampled to match the native display resolution. As the user’s head rotates, the value of each rendered pixel changes irrespective of the scene content, causing the pixel color to shift and flicker. This generates the time-varying aliasing artifacts in the upsampled display pixels, thus disrupting the user experience. Turner et al. (2018) observe that proper angular alignment of the rendered frustums minimizes such frame-to-frame flickering effects with head rotation. They present a phase-aligned foveated rendering system that aligns the low-resolution region to the world-coordinate system rather than the display-coordinate system as shown in Figure 12. The low-resolution region, which is sampled in the world space, is then upsampled and re-projected to align with the display-coordinate system. The artifacts in the peripheral region now move along with the content of the virtual scene, and so the temporal variation of the artifacts is minimized as the head rotates. Phase-aligned foveated rendering significantly reduces the perceivability of motion artifacts in the peripheral region, making aggressive foveation possible. However, spatial aliasing and a distinct boundary between the foveal and peripheral regions are noticeable in the rendered frame. 
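The core idea of phase alignment can be illustrated in one dimension (yaw only). In the following hypothetical sketch, low-resolution samples are either placed at fixed offsets from the head direction (display-aligned) or snapped to a grid of directions fixed in world space (world-aligned). It is a simplification for illustration, not the implementation of Turner et al. (2018), which works with rotated frustums and reprojection to the display.

```python
import math


def sample_directions(head_yaw, fov, spacing, world_aligned):
    """World-space yaw angles (radians) of the low-resolution samples.

    Display-aligned: samples sit at fixed offsets from the head direction,
    so every sampled world direction shifts when the head rotates.
    World-aligned (phase-aligned): samples are snapped to a grid fixed in
    world space, so rotation only changes which grid directions are visible.
    """
    lo, hi = head_yaw - fov / 2.0, head_yaw + fov / 2.0
    if world_aligned:
        first = math.ceil(lo / spacing)
        last = math.floor(hi / spacing)
        return [k * spacing for k in range(first, last + 1)]
    n = int(round(fov / spacing))
    return [lo + (i + 0.5) * spacing for i in range(n)]
```

Rotating the head by a third of the sample spacing changes every sampled world direction in the display-aligned case, whereas in the world-aligned case the sampled directions stay the same apart from grid cells entering or leaving the field of view; this is why peripheral artifacts then move with the scene content instead of flickering.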
The discernibility of the foveal-peripheral boundary can be reduced by ensuring a smooth transition in resolution from the foveal to the peripheral region (Bastani et al., 2017). Franke et al. (2021) and Mueller et al. (2021) exploit temporal coherence to reduce the number of shading operations. Franke et al. (2021) reproject the peripheral region of the previous frame into the current frame and render only the foveal pixels and the less coherent peripheral pixels. To avoid reprojection artifacts, they use the exact world-space positions of the fragments rather than depth values.
3.5.2. Mitigate Artifacts
Most foveated rendering approaches reduce the perceptibility of artifacts in a post-rendering phase. The interpolation technique used to upsample the low-resolution peripheral region also affects how detectable the artifacts in that region are. For example, simple nearest-neighbor interpolation magnifies temporally unstable artifacts (Hoffman et al., 2018b), although it can retain contrast better than bilinear interpolation.
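The following NumPy sketch contrasts the two interpolation schemes for upsampling a low-resolution grayscale peripheral image by an integer `factor`. The bilinear version uses center-aligned sampling with edge clamping, which is one common convention among several; the function names are our own.

```python
import numpy as np


def upsample_nearest(img, factor):
    """Replicate each low-resolution pixel into a factor-by-factor block."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)


def upsample_bilinear(img, factor):
    """Bilinear upsampling with pixel centers aligned and edges clamped."""
    h, w = img.shape
    # Map output pixel centers back to input pixel coordinates.
    ys = np.clip((np.arange(h * factor) + 0.5) / factor - 0.5, 0, h - 1)
    xs = np.clip((np.arange(w * factor) + 0.5) / factor - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy
```

Upsampling the 2-by-2 checkerboard [[0, 1], [1, 0]] by a factor of 2 shows the difference: nearest-neighbor keeps hard 0/1 blocks and full contrast, whereas bilinear interpolation introduces intermediate values such as 0.375 at output pixel (1, 1).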
Guenter et al. (2012) combine multi-sample anti-aliasing (MSAA), temporal reverse reprojection, and temporal jitter of the spatial sampling grid to reduce spatial and temporal artifacts. MSAA reduces spatial aliasing along silhouette edges by increasing the effective sampling resolution there, while temporal reverse reprojection with frame jitter reduces aliasing throughout the image. Temporal anti-aliasing strategies are better at reducing the aliasing that arises from sampling highly specular materials at a low sampling rate. Reprojection-based temporal anti-aliasing is a common technique that uses the previous frame's information to mitigate temporal artifacts: the current frame is reprojected onto the previous frame, and the information from the two frames is combined to compute the final color. However, because details from the previous frame may persist beyond the point where they can be correctly reprojected, temporal anti-aliasing may produce high-frequency artifacts or ghosting (Patney et al., 2016). Karis (2014) reduces such ghosting by constraining samples from previous frames to be consistent with samples in the current frame. Meng et al. (2018) apply temporal anti-aliasing with Halton sampling in screen space as a post-process to mitigate the artifacts in the peripheral region.
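A simplified single-channel version of reprojection-based temporal anti-aliasing with neighborhood clamping, the kind of history constraint described by Karis (2014), is sketched below. It assumes the history buffer has already been reprojected into the current frame and uses a plain 3-by-3 min/max clamp with a hypothetical blend weight `alpha`; production implementations typically add color-space conversion, sub-pixel jitter, and motion-vector handling.

```python
import numpy as np


def taa_resolve(current, history, alpha=0.1):
    """One temporal anti-aliasing step with 3x3 neighborhood clamping.

    `history` is assumed to be already reprojected into the current frame.
    History values are clamped to the min/max of the current frame's 3x3
    neighborhood before blending, which limits ghosting.
    """
    padded = np.pad(current, 1, mode="edge")
    h, w = current.shape
    # Stack the 9 shifted copies that form each pixel's 3x3 neighborhood.
    neigh = np.stack([padded[dy:dy + h, dx:dx + w]
                      for dy in range(3) for dx in range(3)])
    clamped = np.clip(history, neigh.min(axis=0), neigh.max(axis=0))
    return alpha * current + (1.0 - alpha) * clamped
```

If a stale history value of 0.9 (for example, from an object that has since moved away) meets a uniform current neighborhood of 0.2, the clamp pulls it to 0.2 before blending, so no ghost remains.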
Patney et al. (2016) use pre-filters in addition to temporal anti-aliasing methods to mitigate temporal artifacts. They introduce variance sampling as a post-process image-enhancement technique. Around each pixel, a variable-size axis-aligned bounding box is constructed based on local color distribution. The back-projected and resampled information from the previous frame that lies within the defined bounding box is integrated with the current frame information. The authors show that by explicitly using the local color information, the ghosting artifacts are reduced. This method is further extended to account for saccadic eye movements. Due to the saccadic movement of the eye, the previous frame’s peripheral region can become the foveal region for the current frame. Such a situation requires information from several frames to converge to the level of detail required for the foveal region. Patney et al. (2016) accelerate the rate of convergence based on the shading rates in the two consecutive frames. This reduces the blurring artifacts caused by eye saccades.
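The variance-based bounding box can be sketched by replacing the min/max clamp of plain temporal anti-aliasing with a box of half-width `gamma` standard deviations around the local mean. The 3-by-3 neighborhood, the single channel, and the parameters `alpha` and `gamma` are assumptions for illustration; Patney et al. (2016) may compute the statistics differently.

```python
import numpy as np


def variance_clip_box(current, gamma=1.0):
    """Per-pixel bounds mu - gamma*sigma and mu + gamma*sigma computed from
    each pixel's 3x3 neighborhood in the current frame."""
    padded = np.pad(current, 1, mode="edge")
    h, w = current.shape
    neigh = np.stack([padded[dy:dy + h, dx:dx + w]
                      for dy in range(3) for dx in range(3)])
    mu = neigh.mean(axis=0)
    sigma = neigh.std(axis=0)
    return mu - gamma * sigma, mu + gamma * sigma


def resolve_with_variance_box(current, history, alpha=0.1, gamma=1.0):
    """Blend the reprojected history, clipped to the variance box, with the
    current frame."""
    lower, upper = variance_clip_box(current, gamma)
    return alpha * current + (1.0 - alpha) * np.clip(history, lower, upper)
```

Because the box size follows the local color distribution, flat regions reject history that disagrees with the current frame, while textured regions tolerate more variation.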
Weier et al. (2018) use depth-of-field information to design a post-process anti-aliasing technique. The depth-of-field effect that occurs when focusing on objects acts as a low-pass filter that can suppress high-frequency artifacts in the peripheral region. Temporal anti-aliasing is first applied to the foveated image to obtain temporally smooth samples, which are then used to reconstruct the full image with push-pull interpolation. The remaining aliasing artifacts are mitigated with a low-pass depth-of-field filter. The approximate depth of the focal point is estimated by a gaze-depth estimator based on a support vector machine that takes various depth measurements as input. From the estimated gaze depth and the estimation inaccuracy, two circles of confusion are computed, which determine the depth of field. A multi-layer filter is designed based on this depth-of-field model: the image is filtered in layers, which are then blended with different weights to give the final output.
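For reference, the standard thin-lens approximation of the circle-of-confusion diameter is sketched below. It is not necessarily the exact model of Weier et al. (2018); the default focal length (about 17 mm) and pupil diameter (4 mm) are rough, assumed values for the human eye, and all distances are in meters.

```python
def circle_of_confusion(d, d_focus, focal_len=0.017, aperture=0.004):
    """Thin-lens circle-of-confusion diameter on the image plane (meters).

    d: distance of the scene point; d_focus: fixated (gaze) depth; both in
    meters and larger than focal_len. Default focal length and pupil
    diameter are rough, assumed values for a human eye.
    """
    return aperture * focal_len * abs(d - d_focus) / (d * (d_focus - focal_len))
```

The function returns zero at the fixated depth and grows as a point moves away from it. One hypothetical way to account for gaze-depth uncertainty is to evaluate it for two focus estimates, for instance the estimated depth shifted by its expected error in each direction.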
In addition to the above methods, denoising approaches based on deep neural networks have been developed to reduce the visibility of artifacts. DeepFovea (Kaplanyan et al., 2019) takes inspiration from the internal model of the human visual system, which infers content from sparse peripheral information. A manifold representing the distribution of samples from a large collection of natural videos is learned using generative adversarial networks (Goodfellow et al., 2014). Given a sparse foveated input video stream, DeepFovea reconstructs the peripheral region by finding the natural video on the learned manifold that is closest to the sparse input. Because the reconstructed output is close to realistic videos, temporal aliasing artifacts are minimized.
|Funkhouser and Séquin (1993)||✓||-||✓||✓||-||-||-||✓||-||-||✓||-||-||-|
|Luebke and Hallen (2001)||✓||✓||✓||-||-||-||-||✓||-||-||-||✓||-||-|
|Murphy and Duchowski (2001)||✓||-||-||-||-||-||-||✓||-||-||-||✓||-||-|
|Guenter et al. (2012)||✓||-||-||✓||-||-||-||-||✓||-||-||✓||-||✓|
|Fujita and Harada (2014)||✓||-||-||✓||-||-||-||-||✓||-||✓||-||-||✓|
|Stengel et al. (2016)||✓||✓||✓||✓||-||-||-||-||✓||-||-||✓||✓||✓|
|Swafford et al. (2016)||✓||-||-||✓||-||-||-||✓||✓||-||-||✓||-||-|
|Patney et al. (2016)||✓||✓||-||-||✓||-||-||-||✓||-||-||✓||-||✓|
|Weier et al. (2017)||✓||-||-||-||✓||-||-||-||✓||-||-||✓||✓||✓|
|Turner et al. (2018)||✓||-||-||-||-||-||-||-||✓||-||-||✓||✓||-|
|Zheng et al. (2018)||✓||-||-||-||-||-||✓||✓||-||-||-||✓||-||-|
|Meng et al. (2018)||✓||-||-||-||-||✓||-||-||✓||-||-||✓||-||✓|
|Weier et al. (2018)||✓||-||-||-||✓||-||-||-||✓||-||-||✓||-||✓|
|Koskela et al. (2019)||✓||-||-||-||-||✓||-||-||✓||-||-||✓||✓||✓|
|Tursun et al. (2019)||✓||✓||-||✓||-||-||-||-||✓||-||-||✓||✓||-|
|Kim et al. (2019)||✓||-||-||-||-||-||-||-||-||✓||-||✓||-||✓|
|Lee et al. (2020)||✓||-||-||-||-||-||-||-||-||✓||-||✓||-||✓|
|Yoo et al. (2020)||✓||-||-||-||-||-||-||-||-||✓||-||✓||-||✓|
One can measure the performance of a foveated system across two dimensions – the quality of the rendered image and the average rendering time. However, quality evaluation is a challenging task as a ground-truth reference foveated image does not exist. The peripheral region is sensitive to flickering artifacts, spatial and temporal motion, contrast, and salient features (Patney et al., 2016) and it is not easy to quantify these perceptual metrics. Further research is required to understand the important characteristics of human perception. Also, the optical components of the HMD limit the perceived resolution. The spatial resolution of optical systems is commonly expressed in terms of the modulation transfer function (MTF), which gives the normalized frequency response of the system. Beams et al. (2020) show that the maximum resolution provided by the optical system depends on the eccentricity in the FOV. This dependence on the optical components also has to be considered while evaluating the foveated rendering system. This section gives an overview of the common qualitative and quantitative measures used to evaluate a foveated system.
4.1. User study-based evaluation
Empirical user studies are the most common and reliable way to evaluate foveated rendering systems. These user studies can be broadly classified into three categories (Hsu et al., 2017):
single-stimulus absolute category rating or pairwise comparisons;
double-stimulus quality comparison;
slider methods or adjustment methods.
Single-Stimulus Absolute Category Rating: The participant is shown a single stimulus, a foveated image, on the screen for a certain time and is then asked to judge its quality on a scale ranging from acceptable to unacceptable.
Double-Stimulus Quality Comparison: This type of experiment requires a full-resolution image. Two renderings of the scene, a full-resolution (unfoveated) image and a foveated image, are shown to the participant in random order. The participant is then asked to compare the two images and identify the perceptually superior one.
Slider Methods: These methods are most helpful for finding the optimal parameters of a foveated rendering system. There are two types of slider experiments: descending and ascending. In descending methods, the participant is presented with a full-resolution, high-quality image and a slider that controls the foveation parameters. The participant gradually moves the slider and is asked to identify the point at which image quality starts to degrade. The ascending method is similar, but the participant is first shown a reference full-resolution image for a short period. Image quality then increases gradually from a fully foveated image as the slider moves, and the participant is asked to identify the point at which the image becomes perceptually similar to the reference.
User studies also enable personalized calibration of the optimal foveation parameters, since acceptance of foveated imagery may vary from person to person. However, user studies are generally very time-consuming and expensive. Perception- and foveation-based evaluation metrics are therefore needed to help design and compare methods, so that user studies can focus on the methods that these initial metrics identify as most likely to succeed.
4.2. Computational metric-based evaluation
Traditional image quality metrics are designed for images of uniform quality. The Structural Similarity Index (SSIM) and the HDR Visual Difference Predictor (HDR-VDP2) are well-known perceptually informed metrics (Swafford et al., 2016). SSIM (Wang et al., 2004) uses the structural distortion in an image as an estimate of the perceived visual distortion, based on the assumption that the HVS is highly adapted to extracting structural information from the visual field. SSIM uses the means, variances, and covariance of the original high-resolution image and the foveated image to measure their structural differences. HDR-VDP2 (Mantiuk et al., 2011) takes contrast sensitivity measurements into account. Both metrics assume uniform sensitivity across the visual field.
Several recent papers (Swafford et al., 2016; Guo et al., 2018; Tsai and Liu, 2014; Wang et al., 2001; You et al., 2014; Rimac-Drlje et al., 2010; Sanghoon Lee et al., 2002) extend uniform image quality metrics to account for foveation. Rimac et al. (2011) combine SSIM with a foveation-based sensitivity function to account for non-uniformly perceived structural differences across the visual field. Swafford et al. (2016) extend the HDR-VDP2 metric with a contrast-sensitivity degradation factor. Tsai and Liu (2014) develop a window-based metric that assigns weights based on the distance from predicted salient regions. Guo et al. (2018) vary the spatial resolution with eccentricity when computing image quality. Recently, Mantiuk et al. (2021) developed a metric named FovVideoVDP that incorporates several factors, such as the variation of spatial resolution with eccentricity, spatial and temporal contrast sensitivity, and scene content. Most of these metrics were developed for foveated video quality assessment and rely on a full-resolution reference image or video. Complete, perceptually informed, and computationally simple metrics for evaluating the perceived quality of wide-FOV images and videos, and the level of immersion, remain relatively unexplored.
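The sketch below combines the SSIM formula (Wang et al., 2004) with a foveation-weighted pooling step, in the spirit of the extensions above. For brevity, SSIM is computed on non-overlapping 8-by-8 blocks instead of the Gaussian sliding window of the original metric, and the eccentricity weight 1 / (1 + e / e2), the display density `px_per_deg`, and `e2` are hypothetical choices rather than those of any cited metric.

```python
import numpy as np


def ssim_block(x, y, data_range=1.0):
    """SSIM of two equally sized patches (single window, no Gaussian weights)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))


def foveated_ssim(ref, test, gaze, px_per_deg=20.0, e2=2.3, block=8):
    """Pool block-wise SSIM with weights that fall off with eccentricity.

    The weight model 1 / (1 + e / e2) and all parameters are hypothetical.
    """
    h, w = ref.shape
    scores, weights = [], []
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            s = ssim_block(ref[by:by + block, bx:bx + block],
                           test[by:by + block, bx:bx + block])
            cx, cy = bx + block / 2.0, by + block / 2.0
            ecc = np.hypot(cx - gaze[0], cy - gaze[1]) / px_per_deg  # degrees
            scores.append(s)
            weights.append(1.0 / (1.0 + ecc / e2))
    return float(np.average(scores, weights=weights))
```

Adding the same noise pattern to a block at the gaze point lowers the pooled score more than adding it to a corner block, because foveal blocks carry larger weights.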
In terms of rendering time, we provide the results reported for each of the discussed foveated rendering methods. These results, however, should not be interpreted as a relative comparison among the methods unless explicitly noted otherwise, because the overall rendering time varies with the complexity of the scene, the shading models, and the hardware used. Guenter et al. (2012) report an overall speedup of – in rendering time and – in the number of pixels rendered. Multi-rate shading (He et al., 2014) provides an average speedup of about – and the multi-resolution approach (Reed, 2015) provides a speedup of – . Zheng et al. (2018) reduce the rendering time by a factor of compared to Guenter et al. (2012). Patney et al. (2016) report a greater reduction in shading cost than Guenter et al. (2012). Stengel et al. (2016) report a speedup of in overall rendering time while reducing the shading time by . Swafford et al. (2016) halve the rendering time by carefully reducing the resolution in the periphery. The kernel foveated rendering method by Meng et al. (2018) achieves – speedup by using log-polar transformations. By estimating the maximum resolution required for each pixel, Tursun et al. (2019) report speedup in overall rendering time. The ray-tracing-based foveated rendering method by Weier et al. (2016) provides an average speedup of , reducing the number of sampled pixels by 79% on benchmark scenes. By using depth-of-field filtering as a post-processing step, Weier et al. (2018) reduce the number of sampled pixels by 69% without any visible artifacts.
In conclusion, we have described the limitations of the human visual system and how foveated rendering leverages these limitations to reduce the computational resources required for real-time rendering. We compare and discuss various foveated rendering works through a taxonomy based on the key factors in the rendering pipeline. Table 2 summarizes this taxonomy. Our first classification differentiates methods by the visual factors considered while developing the rendering system: content-independent (acuity) and content-dependent (image contrast, saliency). As Table 2 shows, all the papers consider acuity degradation with eccentricity in their rendering systems. In addition, a few works (Tursun et al., 2019; Patney et al., 2016; Stengel et al., 2016) leverage content-dependent information to further improve the foveation effect. The second category differentiates the methods by their underlying analytical models of peripheral degradation. While all the methods reduce the sampling rate as the distance from the focus point increases, the rate of reduction differs across methods, and the best-suited model depends on the target task and scene complexity. We next differentiate methods according to the space in which the foveation effect is applied: object space, screen space, or optics space. With the evolution of the graphics pipeline and dedicated hardware, there is an increasing trend to incorporate foveation in the screen or image space. More recently, several works have modified the display at the lens level to incorporate foveation. These optics-based foveated methods mitigate the hardware difficulty of maintaining pixel densities high enough to match the angular resolution of the HVS. We also separate the methods by the amount of gaze information used in their originally proposed solution.
With improvements in eye tracking, more and more works develop and study foveated systems that use gaze-point information. As eye trackers improve in both accuracy and latency, dynamic foveated rendering appears to be the way to achieve real-time rendering with high perceptual quality. We also discussed anti-aliasing approaches that complement foveated rendering, including deep-learning-based approaches that produce realistic, anti-aliased, foveated output images from sparse inputs. Note that the categories are not mutually exclusive: a solution can combine several options within a category, for example by considering several visual factors or by combining screen-space and optics-space foveation.
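Dynamic foveated rendering needs, for every pixel, its angular distance (eccentricity) from the current gaze point. The Python sketch below computes this under a simple pinhole-projection assumption with square pixels, and uses it to count how many pixels fall inside a chosen foveal radius. It ignores HMD lens distortion, and the function names and the 5° radius in the example are hypothetical choices made for illustration.

```python
import math


def focal_length_px(width_px: int, hfov_deg: float) -> float:
    """Pinhole focal length in pixels for a given horizontal field of view."""
    return (width_px / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)


def eccentricity_deg(u: float, v: float, gaze_u: float, gaze_v: float,
                     width_px: int, height_px: int, hfov_deg: float) -> float:
    """Angle in degrees between the view ray through (u, v) and the gaze ray.

    Coordinates are continuous pixel positions with the origin at the top-left
    corner; the optical axis passes through the image center, and pixels are
    assumed square, so one focal length serves both axes.
    """
    f = focal_length_px(width_px, hfov_deg)
    cx, cy = width_px / 2.0, height_px / 2.0
    ray = (u - cx, v - cy, f)
    gaze = (gaze_u - cx, gaze_v - cy, f)
    dot = sum(a * b for a, b in zip(ray, gaze))
    cos_angle = dot / (math.hypot(*ray) * math.hypot(*gaze))
    # Clamp against rounding error just outside [-1, 1] before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def foveal_pixel_fraction(width_px: int, height_px: int, hfov_deg: float,
                          gaze_u: float, gaze_v: float, radius_deg: float) -> float:
    """Fraction of pixel centers whose eccentricity is at most radius_deg."""
    inside = 0
    for y in range(height_px):
        for x in range(width_px):
            e = eccentricity_deg(x + 0.5, y + 0.5, gaze_u, gaze_v,
                                 width_px, height_px, hfov_deg)
            if e <= radius_deg:
                inside += 1
    return inside / (width_px * height_px)


if __name__ == "__main__":
    print(foveal_pixel_fraction(200, 200, 100.0, 100.0, 100.0, 5.0))
```

For a 200 × 200 image with a 100° horizontal FOV and the gaze at the center, the 5° region covers well under 1% of the pixels. A real renderer would evaluate the same angle per fragment on the GPU rather than in a Python loop.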
We also provided a brief overview of the human visual system. However, how the brain processes the information in the images formed on the retina is not yet fully understood. As the neuroscience community advances our understanding of visual perception and cognition, its findings can guide our efforts toward rendering that is minimal yet sufficient, without compromising human perception and cognition.
The ultimate goal is to build head-mounted displays that simultaneously provide a wide FOV and a resolution that matches human perceptual capabilities. Visual acuity is limited not only by the density of photoreceptor cells but also by diffraction and by optical aberrations of the eye (Smith, 1997). Other factors, such as refractive error, illumination, contrast, and the retinal location being stimulated, also affect the perceived image quality, and studies show that photoreceptor density decreases with age (Panda-Jonas et al., 1995). These factors should be considered when studying the effectiveness of a foveated rendering system. Interestingly, foveation characteristics differ even between the two eyes. Meng et al. (2020b) leverage eye dominance, the tendency of the human visual system to prefer visual stimuli from one eye over the other. This suggests that the non-dominant eye can tolerate greater foveation than the dominant eye without a perceptible difference. Similarly, the fact that only rods are active in night vision could be exploited to render scenes under dim lighting differently from those under bright lighting, at a much lower computational cost. Another possible research direction is to leverage the color discrimination ability of the human eye, which decreases with eccentricity across the visual field. Finally, most current efforts target personalized single-user foveated rendering systems; foveated rendering for shared multi-user displays is another exciting area for future research.
A few recent works (Tan et al., 2018a; Kim et al., 2019; Lee et al., 2019, 2018) apply the concept of foveation to near-eye foveated displays. Kim et al. (2019) use a deep-learning-based gaze tracker (NVGaze; Kim et al., 2019) for pupil-center estimation to move the small foveal display accordingly. Such systems could also benefit from predicting the future gaze direction from the current frame and gaze information. More accurate saliency models that exploit long-term memory and the content of the current frame could guide the renderer toward the regions that need to be rendered at high resolution. As deep-learning models continue to improve results on many tasks, they can also be employed for gaze and saliency estimation.
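Predicting where the user will look next can help hide eye-tracker and rendering latency. The simplest baseline is constant-velocity extrapolation from the two most recent gaze samples, sketched below in Python. It is not one of the CNN-based predictors (such as DGaze) cited in this survey, and the `GazeSample` type and `predict_gaze` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class GazeSample:
    t_ms: float   # timestamp in milliseconds
    x_deg: float  # horizontal gaze angle in degrees
    y_deg: float  # vertical gaze angle in degrees


def predict_gaze(history: Sequence[GazeSample], lookahead_ms: float) -> Tuple[float, float]:
    """Extrapolate the gaze point lookahead_ms past the newest sample.

    Assumes constant angular velocity, estimated from the two newest samples.
    """
    if len(history) < 2:
        raise ValueError("need at least two gaze samples")
    prev, last = history[-2], history[-1]
    dt = last.t_ms - prev.t_ms
    if dt <= 0:
        raise ValueError("timestamps must be strictly increasing")
    vx = (last.x_deg - prev.x_deg) / dt  # degrees per millisecond
    vy = (last.y_deg - prev.y_deg) / dt
    return last.x_deg + vx * lookahead_ms, last.y_deg + vy * lookahead_ms


if __name__ == "__main__":
    samples = [GazeSample(0.0, 0.0, 0.0), GazeSample(10.0, 1.0, -0.5)]
    print(predict_gaze(samples, 20.0))
```

Constant-velocity extrapolation is a reasonable approximation during smooth pursuit but fails at saccade onsets, where gaze velocity changes abruptly; this is one motivation for learned predictors that also use scene content.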
Although there has been significant progress in foveated rendering, realizing its full potential remains challenging, especially in complex scenes with photorealistic lighting. Further, there is an urgent need for a uniform evaluation metric that uses perceptual criteria to quantify the quality of foveated rendering and allows different foveation methods to be compared. We believe that combined developments in software and hardware will make foveated rendering a standard component of commodity headsets.
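As one illustration of how a perceptual criterion can enter an evaluation metric, the Python sketch below weights each pixel's squared error by a relative-acuity term derived from a linear MAR falloff, so that errors near the gaze point count more than equal errors in the periphery. This is a hypothetical, uncalibrated example in the spirit of foveated error metrics, not the formulation of any published metric; the acuity parameters `omega0` and `m` are placeholders.

```python
from typing import List

Image = List[List[float]]


def acuity_weight(ecc_deg: float, omega0: float = 1.0 / 60.0, m: float = 0.022) -> float:
    """Relative acuity omega0 / (m * e + omega0) under an illustrative linear MAR model."""
    return omega0 / (m * ecc_deg + omega0)


def eccentricity_weighted_mse(reference: Image, test: Image, ecc_map: Image) -> float:
    """Acuity-weighted mean squared error between two equally sized grayscale images.

    ecc_map holds each pixel's eccentricity (degrees) from the gaze point.
    """
    if not (len(reference) == len(test) == len(ecc_map)):
        raise ValueError("images must have the same number of rows")
    num = 0.0
    den = 0.0
    for ref_row, test_row, ecc_row in zip(reference, test, ecc_map):
        if not (len(ref_row) == len(test_row) == len(ecc_row)):
            raise ValueError("images must have the same number of columns")
        for r, t, e in zip(ref_row, test_row, ecc_row):
            w = acuity_weight(e)
            num += w * (r - t) ** 2
            den += w
    if den == 0.0:
        raise ValueError("images must not be empty")
    return num / den
```

Perceptually calibrated predictors such as HDR-VDP-2 and FovVideoVDP model contrast sensitivity and luminance adaptation, and in FovVideoVDP's case also eccentricity and temporal effects, which is closer to what a uniform evaluation metric would ultimately require.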
Acknowledgements. This work has been supported in part by the NSF Grants 15-64212, 18-23321, and the State of Maryland’s MPower initiative. The authors would like to thank Alexandar Rowden, Anshul Shah, Sweta Agrawal and Yogesh Balaji for their helpful suggestions.
- Isoeccentric locations are not equivalent: the extent of the vertical meridian asymmetry. Vision Research 52 (1), pp. 70–78. Cited by: §3.1.1.
- The contrast sensitivity function for detection and resolution of blue-on-yellow gratings in foveal and peripheral vision. Ophthalmic and Physiological Optics 22 (5), pp. 420–426. Cited by: §3.1.2.
- A chart demonstrating variations in acuity with retinal position. Vision Research 14 (7), pp. 589–592. Cited by: §3.1.1.
- Speeding up the log-polar transform with inexpensive parallel hardware: graphics units and multi-core architectures. Journal of Real-Time Image Processing 10 (3), pp. 533–550. Cited by: §3.2.3.
- An introduction to the log-polar mapping. In Proceedings II Workshop on Cybernetic Vision, pp. 139–144. Cited by: §3.2.3.
- Asymmetries in visual acuity around the visual field. Journal of Vision 21 (1), pp. 2–2. Cited by: §3.1.1.
- Formula for the contrast sensitivity of the human eye. In Image Quality and System Performance, Y. Miyake and D. R. Rasmussen (Eds.), Vol. 5294, pp. 231–238. Cited by: §3.1.2.
- Foveated pipeline for AR/VR head-mounted displays. Information Display 33, pp. 14–19 and 35. Cited by: §3.2.4, §3.3.1, §3.5.1.
- Angular dependence of the spatial resolution in virtual reality displays. In 2020 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), pp. 836–841. Cited by: §4.
- Chapter 2 - The Human Visual System. In Communicating Pictures, D. R. Bull (Ed.), pp. 17–61. Cited by: §3.1.2.
- Foveated holographic near-eye 3D display. Opt. Express 28 (2), pp. 1345–1356. Cited by: §3.3.3.
- Eye tracking in virtual reality. Journal of Eye Movement Research 12. Cited by: §3.4.2.
- A platform for gaze-contingent virtual environments. In Smart Graphics (Papers from the 2000 AAAI Spring Symposium, Technical Report SS-00-04), pp. 66–70. Cited by: §3.3.2.
- The triangle processor and normal vector shader: a VLSI system for high performance graphics. SIGGRAPH Comput. Graph. 22 (4), pp. 21–30. Cited by: §3.2.3.
- Time-warped foveated rendering for virtual reality headsets. Computer Graphics Forum 40 (1), pp. 110–123. Cited by: §3.5.1.
- Perceptual rasterization for head-mounted display image synthesis. ACM Trans. Graph. 38 (4), pp. 97:1–97:14. Cited by: §3.2.4.
- Foveated real-time ray tracing for virtual reality headset. Light Transport Entertainment Research. Cited by: §3.1.1, §3.2.4, Table 2.
- Adaptive display algorithm for interactive frame rates during visualization of complex virtual environments. In Proceedings of the 20th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’93, New York, NY, USA, pp. 247–254. Cited by: §3.4.1, Table 2.
- Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, Cambridge, MA, USA, pp. 2672–2680. Cited by: §3.5.2.
- Foveated 3D graphics. ACM Trans. Graph. 31 (6), pp. 164:1–164:10. Cited by: §1, Figure 6, Figure 9, §3.1.1, §3.2.1, §3.2.2, §3.2.4, §3.3.1, §3.4.2, §3.5.2, Table 2, §4.2.
- Perceptual quality assessment of immersive images considering peripheral vision impact. arXiv preprint arXiv:1802.09065. Cited by: §4.2.
- Color perception in the intermediate periphery of the visual field. Journal of Vision 9 (4), pp. 26–26. Cited by: §3, §3.1.2.
- Extending the graphics pipeline with adaptive, multi-rate shading. ACM Trans. Graph. 33 (4), pp. 142:1–142:12. Cited by: §3.1.1, §3.1.2, §3.2.1, §3.3.1, §4.2.
- 65-2: sensitivity to peripheral artifacts in VR display systems. SID Symposium Digest of Technical Papers 49 (1), pp. 858–861. Cited by: §3.5.
- Limits of peripheral acuity and implications for VR system design. Journal of the Society for Information Display 26 (8), pp. 483–495. Cited by: §3.5.2, §3.5.
- Perception, attention, and resources: a decision-theoretic approach to graphics rendering. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, UAI’97, San Francisco, CA, USA, pp. 238–249. Cited by: §3.4.1.
- Is foveated rendering perceivable in virtual reality?: exploring the efficiency and consistency of quality assessment methods. In Proceedings of the 25th ACM International Conference on Multimedia, MM ’17, New York, NY, USA, pp. 55–63. Cited by: §4.1.
- DGaze: CNN-Based Gaze Prediction in Dynamic Scenes. IEEE Transactions on Visualization and Computer Graphics 26 (5), pp. 1902–1911. Cited by: §3.4.2.
- A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (11), pp. 1254–1259. Cited by: §3.4.1.
- Foveated computer-generated hologram and its progressive update using triangular mesh scene model for near-eye displays. Opt. Express 27 (17), pp. 23725–23738. Cited by: §3.3.3.
- DeepFovea: neural reconstruction for foveated rendering and video compression using learned natural video statistics. In ACM SIGGRAPH 2019 Talks, SIGGRAPH ’19, New York, NY, USA, pp. 58:1–58:2. Cited by: §3.5.2.
- High-quality temporal supersampling. Advances in Real-Time Rendering in Games, SIGGRAPH Courses 1, pp. 1–55. Cited by: §3.5.2.
- Foveated AR: dynamically-foveated augmented reality display. ACM Trans. Graph. 38 (4), pp. 99:1–99:15. Cited by: §3.3.3, Table 2, §5.
- NVGaze: an anatomically-informed dataset for low-latency, near-eye gaze estimation. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI ’19, New York, NY, USA, pp. 550:1–550:12. Cited by: §5.
- Retinal ganglion cells: diversity of cell types and clinical relevance. Frontiers in Neurology 12, pp. 635. Cited by: §2.
- Mesh saliency and human eye fixations. ACM Trans. Appl. Percept. 7 (2). Cited by: §3.1.2, §3.4.1.
- Blockwise multi-order feature regression for real-time path tracing reconstruction. ACM Transactions on Graphics (TOG) 38 (5). Cited by: §3.2.3.
- Foveated Real-Time Path Tracing in Visual-Polar Space. In Eurographics Symposium on Rendering - DL-only and Industry Track, T. Boubekeur and P. Sen (Eds.). Cited by: §3.1.1, §3.2.3, Table 2.
- Mesh saliency. In ACM SIGGRAPH 2005 Papers, SIGGRAPH ’05, New York, NY, USA, pp. 659–666. Cited by: §3.1.2, §3.4.1.
- Enhanced see-through near-eye display using time-division multiplexing of a Maxwellian-view and holographic display. Opt. Express 27 (2), pp. 689–701. Cited by: §5.
- Foveated retinal optimization for see-through near-eye multi-layer displays. IEEE Access 6, pp. 2170–2180. Cited by: §5.
- Foveated near-eye display for mixed reality using liquid crystal photonics. Scientific Reports 10 (1), pp. 1–11. Cited by: §3.3.3, Table 2.
- Adler’s Physiology of the Eye E-Book: expert consult-online and print. Elsevier Health Sciences. Cited by: §1, §3.1.2.
- Gaze-directed volume rendering. SIGGRAPH Comput. Graph. 24 (2), pp. 217–223. Cited by: §3.3.2.
- A log-rectilinear transformation for foveated 360-degree video streaming. IEEE Transactions on Visualization and Computer Graphics. Cited by: §3.2.4.
- Spatialchromatic foveation for gaze contingent displays. In Proceedings of the 2008 Symposium on Eye Tracking Research & Applications, ETRA ’08, New York, NY, USA, pp. 139–142. Cited by: §3.1.2.
- The contributions of central and peripheral vision to scene-gist recognition with a 180° visual field. Journal of Vision 19 (5), pp. 15–15. Cited by: §2.
- Real-time ray tracing using NVIDIA OptiX. In Eurographics 2010 - Short Papers, H. P. A. Lensch and S. Seipel (Eds.). Cited by: §3.
- Perceptually driven simplification for interactive rendering. In Proceedings of the 12th Eurographics Conference on Rendering, EGWR’01, Aire-la-Ville, Switzerland, pp. 223–234. Cited by: §3.3.2, Table 2.
- Holographic near-eye displays for virtual and augmented reality. ACM Trans. Graph. 36 (4). Cited by: §3.3.3.
- FovVideoVDP: a visible difference predictor for wide field-of-view video. ACM Trans. Graph. 40 (4). Cited by: §4.2.
- HDR-VDP-2: a calibrated visual metric for visibility and quality predictions in all luminance conditions. ACM Trans. Graph. 30 (4), pp. 40:1–40:14. Cited by: §4.2.
- The detection of motion in the peripheral visual field. Vision Research 24 (1), pp. 25–32. Cited by: §3.5.
- 3D-Kernel Foveated Rendering for Light Fields. IEEE Transactions on Visualization and Computer Graphics, pp. 1–1. Cited by: §1.
- Eye-dominance-guided foveated rendering. IEEE Transactions on Visualization and Computer Graphics, pp. 1–1. Cited by: §5.
- Kernel foveated rendering. Proc. ACM Comput. Graph. Interact. Tech. 1 (1), pp. 5:1–5:20. Cited by: Figure 8, Figure 9, §3.1.1, §3.2.3, §3.2.4, §3.3.1, §3.5.2, Table 2, §4.2.
- An integrative view of foveated rendering. Computers & Graphics 102, pp. 474–501. Cited by: §1.
- Temporally adaptive shading reuse for real-time rendering and virtual reality. ACM Trans. Graph. 40 (2). Cited by: §3.5.1.
- Gaze-contingent level of detail rendering. EuroGraphics 2001. Cited by: §1, §3.3.2, Table 2.
- Oculus Go: fixed foveated rendering. Cited by: §3.4.1.
- Gaze-directed adaptive rendering for interacting with virtual space. In Proceedings of the IEEE 1996 Virtual Reality Annual International Symposium, pp. 103–110. Cited by: §3.3.2.
- Retinal photoreceptor density decreases with age. Ophthalmology 102 (12), pp. 1853–1859. Cited by: §5.
- Perceptually-based foveated virtual reality. In ACM SIGGRAPH 2016 Emerging Technologies, SIGGRAPH ’16, New York, NY, USA, pp. 17:1–17:2. Cited by: §3.1.2, §5.
- Towards foveated rendering for gaze-tracked virtual reality. ACM Trans. Graph. 35 (6), pp. 179:1–179:12. Cited by: Figure 1, §1, Figure 5, §3.1.2, §3.2.1, §3.3.1, §3.4.2, §3.5.2, §3.5, Table 2, §3, §4.2, §4.
- The contrast sensitivity gradient across the human visual field: with emphasis on the low spatial frequency range. Vision Research 29 (9), pp. 1133–1151. Cited by: §3.1.2.
- Neuroscience. 2nd edition. Sunderland (MA): Sinauer Associates. Cited by: §2, §3.
- Specification and evaluation of level of detail selection criteria. Virtual Reality 3 (2), pp. 132–143. Cited by: §3.2.4.
- Perceptually optimized 3D graphics. IEEE Comput. Graph. Appl. 21 (5), pp. 68–75. Cited by: Figure 9, §3.2.4, §3.4.2, Table 2.
- NVIDIA multi resolution rendering. Cited by: §4.2.
- Foveated mean squared error: a novel video quality metric. Multimedia Tools Appl. 49 (3), pp. 425–445. Cited by: §4.2.
- Foveation-based content adaptive structural similarity index. In 2011 18th International Conference on Systems, Signals and Image Processing, pp. 1–4. Cited by: §4.2.
- High-resolution inset head-mounted display. Appl. Opt. 37 (19), pp. 4183–4193. Cited by: §3.3.3.
- Capabilities and limitations of peripheral vision. Annual Review of Vision Science 2 (1), pp. 437–457. Note: PMID: 28532349. Cited by: §3.
- Foveated video quality assessment. IEEE Transactions on Multimedia 4 (1), pp. 129–132. Cited by: §4.2.
- The eye and visual optical instruments. Cambridge University Press, Cambridge. Cited by: §5.
- Mesh saliency via spectral processing. ACM Trans. Graph. 33 (1). Cited by: §3.1.2.
- Toward standardized classification of foveated displays. IEEE Transactions on Visualization and Computer Graphics 26 (5), pp. 2126–2134. Cited by: §1, §3.2.1.
- Foveated displays: toward classification of the emerging field. In ACM SIGGRAPH 2019 Talks, SIGGRAPH ’19, New York, NY, USA, pp. 57:1–57:2. Cited by: §3.1.1.
- Adaptive image-space sampling for gaze-contingent real-time rendering. Comput. Graph. Forum 35 (4), pp. 129–139. Cited by: §3.1.2, §3.2.1, §3.4.2, Table 2, §4.2, §5.
- Perceptually-guided foveation for light field displays. ACM Trans. Graph. 36 (6), pp. 192:1–192:13. Cited by: §1.
- Investigating perception time in the far peripheral vision for virtual and augmented reality. In Proceedings of the 15th ACM Symposium on Applied Perception, SAP ’18, New York, NY, USA, pp. 13:1–13:8. Cited by: §3.
- User, metric, and computational evaluation of foveated rendering methods. In Proceedings of the ACM Symposium on Applied Perception, SAP ’16, New York, NY, USA, pp. 7–14. Cited by: §3.1.1, §3.2.1, §3.3.1, §3.3.2, §3.4.2, Table 2, §4.2.
- Foveated imaging for near-eye displays. Opt. Express 26 (19), pp. 25076–25085. Cited by: §3.3.3, §5.
- Handbook of retinal screening in diabetes: diagnosis and management. Wiley. Cited by: Figure 3, §2.
- Characterization of spatial aliasing and contrast sensitivity in peripheral vision. Vision Research 36 (2), pp. 249–258. Cited by: §3.1.2.
- Foveation-based image quality assessment. In 2014 IEEE Visual Communications and Image Processing Conference, pp. 25–28. Cited by: §4.2.
- Phase-aligned foveated rendering for virtual reality headsets. In 2018 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), Los Alamitos, CA, USA, pp. 1–2. Cited by: Figure 12, §3.5.1, Table 2.
- Luminance-contrast-aware foveated rendering. ACM Trans. Graph. 38 (4), pp. 98:1–98:14. Cited by: §3.1.1, §3.1.2, §3.4.2, Table 2, §4.2, §5.
- College physics. OpenStax. Cited by: Figure 2.
- Coarse pixel shading. In Proceedings of High Performance Graphics, HPG ’14, Goslar, Germany, pp. 9–18. Cited by: §3.1.1, §3.2.1, §3.3.1.
- Multi-res shading. Note: [Online; accessed 6-December-2021]. Cited by: §3.1.1.
- Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13 (4), pp. 600–612. Cited by: §4.2.
- Foveated wavelet image quality index. In Applications of Digital Image Processing XXIV, A. G. Tescher (Ed.), Vol. 4472, pp. 42–52. Cited by: §4.2.
- Retinal ganglion cell density and cortical magnification factor in the primate. Vision Research 30 (11), pp. 1897–1911. Note: Optics Physiology and Vision. Cited by: §2.
- Fast calculation method with foveated rendering for computer-generated holograms using an angle-changeable ray-tracing method. Appl. Opt. 58 (5), pp. A258–A266. Cited by: §3.3.3.
- Perception-driven accelerated rendering. Comput. Graph. Forum 36 (2), pp. 611–643. Cited by: §2, Table 2.
- Foveated Depth-of-Field Filtering in Head-Mounted Displays. ACM Trans. Appl. Percept. 15 (4), pp. 26:1–26:14. Cited by: §3.5.2, Table 2, §4.2.
- Foveated real-time ray tracing for head-mounted displays. Comput. Graph. Forum 35 (7), pp. 289–298. Cited by: Figure 7, Figure 9, §3.1.1, §3.2.2, §3.2.4, §4.2.
- Visual sensory units and the minimal angle of resolution. American Journal of Ophthalmology 46 (1, Part 2), pp. 102–113. Cited by: §3.2.1.
- Prescription AR: a fully-customized prescription-embedded augmented reality display. Opt. Express 28 (5), pp. 6225–6241. Cited by: §3.3.3.
- Gaze prediction in dynamic 360° immersive videos. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §3.4.2.
- State of the art in holographic displays: a survey. J. Display Technol. 6 (10), pp. 443–454. Cited by: §3.3.3.
- Rectangular mapping-based foveated rendering. In 2022 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), pp. 756–764. Cited by: §3.1.1.
- Spatiotemporal sensitivity and visual attention for efficient rendering of dynamic environments. ACM Trans. Graph. 20 (1), pp. 39–65. Cited by: §3.4.1.
- Saliency preservation in low-resolution grayscale images. In Computer Vision – ECCV 2018, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss (Eds.), Cham, pp. 237–254. Cited by: §3.1.2.
- Foveated display system based on a doublet geometric phase lens. Opt. Express 28 (16), pp. 23690–23702. Cited by: §3.3.3, Table 2.
- Design and applications of a high-resolution insert head-mounted-display. In Proceedings Virtual Reality Annual International Symposium ’95, pp. 84–93. Cited by: §3.3.3.
- Optical design and analysis of a head-mounted display with a high-resolution insert. In Novel Optical Systems Design and Optimization, J. M. Sasian (Ed.), Vol. 2537, pp. 71–82. Cited by: §3.3.3.
- Attention driven foveated video quality assessment. IEEE Transactions on Image Processing 23 (1), pp. 200–213. Cited by: §4.2.
- Perceptual model optimized efficient foveated rendering. In Proceedings of the 24th ACM Symposium on Virtual Reality Software and Technology, VRST ’18, New York, NY, USA, pp. 109:1–109:2. Cited by: §3.2.4, §3.3.2, Table 2, §4.2.