Massively Parallel Ray Tracing Algorithm Using GPU

by Yutong Qin, et al.

Ray tracing is a technique for generating an image by tracing the path of light through pixels in an image plane and simulating the effects of high-quality global illumination, at a heavy computational cost. Because of this high computational complexity, it cannot meet the requirements of real-time rendering. The emergence of many-core architectures makes it possible to significantly reduce the running time of the ray tracing algorithm by exploiting their powerful floating-point computation. In this paper, we present a new GPU implementation and optimization of ray tracing that accelerates the rendering process.




I Introduction

Photorealistic rendering is a rendering process that reproduces the real-world effects of reflection and shadowing. Unlike the real-time rendering pipeline, it must achieve a level of realism at which the rendered image is hard to tell apart from reality, so realistic illumination and materials require complicated and accurate simulation. Physically based rendering can achieve photorealistic results, but its huge computational cost prevents photorealistic images from being generated in real time. The two approaches make opposite trade-offs: pipeline (rasterization) rendering sacrifices realism for high speed and real-time performance, whereas physically based rendering gives up speed and real-time performance to dramatically enhance realism. Because of these properties, ray tracing has been widely applied in film, advertising, animation, and other visual industries. Ray tracing differs from rasterization, the technique widely used in interactive computer graphics: based on the principles of physical optics, it can simulate light propagation in the real world and calculate the distribution of radiation. Because simulating light propagation is so computationally demanding, rendering a single image usually takes from tens of minutes to several hours, so producing high-quality realistic images generally requires specialized high-performance equipment. Before GPU computing emerged, ray tracing was always a very time-consuming task.

In recent years, with the emergence of parallel computing on GPU architectures, many researchers have become interested in exploiting the GPU's powerful floating-point capability to improve the efficiency of ray tracing, helped by its low entry threshold. Unlike a CPU, a GPU is generally composed of hundreds to thousands of stream processors. A many-core architecture is split into a large number of much smaller cores, each of which is an in-order, heavily multithreaded, single-instruction-issue processor that shares its control logic and instruction cache with other cores. Data-intensive applications can therefore easily harness the potential power of GPUs. Ray tracing involves a large amount of computation, such as traversal, loops, and intersection tests, and all of it can be decomposed into independent subtasks that execute in parallel. It is therefore natural to expect ray tracing to benefit substantially from GPU architectures.

In modern software, program sections often exhibit a rich amount of data parallelism, a property that allows many arithmetic operations to be performed on program data structures simultaneously. CUDA devices accelerate the execution of such applications by exploiting this data parallelism. Besides CUDA, other tools, including languages, libraries, and compiler directives, are also used. For example, OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, digital signal processors (DSPs), and other processors. Given OpenCL's flexibility, portability, and versatility, we used it to optimize and accelerate the ray tracing algorithm.

II The Problem

Since the vast majority of ray tracing applications today run on CPU architectures, their efficiency is directly tied to the Cycles Per Instruction (CPI) and the clock rate. CPI is determined by the Instruction Set Architecture (ISA) and its implementation. As Moore's law approaches its limits, CPU manufacturers have gradually reached the ceiling of clock frequency. Thus, serial programs cannot fundamentally improve the efficiency of ray tracing, and even multi-core CPU architectures have not yet delivered a dramatic leap forward.
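The dependence on CPI and clock rate can be made explicit with the standard CPU performance equation. This is a textbook relation rather than one stated in this paper, and the numbers in the worked line are illustrative only, not measurements:

```latex
T_{\mathrm{CPU}} = \frac{N_{\mathrm{instr}} \times \mathrm{CPI}}{f_{\mathrm{clock}}}

% Illustrative values: N_instr = 10^{10}, CPI = 1, f_clock = 2.2 GHz
T_{\mathrm{CPU}} = \frac{10^{10} \times 1}{2.2 \times 10^{9}\,\mathrm{s}^{-1}} \approx 4.55\,\mathrm{s}
```

With the clock rate capped, a serial program can shorten $T_{\mathrm{CPU}}$ only by executing fewer instructions or lowering CPI, which is why acceleration structures and parallel execution are the remaining levers.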

To address these problems, researchers have designed many acceleration techniques for ray tracing, including space partitioning, bounding volumes, and spatial sorting. Because these methods exclude objects and lights that are not involved in a given ray's computation, the optimized scene greatly reduces the time overhead of ray tracing. However, every optimization method has limitations to some degree; for example, the efficiency of space partitioning is generally limited in densely populated scenes.
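Bounding volumes pay off because testing a ray against a box is far cheaper than testing it against the geometry inside. As an illustration (not the authors' code), the common "slab" test for an axis-aligned bounding box is sketched below in C, the language OpenCL kernels are based on; the types `vec3` and `aabb` and the helper names are hypothetical.

```c
#include <math.h>     /* INFINITY, for callers passing an unbounded t_max */
#include <stdbool.h>

/* Hypothetical minimal types for this sketch. */
typedef struct { double x, y, z; } vec3;
typedef struct { vec3 min, max; } aabb;

/* Per-component reciprocal of a ray direction; a zero component becomes an
   IEEE infinity, which the slab test below tolerates. */
static vec3 reciprocal(vec3 d)
{
    vec3 r = { 1.0 / d.x, 1.0 / d.y, 1.0 / d.z };
    return r;
}

/* Slab test: true if origin + t * dir enters the box for some t in [t_min, t_max].
   inv_dir = reciprocal(dir). An origin lying exactly on a slab plane of an
   axis-parallel ray is a degenerate case this sketch does not handle. */
static bool ray_hits_aabb(vec3 origin, vec3 inv_dir, aabb box, double t_min, double t_max)
{
    const double o[3]  = { origin.x, origin.y, origin.z };
    const double id[3] = { inv_dir.x, inv_dir.y, inv_dir.z };
    const double lo[3] = { box.min.x, box.min.y, box.min.z };
    const double hi[3] = { box.max.x, box.max.y, box.max.z };

    for (int axis = 0; axis < 3; ++axis) {
        double t0 = (lo[axis] - o[axis]) * id[axis];
        double t1 = (hi[axis] - o[axis]) * id[axis];
        if (t0 > t1) { double tmp = t0; t0 = t1; t1 = tmp; }
        /* Shrink the parametric interval to its overlap with this slab. */
        if (t0 > t_min) t_min = t0;
        if (t1 < t_max) t_max = t1;
        if (t_max < t_min) return false;   /* interval became empty: miss */
    }
    return true;
}
```

A scene object is only tested for an exact intersection when the ray passes this cheap test for the object's box.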

On the other hand, there is hardware designed specifically for ray tracing, for example the light processing unit developed at Stanford. Such dedicated hardware has poor generality, however, so only a few people can use it. Another solution is distributed computing on a cluster: the problem is split into independent subproblems that are mapped onto different computer nodes. The cost of this approach is significant, and it is extremely hard to guarantee load balancing.

It is becoming increasingly common to use a general-purpose graphics processing unit as a modified form of stream processor. This turns the massive computational power of a modern graphics accelerator's shader pipeline into general-purpose computing power. GPUs can be used for many types of embarrassingly parallel tasks, including ray tracing. They are generally suited to high-throughput computations that exhibit data parallelism, which lets them exploit the wide-vector SIMD architecture of the GPU.

In general, a GPU can launch tens of thousands of lightweight threads that execute the same kernel function simultaneously. With this feature, independent lightweight threads can replace the multi-level loop iterations of ray tracing, yielding a massively parallel ray tracing algorithm. The GPU can therefore greatly improve the efficiency of ray tracing.
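To make the loop-to-thread mapping concrete, here is a hypothetical C sketch of how the nested pixel and sample loops collapse into one flat index space, as with a one-dimensional OpenCL NDRange in which each work item would read its index from `get_global_id(0)`. The `work_item` struct, the function names, and the `samples_per_pixel` parameter are assumptions; 640 × 480 is the resolution used in Section V.

```c
#include <stddef.h>

/* Coordinates one work item is responsible for (hypothetical layout). */
typedef struct { int x, y, sample; } work_item;

/* Total number of independent work items to launch: one per (pixel, sample). */
static size_t global_size(int width, int height, int samples_per_pixel)
{
    return (size_t)width * (size_t)height * (size_t)samples_per_pixel;
}

/* Recover (x, y, sample) from a flat index. On the GPU, gid would come from
   get_global_id(0); here the caller supplies it. Samples of one pixel are
   adjacent, then pixels run along each row, then rows follow each other. */
static work_item decompose_gid(size_t gid, int width, int samples_per_pixel)
{
    work_item w;
    w.sample = (int)(gid % (size_t)samples_per_pixel);
    size_t pixel = gid / (size_t)samples_per_pixel;
    w.x = (int)(pixel % (size_t)width);
    w.y = (int)(pixel / (size_t)width);
    return w;
}
```

Because every index maps to exactly one (pixel, sample) pair, the three serial loops disappear: each thread handles one pair and no thread depends on another.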

III Ray Tracing

In computer graphics, ray tracing is a technique for generating an image by tracing the path of light through pixels in an image plane and simulating the effects of its encounters with virtual objects. If a ray intersects an object, the color value of the corresponding point in the image plane can be calculated according to radiosity theory from parameters such as the material, the normal vector at the intersection point, and the light distribution. More specifically, the critical step in obtaining the color value at a point is computing the radiance leaving that point in the direction opposite to the cast ray.

Fig. 1: In the radiosity model, $\omega_i$ and $\omega_o$ represent the directions of incident light and emergent light.

As shown in Fig. 1, point $p$ is an arbitrary point on the object surface. It is the center of the ellipse drawn in Fig. 1, which represents the integration region of point $p$ (the hemisphere above it). By convention, $\omega_i$ points toward the light source or toward a sampling point on its surface, and $\omega_o$ is the direction that finally reaches the viewing plane. Denote the radiance reflected from $p$ along $\omega_o$ by $L_o(p, \omega_o)$. By Lambert's emission law, radiance is defined as follows:

$$ L(p, \omega) = \frac{d^2\Phi}{dA \, \cos\theta \, d\omega} \qquad (1) $$

As Eq. (1) shows, $d^2\Phi$ is the radiant power emitted from the surface element $dA$ into the solid angle $d\omega$, where $\theta$ is the angle between $\omega$ and the surface normal. The irradiance is given by:

$$ E(p) = \frac{d\Phi}{dA} \qquad (2) $$

Considering a single incident direction $\omega_i$, substituting Eq. (1) into Eq. (2) gives:

$$ dE(p, \omega_i) = L_i(p, \omega_i) \cos\theta_i \, d\omega_i \qquad (3) $$

In Eq. (3), the irradiance received at point $p$ is computed from the incident radiance $L_i$ at that point; the incident angle $\theta_i$ is the other factor that affects the result. For general materials, the reflected radiance is proportional to the received irradiance: the greater the irradiance arriving at a point, the greater the radiance reflected from it. Thus the following relation holds:

$$ dL_o(p, \omega_o) \propto dE(p, \omega_i) \qquad (4) $$

If the bidirectional reflectance distribution function (BRDF) $f_r$ is used as the scale factor, Eq. (4) becomes Eq. (5):

$$ dL_o(p, \omega_o) = f_r(p, \omega_i, \omega_o) \, dE(p, \omega_i) \qquad (5) $$

Substituting Eq. (3) into Eq. (5) then gives:

$$ dL_o(p, \omega_o) = f_r(p, \omega_i, \omega_o) \, L_i(p, \omega_i) \cos\theta_i \, d\omega_i \qquad (6) $$

If the surface is made of a self-illuminating material, the radiance leaving it includes, besides the reflected radiance, the radiance the surface emits by itself. Denote this emitted radiance by $L_e(p, \omega_o)$. Adding $L_e$ to Eq. (6) gives:

$$ L_o(p, \omega_o) = L_e(p, \omega_o) + f_r(p, \omega_i, \omega_o) \, L_i(p, \omega_i) \cos\theta_i \, d\omega_i \qquad (7) $$

As shown in Fig. 1, Eq. (7) accounts for only a single incident direction $\omega_i$. However, the irradiance at point $p$ cannot originate from a single direction: in reality, $p$ receives irradiance from all directions in the hemisphere $\Omega$ above it. Integrating Eq. (7) over the hemisphere gives the outgoing radiance:

$$ L_o(p, \omega_o) = L_e(p, \omega_o) + \int_{\Omega} f_r(p, \omega_i, \omega_o) \, L_i(p, \omega_i) \cos\theta_i \, d\omega_i \qquad (8) $$

Although Eq. (8) gives the total radiance leaving a point on an object's surface, it cannot be solved directly, for two reasons. First, Eq. (8) has fixed integration limits and the unknown radiance appears both outside and inside the integral (the incident radiance at $p$ is the outgoing radiance of some other surface point), so it is a Fredholm integral equation of the second kind. Second, a computer cannot exactly simulate irradiance from every direction in the hemisphere; even a global illumination model cannot trace all the light rays arriving at a point on an object's surface. The mathematical model of Eq. (8) must therefore be simplified: we recursively trace only a small number of indirectly reflected rays from the object's surface, with the recursion depth set by the number of light reflections. Most of the integral computation is thus concentrated on the radiosity from sampling points on the surfaces of the light sources, as shown in Fig. 2.
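The simplification above rests on estimating the integral in Eq. (8) from a finite number of samples. The C sketch below shows a Monte Carlo estimator under deliberately simple assumptions: a Lambertian BRDF ($f_r = \rho/\pi$) and constant incident radiance, so the exact answer $\rho L_i$ is known. It samples the hemisphere uniformly, which is a swapped-in illustration: the paper itself samples points on the light sources. The generator and function names are hypothetical.

```c
#include <stdint.h>

/* xorshift64* pseudo-random generator (state must be non-zero). */
static double next_uniform(uint64_t *state)
{
    *state ^= *state >> 12;
    *state ^= *state << 25;
    *state ^= *state >> 27;
    uint64_t r = *state * 0x2545F4914F6CDD1DULL;
    return (double)(r >> 11) * (1.0 / 9007199254740992.0);  /* top 53 bits -> [0, 1) */
}

/* Monte Carlo estimate of the integral term of Eq. (8) for a Lambertian surface
   (f_r = albedo / pi) lit by constant radiance L_i from every direction of the
   hemisphere; the exact value is albedo * L_i.
   Directions are drawn uniformly over the hemisphere (pdf = 1 / (2 pi)); under
   that distribution cos(theta) is itself uniform on [0, 1), so only cos(theta)
   needs to be sampled here. */
static double estimate_reflected_radiance(double albedo, double L_i, int n, uint64_t seed)
{
    const double pi = 3.14159265358979323846;
    const double f_r = albedo / pi;
    const double pdf = 1.0 / (2.0 * pi);
    uint64_t state = seed ? seed : 1;
    double sum = 0.0;
    for (int k = 0; k < n; ++k) {
        double cos_theta = next_uniform(&state);
        sum += f_r * L_i * cos_theta / pdf;   /* integrand divided by its pdf */
    }
    return n > 0 ? sum / n : 0.0;
}
```

The error of such an estimator shrinks roughly as $1/\sqrt{N}$, which is consistent with the progressive behavior reported in Section V, where the image becomes less noisy as more sampling points are accumulated.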

Fig. 2: In the local illumination model, every light source contributes radiosity to point $p$.

In Fig. 2, obtaining the radiance that leaves point $p$ toward the viewing plane requires computing the irradiance received at that point with Eq. (8). Point $p$ receives all of the radiosity from light source no. 1 but only part of that from light source no. 2. The integration must traverse all sampling points on the surfaces of both light sources and determine, one by one, whether the light is obstructed by objects. For example, the object in Fig. 2 blocks some of the radiosity from light source no. 2, and the blocked radiation makes no contribution to the lighting of point $p$. The radiosity received at $p$ is then integrated. This process is generally the most time-consuming part of ray tracing; its cost depends on the number of light sources and geometries, the complexity of the intersection tests, the number of sampling points on each light source's surface, and so on. If anti-aliasing is used, each pixel casts more rays, and the pixel takes the average of their colors. The pseudocode of local illumination ray tracing is as follows:

for each pixel in the image do
   for each primary ray cast through the pixel do
      find the nearest object hit by the ray
      if the ray hits an object at point p then
         for each light source do
            for each sampling point s on the light source do
               emit a shadow ray from p to s
               if the shadow ray is not blocked by any object then
                  calculate the irradiance from s using Eq. (8)
                  accumulate the received irradiance
               end if
            end for
         end for
      end if
      accumulate the color value of the ray
   end for
   take the average of the ray colors as the pixel color
end for
Algorithm 1 Local Illumination Ray Tracing

As shown in Algorithm 1, the multilevel nested loops exhibit a rich amount of data parallelism. The pseudocode only considers the radiosity that point p receives directly. In global illumination, besides the radiosity from light sources, p also receives radiosity reflected from other objects, so the program must be modified into a recursive version. However, executing it serially is inefficient.

IV Parallel Optimization

IV-A Parallel Ray Tracing

In the traditional global illumination model, when a single ray intersects an object in the scene, it produces secondary rays. Some of them are shadow rays, used to check the visibility of light sources; all the others are treated as a new generation of rays that propagate again (intersection test and radiosity calculation), as shown in Fig. 3.

Fig. 3: Rays produce radiosity on other objects through reflection and refraction.

Recursion is used to trace secondary rays until they reach the maximum recursion depth. Because secondary rays produce radiosity on other objects, global illumination is also called indirect illumination.

Since OpenCL kernels do not support recursion, the recursion must be transformed into iteration, with the number of loop iterations playing the role of the recursion depth. When a ray reaches a point on an object's surface, the shadow rays derived at the intersection only need to sample each light source once; traversing all the sampling points of every light source in one pass is unnecessary. Once every ray has sampled the light sources' surfaces just once, the rendering pass stops and the image is updated. The next pass immediately selects another sampling point, repeats the same work, and overlays the new color values onto the pixels. These repeated passes thus simulate the integration of radiosity over the sampling points on the light sources' surfaces.
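The recursion-to-iteration transformation can be illustrated on a simplified, hypothetical path in which each bounce contributes an emitted radiance and scales everything after it by a reflectance. In a real kernel these values would come from tracing the next ray inside the loop; here they are precomputed in an array so the two forms can be compared. Only the iterative form is legal in an OpenCL kernel; the `bounce` struct and function names are assumptions.

```c
/* Hypothetical per-bounce data: what the ray picks up at bounce k. */
typedef struct { double emitted; double reflectance; } bounce;

/* Recursive form: L(k) = emitted(k) + reflectance(k) * L(k + 1),
   with L = 0 once the maximum depth is reached. */
static double radiance_recursive(const bounce *path, int k, int max_depth)
{
    if (k >= max_depth)
        return 0.0;
    return path[k].emitted + path[k].reflectance * radiance_recursive(path, k + 1, max_depth);
}

/* Iterative form: the loop counter replaces the recursion depth, and a running
   throughput carries the product of the reflectances seen so far. */
static double radiance_iterative(const bounce *path, int max_depth)
{
    double radiance = 0.0, throughput = 1.0;
    for (int k = 0; k < max_depth; ++k) {
        radiance += throughput * path[k].emitted;
        throughput *= path[k].reflectance;
    }
    return radiance;
}
```

Expanding the recursion gives $e_0 + \rho_0 e_1 + \rho_0 \rho_1 e_2 + \dots$, which is exactly the sum the loop builds term by term, so the iterative form needs no call stack.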

As shown in Fig. 4, on the GPU each kernel thread traces a single ray and obtains that ray's final color value. After all threads have executed the kernel function once, the intermediate values are added into the pixels. Rendering an image therefore requires launching the same kernel function iteratively.
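The iterative launches amount to progressive accumulation. Assuming each pass contributes one new color sample per pixel and the displayed pixel is the mean of all samples so far (an assumption; the paper only says the new values are added onto the pixels), the update can be sketched in C as a running mean. The function name is hypothetical, and a single channel is shown; RGB would apply the same update per channel.

```c
#include <stddef.h>

/* Blend one rendering pass into the accumulated image.
   `pass` is 0 for the first launch; after the call, accum[i] is the mean of the
   pass + 1 samples received so far for pixel i. */
static void accumulate_pass(float *accum, const float *pass_color, size_t n_pixels, unsigned pass)
{
    const float w = 1.0f / (float)(pass + 1u);
    for (size_t i = 0; i < n_pixels; ++i)
        accum[i] += (pass_color[i] - accum[i]) * w;  /* incremental running mean */
}
```

The incremental form avoids storing a growing per-pixel sum, and the same loop body could run as a trivial per-pixel kernel after each rendering pass.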

Fig. 4: Overview of parallel ray tracing algorithm using GPU

IV-B GPU Kernel Function

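To simplify the programming model, this paper only studies the rendering of spheres. The implicit equation of a sphere with center $\mathbf{C}$ and radius $r$ can be written in vector form:

$$ (\mathbf{P} - \mathbf{C}) \cdot (\mathbf{P} - \mathbf{C}) = r^2 \qquad (9) $$

A ray with origin $\mathbf{O}$ and direction $\mathbf{D}$ is expressed by the linear (parametric) equation:

$$ \mathbf{P}(t) = \mathbf{O} + t\,\mathbf{D} \qquad (10) $$

Substituting Eq. (10) into Eq. (9) gives:

$$ (\mathbf{O} + t\,\mathbf{D} - \mathbf{C}) \cdot (\mathbf{O} + t\,\mathbf{D} - \mathbf{C}) = r^2 \qquad (11) $$

Eq. (11) is a quadratic equation in the unknown $t$, so it can be rewritten and solved as:

$$ a t^2 + b t + c = 0, \qquad t = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \qquad (12) $$

where $a = \mathbf{D} \cdot \mathbf{D}$, $b = 2\,\mathbf{D} \cdot (\mathbf{O} - \mathbf{C})$, and $c = (\mathbf{O} - \mathbf{C}) \cdot (\mathbf{O} - \mathbf{C}) - r^2$.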

Eq. (12) determines whether a single ray intersects the sphere: it does if the discriminant $b^2 - 4ac$ is non-negative, in which case the coordinates of the intersection can be calculated. To compute intersections more efficiently, we transform the equation into an OpenCL kernel function. Combined with Eq. (8), massively parallel integration achieves the goal of improving rendering efficiency.

V Results and Discussion

Tests were conducted on a system with an Intel Core i7-2720QM CPU running at 2.20 GHz and 4 GB of 1600 MHz DDR3 DRAM. The platform also had an ATI Radeon HD 6750M GPU. The scene file was provided by David Bucciarelli, and the scene resolution is 640 × 480.

Fig. 5: The first rendering took only 0.508 seconds to generate the image.

Fig. 5 shows the image generated with the local illumination model after a single rendering cycle. Since each parallel rendering pass selects only one sampling point, parts of the image contain a large amount of black noise. As more cycles complete, the sampling points cover most of the pixels in the scene and the image shows better rendering effects (see Fig. 6); the longer rendering continues, the more sampling points are rendered and the more accurate the image becomes.

Fig. 7 shows the image generated with the global illumination model for the same scene, with a recursion depth of 6; it took 20 seconds to generate. The experimental results show that GPU-based parallel ray tracing significantly improves the rendering results. Fig. 8 presents a comparative evaluation of ray tracing processing the same number of sampling points on two platforms: multi-core (Intel Core i7-2720QM CPU) and many-core (ATI Radeon HD 6750M GPU).

Fig. 6: After 6 seconds, the image showed better rendering effects.
Fig. 7: Overview of radiosity model,
Fig. 8: Overview of radiosity model,

VI Conclusion


This work was completely supported by the Department of Civil Engineering and the Department of Computer Science and Engineering, Jinjiang College, Sichuan University. All authors read and approved the final manuscript. We are deeply indebted to somebody for their encouragement and support.

