Saturday, August 16, 2008

Cuda real time ray tracing - 100 millions ray/s?


**Update** This post is somehow outdated. The code of Radius-Cuda may be found on the internet I think. Look at ompf.org or similar places. On my side, I lost the code. Sorry :-(

Ray Tracing!
There is a big buzz around these words nowadays and flamewars are everywhere. What I can however say is that I spent a long time writing ray tracers, Monte-Carlo renderers and: it is fun!

I recently decided to write a ray tracer using Cuda. There were two reasons for this. The first one is that I never wrote one line of Cuda. The second one is that ray tracing was said to be slow on G80 graphics cards due the heavy use of caches and stacks during the accelaration structure traversal generally required by ray tracers.


As usual, I first fetched all the articles I could find on the net, downloaded Cuda, installed it, runned some samples and began to code the thing. As I needed a minimalist ray tracing framework, I decided to drop my old ray tracer, yaCORT, and installed Radius, a small but pretty fast ray tracer developed by Thierry Berger-Perrin. As I had moreover already developed a Bounding Volume Hierarchy on it, it was perfect to inject a Cuda ray tracer with it.

I therefore decided to write a kd-tree traversal and first results were disappointing. Fortunately, with the help of Brian Budge (who posts as me, on the uber-excellent ray tracing forum ompf.org), I managed to get quite good performance.
  • Summary of the implementation
  1. a stack per ray: no need to synchronize rays using a lot of __syncthreads(). The cost is that the stack will be necessary on local memory (slow) but we do not need anymore to handle ray packets or things like that as done with shared stacks or SIMD ray tracing. The Cuda compiler handle itself the SIMD packing and the branches in the code.
  2. Everything is in textures to cache the access to the kd-tree nodes, the triangles and so on. This is much simpler and faster than coalesing accesses to global memory since no synchro are needed.
  3. KILL ALL UNIONS. Unions are bad even in C (and forbidden in C++ as soon as there is a : constructor). This command goes with this one: KILL ALL ARRAYS unless you put it on shared memory. The cuda compiler seems to put stack allocated array (stg like float my_array[4]) on local memory and accesses will just kill your app. It may work with indirection known at compile time but otherwise, this is just BAD.
  4. Use dirty switches. In my implemention, I need to perform circular permutations on the coordinates (due to the kd-tree traversal and the Baduel / Wald ray-tri intersection). I tested two approaches: an shared array strategy with a simple modulo computation where all the rays are put in shared memory, and a coding horror switch permutation (see the code). The second one was faster. The reason seems to be that in this case, a register allocation (allocating rays on the stack requires 6 more registers) is more efficient that a shared array allocation.
  • Performance
  1. The performance are quite good. I am currently developing on a laptop with the low end GeForce 8400GS but made some preliminary tests on a 8800 GT. With the last optimizations I made, I think that between 12 millions rays / s and 40 millions rays/s may be computed on a GeForce 8800GT on the demo I give in the code. I will test soon the code on a GTX280 and without being too optimistic, I think that 100 millions rays / sec may be generated on it. This may be one of the fastest ray tracers wrote on a single processor.
  2. The other good point is that the memory footprint is much smaller that a CPU ray tracer. Since the GPU is better at computing than at traversing large structures, a small kd-tree has the best performance. For 180,000 triangles, only 5MB are required for the kd-tree.
  • About Cuda
After the first deceptions due to my poor knowledge of the API and the compiler behaviour, I think that Cuda is a nice API, easy to use. G80 architectures are quite impressive and for a non-friendly application like ray tracing, speed is pretty good with no need to write one line of SIMD code. The bad point is that there is no way to have a real low-level access. After this nice experience, I am really eager to see what the upcoming Larrabee will be able to do with ray tracing. The good point is that Intel will certaintly provide both a high level API for fast prototyping (a la Cuda or the promising OpenCL) and a standard compiler / assembler since Larrabee cores are just x86s.

**Update** Source code is ... no more available on the previously posted link. I think you may find somewhere on ompf.org

8 comments:

tbp said...

How brutal. I like that.

Ported it right back to a Real OS, as it should.

bouliiii said...

Thanks. By the way, tuning the frame rate may be done easily by modifying the block sizes in the cuda file and tweaking the ld-tree construction by modifying the SAH parameters in kdlib.cc

tbp said...

Noted.
I had to do some (blind) surgery on the CUDA side but that's perhaps because you used a different version (CUDA 2.0 beta 2 here); of course i have no idea if it had any performance implication (i'd bet not but...).

The kd-tree construction stuff on GNA! is rather outdated and crummy; i have many variations on that theme, but would need to rethink them for that specific usage.

But then the beauty of what you did is in the KISS compliance. Hat's off.

makkes said...

@tbp: What about releasing your port? I would really like to see this one on my Linux box.

Thanks,
Max

BigJoe said...

what is ra2 file when importing "FairyForestF160.ra2" ?
I couldn't setup other scene due to that problem..

Alorwin said...

I can't get the source to compile under Visual Studio 2008's C++ Express

R2 said...

The source code is no longer available, can you please re-post?

rhushabh said...

Hello Bouliii,

The source code isnt available at the specified link. Could you please repost it? Thanks a ton!!