Sunday, August 31, 2008

Brute Force Tessellation and Quadratic Approximation of Subdivision Surfaces

**Update** This post is somewhat outdated and I have since lost the small demo. Sorry about that.

Hello there, next post!
This one has nothing to do with ray tracing; on the contrary, it is all about rasterization and tessellation.
First, some screen shots of the demo provided with this post:

The standard mesh with no tessellation (2,408 triangles)

A close-up on the tessellated surface in wireframe mode (10,000,000 triangles). About 200 million triangles per second are displayed on my low-end 8400GS (here the frame rate is lower because wireframe rendering is slow on Nvidia hardware)

A simple procedural displacement is applied to the tessellated mesh

This technique actually relies on several techniques developed by Tamy Boubekeur.
  • Patch Tessellation
The first thing to do is to tessellate the input mesh. To achieve this, we simply replace each triangle by a patch, i.e. a set of triangles expressed in barycentric coordinates. Then, instead of displaying each triangle, we replace each of them by an instantiated patch. In the demo, we therefore create two vertex streams: the first one contains the data of each triangle, and the second one simply contains the barycentric coordinates of the triangles of a patch. The instancing ability of Direct3D then allows us to repeat the patch over the surface of the object. For more details about the technique, one may refer to Tamy's Graphics Hardware '05 article here. Note, however, that this article does not use Direct3D instancing.
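As a concrete sketch, here is what building the shared barycentric patch stream could look like. `buildPatch` is a hypothetical helper, not the demo's actual code, assuming uniform tessellation of each triangle into depth² sub-triangles:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One vertex of the shared patch stream: barycentric coordinates only.
struct Bary { float u, v, w; };

// Hypothetical helper: build the barycentric coordinates of a triangle
// uniformly tessellated into depth*depth sub-triangles. This stream is
// uploaded once and instanced over every triangle of the mesh.
static std::vector<Bary> buildPatch(int depth) {
    std::vector<Bary> verts;
    const float s = 1.0f / depth;
    for (int i = 0; i < depth; ++i) {          // rows along v
        for (int j = 0; j < depth - i; ++j) {  // cells along u
            const float u = j * s, v = i * s;
            // "upward" sub-triangle of the cell
            verts.push_back({u,     v,     1 - u - v});
            verts.push_back({u + s, v,     1 - u - s - v});
            verts.push_back({u,     v + s, 1 - u - v - s});
            // "downward" sub-triangle (absent on the diagonal row)
            if (j + i < depth - 1) {
                verts.push_back({u + s, v,     1 - u - s - v});
                verts.push_back({u + s, v + s, 1 - u - s - v - s});
                verts.push_back({u,     v + s, 1 - u - v - s});
            }
        }
    }
    return verts;  // 3 * depth^2 vertices per patch
}
```

The vertex shader then fetches the instanced triangle's three corner attributes and blends them as `u*p0 + v*p1 + w*p2`.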

So, why is this method good? If the patches are large enough (i.e. if the tessellation rate is high), the pre-transform and post-transform caches of the GPU become more and more efficient, so that we directly reach the maximum theoretical throughput of the rasterizers. For example, the frame rate does not change when you use 64 times more triangles. In fact, we can expect the same level of performance from the GPUs that will ship with tessellation units, without the need for any hand-coded patches. On my 8400GS, I got 200 million triangles per second, but we may easily expect 1 billion triangles per second on DirectX 11 GPUs.

By the way, in the demo I optimize the patch with the very good and simple index optimizer "Tipsify" (you may find the reference here). It reorders the patch indices in linear time, using only triangle / vertex adjacency, while maintaining a very good level of performance (and it is much simpler and much faster than stripification algorithms like TriStrip or NvStrip).
  • Approximation of Subdivision Surfaces
The tessellation algorithm via Direct3D instancing is the core system that provides the many triangles. However, the way the triangles are generated (via tessellated patches) does not seem well suited to the evaluation of subdivision surfaces, since those are complex recursive algorithms. (By the way, one may find excellent references and details on subdivision surfaces in the awesome tutorial located here.) In the QAS article (located here), Tamy proposes to replace the recursive evaluation of the subdivision surface with an analytical approximation. The principle is pretty simple: first, project the triangle's vertices and the midpoint of every edge onto the limit surface (i.e. the subdivision surface), then compute a quadratic Bézier triangle which fits these 6 points (the vertices and the edge midpoints). This leads to a pretty fast and nice approximation of subdivision surfaces.
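Here is my reconstruction of that fitting step (a sketch from the description above, not Tamy's actual code): the three corner control points are the projected vertices themselves, and each edge control point is solved so that the patch interpolates the projected edge midpoint, which gives b110 = 2*m01 - (p0 + p1)/2 and so on for the other two edges:

```cpp
#include <cassert>
#include <cmath>

struct V3 { float x, y, z; };
static V3 add(V3 a, V3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static V3 mul(V3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }

// Control points of a quadratic Bezier triangle.
struct QuadPatch { V3 b200, b020, b002, b110, b011, b101; };

// Fit a quadratic Bezier triangle interpolating the three corner limit
// points p0, p1, p2 and the three edge-midpoint limit points m01, m12, m20
// (reconstruction of the QAS idea, not the original code).
static QuadPatch fitPatch(V3 p0, V3 p1, V3 p2, V3 m01, V3 m12, V3 m20) {
    return { p0, p1, p2,
             add(mul(m01, 2.0f), mul(add(p0, p1), -0.5f)),
             add(mul(m12, 2.0f), mul(add(p1, p2), -0.5f)),
             add(mul(m20, 2.0f), mul(add(p2, p0), -0.5f)) };
}

// Standard quadratic Bezier triangle evaluation in barycentrics (u, v, w).
static V3 evalPatch(const QuadPatch& q, float u, float v, float w) {
    V3 r = mul(q.b200, u * u);
    r = add(r, mul(q.b020, v * v));
    r = add(r, mul(q.b002, w * w));
    r = add(r, mul(q.b110, 2 * u * v));
    r = add(r, mul(q.b011, 2 * v * w));
    r = add(r, mul(q.b101, 2 * w * u));
    return r;
}
```

One can check that evaluating at a corner returns that corner, and evaluating at (1/2, 1/2, 0) returns m01 exactly, since 0.25*(p0 + p1) + 0.5*b110 = m01.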
  • About Adaptive Tessellation
In this article, Tamy proposes a very practical way to adaptively tessellate patches without cracks. This may lead, for example, to simple terrain rendering algorithms (if we ignore the possible memory problems, which may require texture mip-mapping or page-faulting systems). However, adaptive tessellation on current hardware (or rather current APIs, since Radeons already have tessellation units) may be less efficient in some cases, since it requires changing the patches and therefore issuing more draw calls. With nicer APIs (like DX11), this will lead to super efficient solutions.
  • Some Details about Asset Production
What can we do with displacement? Actually, many tools already provide the assets. Using ZBrush, we already have the mesh + normal map + displacement map. Rendering this is simple: tessellate the mesh with a subdivision surface approximation, displace it, and light it correctly using the normal maps in the shader. Another super nice tool would be to extend the mega-texture technique (the original article by Sylvain Lefebvre may be found here) to directly paint displacement on objects. However, the QAS technique may suffer from a serious drawback when artists decide to use creasing. In a modeler like Maya, an artist may work in subdivision surface mode: he can view the tessellated mesh while editing the control mesh, crease triangles, edges or vertices at any level of tessellation, and then see the resulting subdivision surface. For this kind of input, the QAS algorithm requires deep modification. Nevertheless, even if it seems necessary to provide ways to crease the geometry, none of the artists I work with produce assets in this mode (a ZBrush approach plus polygonal-mode editing is generally how they produce meshes).
  • The Demo
Here is the demo. It does not include displacement maps, but I quickly hacked a stupid procedural displacement into the shader (which you may change). Using a displacement map does not change the performance on an 8800GT (thanks to its large number of texture units). On the 8600GT I have at the office, the application becomes texture bound. I do not provide the complete source code yet since it is a bit dirty; I will however try to post it ASAP.
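A procedural displacement of this kind boils down to offsetting each tessellated vertex along its normal. Since the demo is lost, here is a C++ mirror of one possible version (the sine pattern and the `displace` signature are my assumptions, not necessarily what the demo's shader used):

```cpp
#include <cassert>
#include <cmath>

struct V3 { float x, y, z; };

// Hypothetical procedural displacement: offset a vertex p along its normal n
// by a sine-based height field (amplitude amp, spatial frequency freq).
static V3 displace(const V3& p, const V3& n, float amp, float freq) {
    const float h = amp * std::sin(freq * p.x) * std::sin(freq * p.z);
    return { p.x + h * n.x, p.y + h * n.y, p.z + h * n.z };
}
```

In the demo this would run in the vertex shader, after the patch vertex has been placed on the approximated surface.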


Saturday, August 16, 2008

Cuda real time ray tracing - 100 million rays/s?

**Update** This post is somewhat outdated. The code of Radius-Cuda may be found on the internet I think. Look at or similar places. On my side, I lost the code. Sorry :-(

Ray Tracing!
There is a big buzz around these words nowadays and flame wars are everywhere. What I can say, however, is that I have spent a long time writing ray tracers and Monte-Carlo renderers, and: it is fun!

I recently decided to write a ray tracer using Cuda, for two reasons. The first one is that I had never written a single line of Cuda. The second one is that ray tracing was said to be slow on G80 graphics cards, due to the heavy use of caches and stacks during the traversal of the acceleration structure generally required by ray tracers.

As usual, I first fetched all the articles I could find on the net, downloaded Cuda, installed it, ran some samples and began to code the thing. As I needed a minimalist ray tracing framework, I decided to drop my old ray tracer, yaCORT, and installed Radius, a small but pretty fast ray tracer developed by Thierry Berger-Perrin. Since I had moreover already developed a Bounding Volume Hierarchy for it, it was the perfect host into which to inject a Cuda ray tracer.

I therefore decided to write a kd-tree traversal, and the first results were disappointing. Fortunately, with the help of Brian Budge (who posts, as I do, on the uber-excellent ray tracing forum), I managed to get quite good performance.
  • Summary of the implementation
  1. One stack per ray: no need to synchronize rays with a lot of __syncthreads(). The cost is that the stack has to live in local memory (slow), but we no longer need to handle ray packets or similar constructs, as is done with shared stacks or SIMD ray tracing. The Cuda compiler handles the SIMD packing and the branches in the code itself.
  2. Everything is in textures, to cache the accesses to the kd-tree nodes, the triangles and so on. This is much simpler and faster than coalescing accesses to global memory, since no synchronization is needed.
  3. KILL ALL UNIONS. Unions are bad even in C (and forbidden in C++ as soon as a member has a constructor). This command goes with another one: KILL ALL ARRAYS, unless you put them in shared memory. The Cuda compiler seems to put stack-allocated arrays (something like float my_array[4]) in local memory, and the accesses will just kill your app. It may work when the indices are known at compile time, but otherwise, this is just BAD.
  4. Use dirty switches. In my implementation, I need to perform circular permutations of the coordinates (due to the kd-tree traversal and the Baduel / Wald ray-triangle intersection). I tested two approaches: a shared-array strategy with a simple modulo computation, where all the rays are put in shared memory, and a coding-horror switch permutation (see the code). The second one was faster. The reason seems to be that in this case, register allocation (allocating the rays on the stack requires 6 more registers) is more efficient than a shared-array allocation.
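To make point 1 concrete, here is a minimal CPU sketch of a kd-tree traversal with a per-ray stack. This mirrors the structure of such a kernel, not the Radius-Cuda code itself; leaf intersection and the zero-direction case are elided, and the node layout is my own:

```cpp
#include <cassert>
#include <vector>

// leafId >= 0 marks a leaf; inner nodes split on 'axis' at 'split'.
struct Node { int axis; float split; int left, right; int leafId; };
struct Ray  { float org[3], dir[3]; };

// Visit leaves front-to-back using a private stack (in the CUDA version this
// stack lives in per-thread local memory, so no __syncthreads() is needed).
static std::vector<int> traverse(const std::vector<Node>& nodes,
                                 const Ray& ray, float tmin, float tmax) {
    struct Entry { int node; float tmin, tmax; };
    Entry stack[32]; int sp = 0;   // per-ray stack
    std::vector<int> visited;
    int cur = 0;
    for (;;) {
        const Node& n = nodes[cur];
        if (n.leafId >= 0) {
            visited.push_back(n.leafId); // real code: intersect leaf triangles
            if (sp == 0) break;          // real code: stop at first hit < tmax
            --sp;
            cur = stack[sp].node; tmin = stack[sp].tmin; tmax = stack[sp].tmax;
            continue;
        }
        const float o = ray.org[n.axis], d = ray.dir[n.axis];
        const float t = (n.split - o) / d;  // assume d != 0 for brevity
        const int nearChild = (o < n.split || (o == n.split && d < 0)) ? n.left : n.right;
        const int farChild  = (nearChild == n.left) ? n.right : n.left;
        if (t >= tmax || t < 0) cur = nearChild;   // far child never reached
        else if (t <= tmin)     cur = farChild;    // near child behind the ray
        else {                                     // both: push far, go near
            stack[sp++] = { farChild, t, tmax };
            tmax = t; cur = nearChild;
        }
    }
    return visited;
}
```

The CUDA kernel follows the same loop; the compiler handles divergence between rays of a warp on its own.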
  • Performance
  1. The performance is quite good. I am currently developing on a laptop with the low-end GeForce 8400GS, but made some preliminary tests on an 8800 GT. With the last optimizations I made, I think that between 12 million and 40 million rays/s may be computed on a GeForce 8800GT on the demo I ship with the code. I will soon test the code on a GTX280 and, without being too optimistic, I think that 100 million rays/s may be generated on it. This may be one of the fastest ray tracers written on a single processor.
  2. The other good point is that the memory footprint is much smaller than that of a CPU ray tracer. Since the GPU is better at computing than at traversing large structures, a small kd-tree gives the best performance. For 180,000 triangles, only 5MB are required for the kd-tree.
  • About Cuda
After the first disappointments, due to my poor knowledge of the API and of the compiler's behaviour, I think that Cuda is a nice API, easy to use. G80 architectures are quite impressive and, for an unfriendly application like ray tracing, the speed is pretty good with no need to write a single line of SIMD code. The bad point is that there is no way to get real low-level access. After this nice experience, I am really eager to see what the upcoming Larrabee will be able to do with ray tracing. The good point is that Intel will certainly provide both a high-level API for fast prototyping (a la Cuda or the promising OpenCL) and a standard compiler / assembler, since Larrabee cores are just x86s.

**Update** Source code is ... no longer available on the previously posted link. I think you may find it somewhere on