Monday, November 28, 2011

Ray traced occlusion culling in point-frag



Arabic city (culled by the technique described here. The model can be found here)

Hello all,
I spent a decent amount of time implementing an occlusion culling technique for the video game engine I am developing: point-frag.


Various techniques
Initially, I thought hard about the different techniques to go beyond good old frustum culling. The first idea that came up was to use occlusion queries and implement the CHC or CHC++ algorithms. Both sound good but they are pretty complicated. The first one does a bad job at reordering draw calls to decrease state changes and still suffers badly from stalls, while the other one (the better version) is a combination of so many techniques (tighter bounding volumes, multi-queries, batching of already processed nodes, use of heuristics) that it already dissuaded me from implementing it. Both articles also show that it is not easy to get something close to the optimal solution.

Another idea is the classical precomputed potentially visible sets (see the work done by Seth Teller here). This was the technical solution used for Quake (BSP+PVS) and it was a direct implementation of Seth Teller's PhD. The solution is nice, but I did not want anything that requires a time-consuming preprocess, since I would like to have a rendering pipeline that is as simple as possible (i.e. rendering that is as much as possible done at run-time and does not depend on anything precomputed; this will also be true for lighting, BTW).

Software z-buffer
This led me to look for something simpler: basically the solution implemented in PS3 Warhawk, i.e. a software occlusion technique using a CPU-computed z-buffer. This is also the solution implemented in the Battlefield games. You can find a lot of nice details here. Note that Killzone 3 also implements a similar solution. All the PDFs from Guerrilla are here.

All these games basically consider a set of occluders, usually provided by the artists, and implement a software tile-based rasterizer. Basically:
  • You bin the triangles per tile. Several threads each traverse a sub-set of all the occluders and find the set of tiles touched by each of them. You then maintain a list of occluders per tile. Note that it seems KZ3 does not multi-thread this part
  • Then, a bunch of tasks rasterizes the occluders in each tile. Since tiles do not influence each other, this is pretty straightforward (see the sketch right after this list)
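Here is a minimal, hypothetical C++ sketch of that binning phase. The structure names, the tile size and the buffer resolution are mine, not taken from any of the engines mentioned above; it only shows how the two phases decompose.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical structures, only to illustrate the two phases.
struct ScreenTri { float x[3], y[3], z[3]; };   // occluder triangle, already projected to screen space
static const int TILE = 16, W = 256, H = 128;   // tiny software z-buffer
static const int TILES_X = W / TILE, TILES_Y = H / TILE;

// Phase 1: bin each occluder into the tiles overlapped by its 2D bounding box.
// In the engines above, this loop is split across several threads, each one
// handling a sub-set of the occluders.
void binOccluders(const std::vector<ScreenTri> &tris,
                  std::vector<std::vector<uint32_t>> &tileLists) {
  tileLists.assign(TILES_X * TILES_Y, std::vector<uint32_t>());
  for (uint32_t id = 0; id < tris.size(); ++id) {
    const ScreenTri &t = tris[id];
    const float xmin = std::min({t.x[0], t.x[1], t.x[2]}), xmax = std::max({t.x[0], t.x[1], t.x[2]});
    const float ymin = std::min({t.y[0], t.y[1], t.y[2]}), ymax = std::max({t.y[0], t.y[1], t.y[2]});
    if (xmax < 0.f || ymax < 0.f || xmin >= float(W) || ymin >= float(H)) continue; // fully off screen
    const int tx0 = std::max(0, int(xmin) / TILE), tx1 = std::min(TILES_X - 1, int(xmax) / TILE);
    const int ty0 = std::max(0, int(ymin) / TILE), ty1 = std::min(TILES_Y - 1, int(ymax) / TILE);
    for (int ty = ty0; ty <= ty1; ++ty)
      for (int tx = tx0; tx <= tx1; ++tx)
        tileLists[ty * TILES_X + tx].push_back(id);
  }
}

// Phase 2 (not shown): one task per tile rasterizes its list of occluders into its own
// TILE x TILE piece of the z-buffer; tiles never touch each other's pixels, so no locking.
```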
HiZ buffer (highest level) as computed

Once done, you have a z-buffer that you can query to know whether a given object is visible or not. Usually (this is what I did, at least), you take its bounding box, clamp it onto the screen and consider the zmin value (i.e. the distance to the closest point of the bounding box).
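As an illustration, here is a minimal sketch of such a query. This is my paraphrase of the idea rather than the actual point-frag code: the column-major matrix layout, the screen mapping and the conservative handling of corners behind the camera are assumptions of the sketch.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical types, only to illustrate the test.
struct AABB { float min[3], max[3]; };

// Conservative visibility query against a small software z-buffer.
// 'zbuffer' stores, per pixel, the depth of the nearest occluder.
// We project the 8 corners of the box, clamp the 2D footprint to the screen,
// and test the box zmin (closest possible depth of the object) against every covered pixel.
bool isVisible(const AABB &box, const float *zbuffer, int w, int h,
               const float viewProj[16]) {               // column-major, OpenGL style (assumption)
  float xmin = 1e30f, xmax = -1e30f, ymin = 1e30f, ymax = -1e30f, zmin = 1e30f;
  for (int c = 0; c < 8; ++c) {
    const float p[3] = { c & 1 ? box.max[0] : box.min[0],
                         c & 2 ? box.max[1] : box.min[1],
                         c & 4 ? box.max[2] : box.min[2] };
    const float cx = viewProj[0]*p[0] + viewProj[4]*p[1] + viewProj[8]*p[2]  + viewProj[12];
    const float cy = viewProj[1]*p[0] + viewProj[5]*p[1] + viewProj[9]*p[2]  + viewProj[13];
    const float cz = viewProj[2]*p[0] + viewProj[6]*p[1] + viewProj[10]*p[2] + viewProj[14];
    const float cw = viewProj[3]*p[0] + viewProj[7]*p[1] + viewProj[11]*p[2] + viewProj[15];
    if (cw <= 0.f) return true;                           // corner behind the camera: stay conservative
    const float x = (cx/cw * 0.5f + 0.5f) * w, y = (cy/cw * 0.5f + 0.5f) * h, z = cz/cw;
    xmin = std::min(xmin, x); xmax = std::max(xmax, x);
    ymin = std::min(ymin, y); ymax = std::max(ymax, y);
    zmin = std::min(zmin, z);
  }
  const int x0 = std::max(0, int(xmin)), x1 = std::min(w - 1, int(xmax));
  const int y0 = std::max(0, int(ymin)), y1 = std::min(h - 1, int(ymax));
  for (int y = y0; y <= y1; ++y)
    for (int x = x0; x <= x1; ++x)
      if (zmin <= zbuffer[y * w + x]) return true;        // box may be closer than the occluders here
  return false;                                           // occluders are in front of the box everywhere
}
```

A natural refinement is to test first against the coarse HiZ level shown above and only fall back to the finer levels when that coarse test is inconclusive.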

The technique is cool, simple and reasonably fast. There is no GPU dependency, you don't have to struggle with your drivers; everything is perfect.

Except that I don't think rasterization is the perfect tool here :-), at least not for me.

Ray tracing to compute the z-buffer
This is not supposed to be a rasterization vs. ray tracing debate. The thing is that we are speaking here about software rasterization and software ray tracing. There is no debate, at least right now, for me with hardware rasterization. GPU rasterizers are fantastic and beautifully complex machines, and GPUs themselves are fantastic throughput processors.

Small buffers are better with ray tracing
Software rasterizers are another story. If you look at the KZ3 numbers, they are kind of bad: they need 2.5ms to rasterize 1,500 triangles at a 640x360 resolution. Battlefield uses an even smaller z-buffer (256x114).
(Note that a good old G80 is able to rasterize 200,000 triangles in 1ms at 1Kx1K.)
This is my first problem. For such a small buffer, rasterization is already a bad fit. Rasterization plays nicer with big resolutions, in particular to amortize the initial triangle setup cost. The smaller the screen, the more costly the setup steps become.
With ray tracing, there is no such overhead.

Dynamic occluders are not that much of a problem with few triangles

For dynamic scenes, ray tracing is not really different from a tile-based rasterizer. Indeed, a 3D spatial binning of the triangles before building your bounding volume hierarchy (BVH) is not really different from a 2D binning for tile-based rasterization. Actually, 3D binning is even simpler since there is no perspective transformation to do on the vertices at all. You just bin the triangles into a grid. Once done, the BVH build is super fast. See Ingo's paper here for example. It is really fast.
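To make this concrete, here is a tiny sketch of binning triangle centroids into a uniform 3D grid. It is a simplification of what a binned builder does; all names are made up, and the real algorithm is in the paper linked above.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Vec3 { float x, y, z; };
struct Tri  { Vec3 v0, v1, v2; };

// Bin triangles into a RES^3 grid using their centroids. No projection, no clipping:
// this is essentially all the "setup" a binned BVH builder needs before partitioning nodes.
std::vector<std::vector<uint32_t>> binTriangles(const std::vector<Tri> &tris,
                                                const Vec3 &lo, const Vec3 &hi, int RES) {
  std::vector<std::vector<uint32_t>> cells(RES * RES * RES);
  const Vec3 scale = { RES / (hi.x - lo.x), RES / (hi.y - lo.y), RES / (hi.z - lo.z) };
  for (uint32_t id = 0; id < tris.size(); ++id) {
    const Tri &t = tris[id];
    const Vec3 c = { (t.v0.x + t.v1.x + t.v2.x) / 3.f,
                     (t.v0.y + t.v1.y + t.v2.y) / 3.f,
                     (t.v0.z + t.v1.z + t.v2.z) / 3.f };
    const int ix = std::min(RES - 1, std::max(0, int((c.x - lo.x) * scale.x)));
    const int iy = std::min(RES - 1, std::max(0, int((c.y - lo.y) * scale.y)));
    const int iz = std::min(RES - 1, std::max(0, int((c.z - lo.z) * scale.z)));
    cells[(iz * RES + iy) * RES + ix].push_back(id);
  }
  return cells;
}
```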

Ray packets are super fast
This is the key. Once you have a hierarchy over your occluders, you just create SIMD packets of rays and traverse the BVH. This is damn fast. point-frag includes a complete ray packet traversal (and also a single-ray traversal for future lighting ideas I want to play with). This is just super fast. For the Fairy Forest model (see here), I easily achieve 95 million rays/sec on a 2600K and even 35 million rays/sec on a 4-core AMD Fusion laptop. Once again, ray tracing in that case is really simple. There is no perspective transform, no clipping. Just purely simple 3D computations. No corner cases.
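To show what this amounts to, here is a scalar sketch of the packet traversal idea. It is not the point-frag code: a real packet tracer tests 4 or 8 rays at a time with SSE/AVX, and the node layout below (two consecutive children, explicit stack) is just an assumption of the sketch.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical, simplified structures.
struct Ray { float org[3], dir[3], tmax; };
struct BVHNode {
  float lower[3], upper[3];
  uint32_t firstChildOrPrim;   // index of the first child (inner node) or first triangle (leaf)
  uint16_t primCount;          // 0 for inner nodes
};

// Does any ray of the packet hit the node box? As soon as one does, we must descend.
// Amortizing the node test over the whole packet is where the speed-up comes from.
static bool anyRayHitsBox(const std::vector<Ray> &packet, const BVHNode &n) {
  for (const Ray &r : packet) {
    float tnear = 0.f, tfar = r.tmax;
    for (int a = 0; a < 3; ++a) {
      const float inv = 1.f / r.dir[a];
      float t0 = (n.lower[a] - r.org[a]) * inv, t1 = (n.upper[a] - r.org[a]) * inv;
      if (t0 > t1) std::swap(t0, t1);
      tnear = std::max(tnear, t0);
      tfar  = std::min(tfar,  t1);
    }
    if (tnear <= tfar) return true;
  }
  return false;
}

// Depth-first packet traversal with an explicit stack. Leaves intersect their
// triangles against every ray of the packet (omitted here).
void traversePacket(const std::vector<BVHNode> &nodes, std::vector<Ray> &packet) {
  uint32_t stack[64];
  int top = 0;
  stack[top++] = 0;                             // root node
  while (top > 0) {
    const BVHNode &n = nodes[stack[--top]];
    if (!anyRayHitsBox(packet, n)) continue;    // the whole packet misses: skip the subtree
    if (n.primCount) {
      /* intersect the n.primCount triangles against each ray, updating ray.tmax */
    } else {
      stack[top++] = n.firstChildOrPrim;        // push both children
      stack[top++] = n.firstChildOrPrim + 1;
    }
  }
}
```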

To be honest, I also implemented a highly optimized tile-based rasterizer on CPU in the past. For Fairy Forest (about 180,000 triangles), at 1024x1024 it was only 30% faster than the highly optimized ray packet traversal. However, at 256x256, the ray packet tracer was 80% faster than the rasterizer, with much, much simpler code. If you are interested, the DynBVH article by Ingo (Wald) is here. It is a beautiful and simple algorithm. Really cool.

Ray traced occlusion culling is the only valid solution when you do NOT have artists
Yep. This is my case right now. I only have myself and Google SketchUp :-). So, I basically need to compute occlusion on the initial data. And once again, triangle setup and clipping are too expensive when you have 800,000 occluders (i.e. triangles). Ray tracing does not really care, and this is really good.

Ray tracing acceleration structures can be small
This is a usual problem with ray tracing: BVH + triangle soups are huge. Instead of a triangle soup, you can use indexed face sets. This is better, but you still duplicate your data. Fortunately, at least for static scenes, you can use the bounding volume hierarchy (BVH) itself as a compression scheme. Manfred (Ernst) and I, when I was still at Intel Labs, indeed demonstrated a simple quantization + compression technique that uses the BVH to hierarchically encode the complete geometry. The very good things are:
  • The decompression algorithm is so fast that you do not need a cache. You just decompress the structure on the fly as you traverse it.
  • The compression rates are very good: about 5x compared to an indexed face set (index buffer + vertex buffer + BVH) and, if I remember properly, about 8x compared to a triangle soup + BVH.
The article is here.
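Just to give the flavor of the idea (a toy sketch, not the actual scheme from the paper, which also encodes the vertices themselves this way): child boxes are stored as a few bits per plane relative to the parent box and decompressed on the fly during traversal.

```cpp
#include <cstdint>

// Toy sketch only: hierarchical quantization of child boxes relative to the parent box.
struct AABB { float lower[3], upper[3]; };

struct QuantizedChild {
  uint8_t qlo[3], qhi[3];   // 8-bit offsets inside the parent box, per axis
};

// Decompress a child box on the fly: no cache needed, just a few mul/add per axis,
// done at the same time the node is traversed.
inline AABB decompress(const AABB &parent, const QuantizedChild &c) {
  AABB child;
  for (int a = 0; a < 3; ++a) {
    const float extent = parent.upper[a] - parent.lower[a];
    child.lower[a] = parent.lower[a] + extent * (c.qlo[a] / 255.f);
    child.upper[a] = parent.lower[a] + extent * (c.qhi[a] / 255.f);
  }
  return child;
}
```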

For all these reasons and limitations, I therefore implemented a ray traced z-buffer in point-frag. You can actually see the code in the repository and even try it if you want to (be aware, I have only tested the code on a very small set of machines, and it requires OpenGL 3.3).

Timings are pretty good. Computing a 256x128 z-buffer over 400,000 triangles and culling 1,500 boxes takes 1ms on my Nehalem machine. The ray tracing part is multi-threaded but the culling part is not (yet). Both are however decently optimized. Finally, the overall rendering structure is simplistic today, but I have to start somewhere.

That's all folks :-)

EDIT: note that if you use SPUs and you only have 1,000-ish triangles, using ray tracing is really appealing. You asynchronously upload the BVH (possibly compressed) entirely into the local store and you traverse it at the speed of light :-)

6 comments:

TomV said...

Interesting read! Thanks for posting this.

Changmin said...

Awesome!!! I can't wait for tomorrow.
I want to test this right now.

en13 said...

Great article and many interesting links! I used to render q3 maps (except the subdivision curves) in my GPU raytracer, but never really understood how the original Quake did it. Thanks for the PVS link.

Vilém Otte said...

Hi, sounds cool. Although comparing a G80 against software rasterizers is not really fair (you would have to write a CUDA rasterizer on the G80 and then compare), as the largest performance-eaters are handled by dedicated hardware, not programmable shaders.

As I have a fast rasterizer, I'll try this solution (I used artist-placed portals and a precomputed PVS, but as I'm currently working on a dynamic open-world project, I can't afford these).

Unknown said...

This is a very interesting approach to generating a software occlusion buffer for queries, thank you for writing about this! How do you think this would perform on modern ARM processors using NEON, with a very tiny output like 80x48? I know most mobile GPUs do deferred binning on the triangles anyway, but this would greatly ease the initial draw calls and the vertex program pipeline on large dynamic scenes.

I've also got a cool idea which might work well for point-frag (as I noticed you've already got a job-based way to stream/upload DXT textures on the fly using squish): also render out UV coordinates and mip-map information to do tile determination CPU-side if you want virtual textures. That saves doing a readback from the GPU, but calculating the derivatives for accurate mip selection might mean more in-between rays, unless you do this step on the final rasterized image CPU-side...
