Wednesday, November 30, 2011 is dead. Long live ompf!

The best ray tracing / graphics related forum in the world is down.
Fortunately, Jacco Bikker sets up a new one here:


Monday, November 28, 2011

Ray traced occlusion culling in point-frag

Arabic city (culled by the technique described here. The model can be found here)

Hello all,
I spent a decent amount of time implementing an occlusion culling technique for the video game engine I am developing: point-frag.

Various techniques
Initially, I thought hard about the different techniques to go beyond the good old frustum technique. First idea that came up was to use occlusion queries and implement CHC or CHC++ algorithms. Both sound good but they are pretty complicated. The first one does a bad job at reordering draw calls to decrease state changes and still badly suffers from stalls while the other one (the better version) is a combination of so many techniques (tighter bounding volumes, multi-queries, batching of already processed nodes, use of heuristics) that it already dissuades me to implement it. Both articles also show that this is not that easy to achieve something close to the optimal solution.

Other idea is the classical precomputed potential visible sets (see the work done by Seth Teller here). This was the technical solution used for Quake (BSP+PVS) and it was the direct implementation of Seth Teller's PhD. The solution is nice but I did not want anything that was too much time consuming like a preprocess since I would like to have a rendering pipeline as stupid as possible (i.e. rendering something which is as much as possible done at run-time and does not depend on anything precomputed (it will be also true for lighting BTW))

Software z-buffer
This leads me to look for something simpler: basically the solution implemented in Ps3 Warhawk, ie a software occlusion technique using a CPU-computed z buffer. This is basically the solution implemented in Battlefields games. You have a lot of nice details here. Note that Killzone 3 also implements a similar solution. All PDF from Guerilla are here.

All these games basically consider a set of occluders usually provided by the artists. They implement a software tiled based rasterizer. Basically:
  • You bin the triangles per tile. Many threads simply traverse each a sub-set of all occluders and find the set of tiles touched by each of them. You then maintain the list of occluders per tile. Note that it seems that KZ3 does not multi-thread this part
  • Then, a bunch of tasks rasterizes the occluders in each tile. Since tiles do not influence each other, this is pretty straightforward
HiZ buffer (highest level) as computed

Once done, you have a z buffer that you can query to know if one given object is visible or not. Usually (this is what I did at least), you take its bounding box, clamp it onto the screen and consider the zmin value (ie the distance to the closest point in the bounding box).

The technique is cool, simple and reasonably fast. There is no GPU dependency, you don't have to struggle with your drivers, everything is perfect.

Except that I don't think rasterization is the perfect tool here :-), at least for me

Ray tracing to compute the z-buffer
This is not supposed to be a rasterization vs ray tracing debate. The thing is that we speak here about software rasterization and software ray tracing. There is no debate at least right now for me with hardware rasterization. GPU rasterizers are fantastic and beautifully complex machines and GPU themselves are fantastic throughput processors.

Small buffers are better with ray tracing
Software rasterizers are another story. If you look at KZ3 numbers, they are kind of bad. They need 2.5ms to complete the rasterization with 1500 triangles for 640x360 resolution. Battlefield uses even smaller z-buffer (256x114).
(Note that a good old G80 is able to 200,000 triangles in 1ms in 1Kx1K)
This is my first problem. For such a small buffer, rasterization is already bad. Rasterization plays nicer with big resolution in particular to offset the initial triangle setup cost. The smaller the screen, the more the setup steps are costly.
With ray tracing, there is no such overhead

Dynamic occluders are not that a problem with few triangles

For dynamic scenes, ray tracing is not really different from tiled based rasterizer. Indeed, a 3D spatial binning of the triangle before building your bounding volume hierarchy (BVH) is not really different from a 2D binning for tiled based rasterization. Actually, 3D binning is even simpler since there is no perspective transformation at all to do on the vertices. You just bin the triangles in a grid. Once done, the BVH build is super fast. See Ingo's paper here for example. It is really fast.

Ray packets are super fast
This is the key. Once you have a hierarchy over your occluders, you just create SIMD packets of rays and you traverse the BVH. This is damn fast. point-frag includes a complete ray packet traversal (and also single ray traversal for future lighting ideas I want to play with). This is just super fast. For Fairy Forest model (see here), I easily achieve 95 millions rays/sec on a 2600K and even 35 millions rays/sec on a 4 cores Fusion AMD laptop. Once again, ray tracing in that case is really simple. There is no perspective transform, no clipping. Just purely simple 3D computations. No corner case.

To be honest, I also implemented in the past a highly optimized tile based rasterizer on CPU. For Fairy Forest (about 180,000 triangles), on 1024x1024 it was only 30% faster that the highly optimized ray packet traversal. However, on 256x256, the ray packet tracer was 80% faster than the rasterizer with a much much simpler code. If you are interested, the DynBVH article by Ingo (Wald) is here. It is a beautiful and simple algorithm. Really cool.

Ray traced occlusion culling is the only valid solution when you do NOT have artists
Yep. This is my case right now. I only have me and google sketchup :-). So, I basically need to compute occlusions on the initial data. And once again, triangle setups and clipping is too expensive if you have 800,000 occluders (ie triangles). Ray tracing does not really care and this is really good.

Ray tracing acceleration structures can be small
This is a usual problem with ray tracing. BVH + triangle soups are huge. Instead of triangle soup, you can use indexed face sets. This is better but still you duplicate your data. Fortunately, at least for static scenes, you can use a bounding volume hierarchy (BVH) as a compression scheme. Manfred (Ernst) and I, when I was still at Intel Labs, indeed demonstrated a simple quantization + compression technique that uses the BVH itself to hierarchically encode a complete geometry. The very good things are:
  • The decompression algorithm is so fast that you do not need to have a cache. You just decompress the structure on the fly at the same time you traverse it.
  • The compression rates are very good. About 5x compared to an indexed face set (index buffer + vertex buffer + BVH) and, if I remember properly, about 8x compared to a triangle soup + BVH
The article is here.

For all these reasons and limitations, I therefore implemented a ray traced z-buffer in point frag. You can actually see the code in the repository and even try it if you want to (be aware, I only tested the code on a very small set of machines and this requires OGL 3.3).

Timing are pretty good. Computing a 256x128 z buffer over 400,000 triangles and culling 1,500 boxes takes 1ms on my nehalem machine. The ray tracing part is multi-threaded but not the culling part (yet). Both are however decently optimized. Finally, the overall rendering structure is simplistic today but I have to start somewhere.

That's all folks :-)

EDIT: note that if you use SPUs and you only have 1,000ish triangles, using ray tracing is really appealing. You asynchronously upload the BVH (possibly compressed) entirely in the local storage and you traverse it at the speed of light :-)

Tuesday, November 15, 2011

Exponential Propagation while using Distributed Job Systems

This is basically a common idiom with work-stealing like tasking systems. Problem with them is that you only have a local view of the complete system.

Typical opposite approach is a shared FIFO of tasks. You possibly have a lot of contention, the memory behaviour is really bad (compared to work-stealing which is really good) but you have an exact view of the running system at each time. The advantage for example is the way to handle priorities. You just make several shared priority queues and picking up the highest priority task is quite easy.

To overcome (partially) these kinds of problems, a usual idiom with work-stealing-like systems is to exponentially create information. People usually call that "nuclear reaction" :-)

Some examples:

Task sets with an atomic counter
This is the approach I chose for yaTS. A task set is just a task to run n times possibly with as many threads as you need. All the threads share the task set atomic counter. The problem is that you need to make it run as quickly as possible on the other threads. The technique is just to reschedule the task twice in its own queue to have this task stolen by other threads. Since all the other threads will also push the task back twice in their queues, you are sure that the task set will propagate exponentially across the queues

Task sets with ranges
This is the parallel_for approach used by TBB. Here you recursively subdivide the range [0,n-1] by splitting it into two parts. This creates a tree of tasks which once again propagate exponentially across the threads

(Note that the TBB approach is a bit nicer that the atomic version, since:
  • You do not have a lot of contention on one counter
  • since you push many tasks, you go back to the scheduler more often. This is better when a higher priority task just appears on your local queues
  • you can load balance in an automatic way by choosing when you stop to recurse)
Highly reactive and fast thread yields and wakeups
This is another issue with distributed queues: when does the thread need to sleep and when does it need to wake up? Once again, propagate information exponentially
  • The threads go to sleep when they do not find anything to do after n successive failed tries. Just use a condition wait for that.
  • A thread that creates a new task randomly tries to wake up two threads. Just use a condition signal for that. Either many tasks are going to be created, and since all the threads wake up possibly two other threads, very quickly, every one is working. Or, no tasks or very few tasks are created, and awake threads quickly return to sleep.
That is it. Ignoring some technical details to kill the threads and avoid dead locks, you can very quickly turn off and turn on your threads. Pretty cool to limit power consumption, to play nice with hyper-threading (it is really bad to spin for nothing while the other threads of your core are working for real) and to avoid any useless contention.

Sunday, November 6, 2011

Point-frag: a distributed game engine

These days and weeks, I am developing a small game engine in my spare time. I personally spent 18 months in the video game industry. It was pretty nice and overall video games are very interesting since the amount of technical challenges is huge: rendering of course, streaming, network, AI, a lot of system programming and so on...

So, after thinking about cool personnal projects I may do, I decided to play with a video game engine. The current code is here:
(do not try to run it right now, it is changing quickly and I am the only developer so it is unstable)

One of the first goals is to make it completely distributed: there is no heavy threads, no main thread or no renderer threads. Basically, *everything* is executing uniformly through the same tasking system (in that case, yaTS).

In all game engines I saw, several heavy threads usually exist. They are typically:
- main thread which is basically a loop
- renderer thread which run the OGL/DX/Gcm/whatever draw calls
- streaming threads
- audio thread
Usually many of them communicate with a dedicated command buffer with the main thread which is usually connected to them in a unidirectional or bidirectional way.

Beside that, all other threads are worker threads and usually communicate with a task interface (like C functions or C++ class or whatever). They are a resource that can be used at any time.

This basically makes the thing somehow cumbersome and we have to deal with several levels of parallelism in the system.

So, the main idea with point frag is to remove that and to have only worker threads:
- there is no main thread
- there is no game loop

Of course, typically you may need to pin some tasks to some threads (like something that does some OGL draw call), the system is flexible enough to run arbitrary task on arbitrary threads.

In the code you can see today, I simply display a model and use the tasking system to load the textures (asynchronously) and compress them into DXT1... Nothing spectacular but I must start somewhere.

However, the cool thing you may already see is the fact that the game loop is emulated by making frame_n spawn frame_{n+1}. This allows the system to continue in a continuation passing style way.

The good thing is that since there is no game loop, you can then design your complete game as a pipeline. The idea is to subdivide your frame into sub-tasks and see each sub-task as an element of a pipeline. This allows to have partial frame overlapping or if you duplicate the pipeline (so you have twice, three times a frame), you can even increase more the parallelism (with more latency).

Since everything is a task, it becomes easier to decrease (for example) the "renderer thread" (basically the one that has the OGL context) burden. Just subdivide the job to do to only have OGL calls in specialized OGL tasks. Everything else can go somewhere else.

Unfortunately, nothing is perfect and IOs are a problem. By IO, I mean any asynchronous requests handled by the system. When you read a file in a non-blocking way, you must do something to cover the latency. In yaTS, you can ask the tasking system to run anything. It is convenient.

However, the other task you call can also do an IO that can be much longer to complete than the IO before. This therefore blocks the calling task that performs the shorter IO.

Fortunately, you can handle this problem with a tasking system a la yaTS / TBB. Basically, it is possible to emulate co-routines with tasks and to yield capabilities

This is for another post (and that does not require assembly)

So, the no-heavy-threads approach may sound good but scheduling remains a damn problem. This is the good thing with thread over-subscription + blocking IOs: the system really helps you making the system progress in a relatively fair and time-shared environment. When you discard the system and do everything yourself, you are on your own

Anyway, point frag is my laboratory for experiments regarding tasking, rendering, procedural generation... Let's get some fun

Wednesday, November 2, 2011

To all HW vendors: do *not* thread your drivers

This is kind of crazy. Many drivers are now spawning worker threads to do that or that task. Please make your driver with as few contention as you can but *really* stop spawning threads behind the developer back.

This just plays very badly with the application and now the developer has to deal with even more discrepancies from HW to HW and drivers to drivers possibly dealing with horrible thread over-subscription.

The only one that knows if the application needs to have a multi-threaded back-end (for creation resource for instance) is the user of the API.