Sunday, December 18, 2011

Sauerbraten: a game and an engine that rock

Quick post about a fantastic game and engine: Sauerbraten (you may find it here)
If you are tired of design-pattern BS, the SFINAE static-type-introspection joke, and the over-design of everything just to print a string... Sauerbraten is for you.

This is one of the code bases that has impressed me the most recently. Just straightforward, fantastic code. Everything is packed into 70,000 lines of code: very nice custom (and really fast) UDP networking code (ENet), an insanely small scripting engine, straight-to-the-point rendering code, very cool AI, a superb UI (3D is even supported), a simple but powerful shader system...

Really look at the code and you will see what a brutal and brilliant coding style means.

Monday, December 5, 2011

Global variables and cvars in tasking systems...

Usually it is just better to avoid global variables, but there is something I really want in my small video game engine: Quake-style console variables (cvars). They are handy and let you tweak your system very easily.

The problem is that in a fully distributed system, changing a cvar from the console is just terrible. If you do some frame overlapping (for example), you may completely break your system (say you decrease or increase the maximum number of particles on screen...). Cvars simply become nasty race conditions.

Initially, I wanted to do something complicated, i.e. some copy-on-write scheme. You keep a fully reference-counted copy of your cvars, and when you modify them you copy them into a new reference-counted state. Then, instead of reading your variables from the global state, you fetch them from your copy. As long as the copy is valid (i.e. no modification has been done), you can keep it, so it is not that expensive.

The problem is that it is just uber-complicated and your cvars are not global anymore, and therefore not that handy. Too bad. The idea I finally decided to implement (not done yet) is to lock the tasking system when you change a global cvar. Basically, you stop all the other threads: you simply make them sleep after they finish their current task, and once everyone except the locking thread is sleeping, you modify the variable.

So it is something like:
cvar = ...

It is brutal and super expensive, but you don't care since this is really rare anyway. From the other threads' perspective (the ones that go to sleep), you know for sure that nothing is going to change while you run a task, since the lock never crosses a task boundary.

So, for the usual path (i.e. you just read the cvar), you do not need any mutual exclusion, since you know nothing is going to change during the task itself.
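A minimal sketch of this stop-the-world cvar write, assuming a hypothetical `taskBoundary()` hook that workers call between two tasks (none of these names are the actual yaTS / point-frag API):

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// All names here are illustrative, not the real yaTS/point-frag API.
static std::atomic<int>  cvarMaxParticles{1024}; // an example global cvar
static std::atomic<bool> lockRequested{false};
static std::atomic<int>  parkedWorkers{0};
static std::mutex        parkMutex, writerMutex;
static std::condition_variable parkCond;

// Workers call this *between* two tasks, never inside one, so a running
// task always sees a stable cvar value.
void taskBoundary() {
  if (!lockRequested.load()) return;
  std::unique_lock<std::mutex> lk(parkMutex);
  ++parkedWorkers;
  parkCond.notify_all(); // tell the writer one more worker is asleep
  parkCond.wait(lk, [] { return !lockRequested.load(); });
  --parkedWorkers;
}

// Called by the (non-worker) thread that modifies a cvar: park everyone,
// write, then release. Concurrent writers are serialized by writerMutex.
void setCvar(std::atomic<int> &cvar, int value, int workerCount) {
  std::lock_guard<std::mutex> serialize(writerMutex);
  std::unique_lock<std::mutex> lk(parkMutex);
  lockRequested.store(true);
  parkCond.wait(lk, [&] { return parkedWorkers.load() == workerCount; });
  cvar.store(value);     // every other thread sleeps: no race possible
  lockRequested.store(false);
  parkCond.notify_all(); // wake the workers back up
}
```

The write path is expensive (it parks every worker), but that is exactly the trade-off described above: reads stay free, and the rare console write pays the whole cost.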

It is going to take some time to have a real-life test case in point-frag, but the ongoing implementation in yaTS seems straightforward.

EDIT: Note also that several threads can request a lock simultaneously. This is not a problem and it is properly serialized (like any other lock). The only thing is that any variable that may be modified globally has to be reloaded after the lock.

Friday, December 2, 2011

Quick and dirty compiler tests

With point-frag, I am trying to support as many compilers and as many platforms (well, Windows, Linux and soon Mac...) as I can. I hope to be able to build a benchmark from it at some point. Anyway, I quickly compared GCC, ICC and Clang on Linux with my small ray tracing test.
Here are the numbers for my Nehalem i7 (4 cores / 8 threads, triple-channel memory; basically the high-end chip from 3 years ago). The code is compiled for SSE2 and does not use anything more recent.

GCC (4.6.2)
ray packets: 90 million rays/s
single rays: 9.9 million rays/s

ICC (12.1.0)
ray packets: 95.6 million rays/s
single rays: 10.6 million rays/s

Clang (3.1, compiled from svn)
ray packets: 94.2 million rays/s
single rays: 9.9 million rays/s

Unsurprisingly, the Intel compiler is slightly faster on an Intel processor than the other compilers. More surprisingly, the speed difference is quite small. Two years ago, the GCC / ICC difference was about 15% on the same machine. The latest Clang also outputs some fast code.

Good job open source compiler guys!

Wednesday, November 30, 2011

ompf is dead. Long live ompf!

The best ray tracing / graphics related forum in the world is down.
Fortunately, Jacco Bikker set up a new one here:


Monday, November 28, 2011

Ray traced occlusion culling in point-frag

Arabic city (culled by the technique described here. The model can be found here)

Hello all,
I spent a decent amount of time implementing an occlusion culling technique for the video game engine I am developing: point-frag.

Various techniques
Initially, I thought hard about the different techniques to go beyond good old frustum culling. The first idea that came up was to use occlusion queries and implement the CHC or CHC++ algorithm. Both sound good but they are pretty complicated. The first one does a bad job at reordering draw calls to decrease state changes and still suffers badly from stalls, while the other one (the better version) is a combination of so many techniques (tighter bounding volumes, multi-queries, batching of already processed nodes, use of heuristics) that it already dissuaded me from implementing it. Both articles also show that it is not that easy to achieve something close to the optimal solution.

The other idea is the classical precomputed potentially visible sets (see the work done by Seth Teller here). This was the technical solution used for Quake (BSP+PVS) and it was a direct implementation of Seth Teller's PhD. The solution is nice, but I did not want anything too time-consuming like a preprocess, since I would like the rendering pipeline to be as stupid as possible (i.e. rendering that is as much as possible done at run-time and does not depend on anything precomputed; this will also be true for lighting, BTW).

Software z-buffer
This led me to look for something simpler: basically the solution implemented in the PS3 game Warhawk, i.e. a software occlusion technique using a CPU-computed z-buffer. This is also basically the solution implemented in the Battlefield games. You will find a lot of nice details here. Note that Killzone 3 also implements a similar solution. All the PDFs from Guerrilla are here.

All these games basically consider a set of occluders, usually provided by the artists, and implement a software tile-based rasterizer. Basically:
  • You bin the triangles per tile. Many threads each traverse a subset of all occluders and find the set of tiles touched by each of them. You then maintain a list of occluders per tile. Note that KZ3 does not seem to multi-thread this part
  • Then, a bunch of tasks rasterize the occluders in each tile. Since tiles do not influence each other, this is pretty straightforward
HiZ buffer (highest level) as computed

Once done, you have a z-buffer that you can query to know whether a given object is visible or not. Usually (this is what I did, at least), you take its bounding box, clamp it onto the screen and consider the zmin value (i.e. the distance to the closest point of the bounding box).
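The query step above can be sketched as follows (a hedged illustration, not point-frag's actual code): the object is conservatively visible if the closest depth of its bounding box is nearer than at least one stored occluder depth in the screen rectangle it covers.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Illustrative occlusion query against a CPU-computed z-buffer.
// Convention assumed here: larger z means farther from the camera.
bool isVisible(const std::vector<float> &zBuffer, int width, int height,
               int x0, int y0, int x1, int y1, float boxZMin) {
  // Clamp the screen-space rectangle of the bounding box to the buffer.
  x0 = std::max(x0, 0); y0 = std::max(y0, 0);
  x1 = std::min(x1, width - 1); y1 = std::min(y1, height - 1);
  if (x0 > x1 || y0 > y1) return false; // fully off screen
  for (int y = y0; y <= y1; ++y)
    for (int x = x0; x <= x1; ++x)
      if (boxZMin < zBuffer[y * width + x])
        return true;  // some covered pixel may show the object
  return false;       // every covered pixel is closer: the box is occluded
}
```

A real implementation would query a hierarchical (HiZ) buffer instead of scanning every pixel, but the test itself is the same comparison.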

The technique is cool, simple and reasonably fast. There is no GPU dependency, you don't have to struggle with your drivers, everything is perfect.

Except that I don't think rasterization is the perfect tool here :-), at least for me

Ray tracing to compute the z-buffer
This is not supposed to be a rasterization vs ray tracing debate. The thing is that we are speaking here about software rasterization and software ray tracing. For me there is no debate, at least right now, with hardware rasterization. GPU rasterizers are fantastic, beautifully complex machines, and GPUs themselves are fantastic throughput processors.

Small buffers are better with ray tracing
Software rasterizers are another story. If you look at the KZ3 numbers, they are kind of bad: they need 2.5ms to rasterize 1,500 triangles at 640x360 resolution. Battlefield uses an even smaller z-buffer (256x114).
(Note that a good old G80 is able to rasterize 200,000 triangles in 1ms at 1Kx1K.)
This is my first problem. For such a small buffer, rasterization is already bad. Rasterization plays nicer with big resolutions, in particular to offset the initial triangle setup cost. The smaller the screen, the more costly the setup steps become.
With ray tracing, there is no such overhead.

Dynamic occluders are not that much of a problem with few triangles

For dynamic scenes, ray tracing is not really different from a tile-based rasterizer. Indeed, 3D spatial binning of the triangles before building your bounding volume hierarchy (BVH) is not really different from the 2D binning done for tile-based rasterization. Actually, 3D binning is even simpler since there is no perspective transformation to do on the vertices at all. You just bin the triangles into a grid. Once done, the BVH build is super fast. See Ingo's paper here for example. It is really fast.
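The grid binning can be sketched like this (illustrative code, not Ingo Wald's actual implementation): each triangle is dropped into a cell keyed by its centroid, with no transform or clipping involved.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Vec3 { float x, y, z; };
struct Tri  { Vec3 v0, v1, v2; };

// Bin triangles into a res x res x res grid over the box [lo, hi] by
// centroid. Each cell ends up with the indices of the triangles it owns,
// ready for a per-cell BVH build.
std::vector<std::vector<int>>
binTriangles(const std::vector<Tri> &tris, Vec3 lo, Vec3 hi, int res) {
  std::vector<std::vector<int>> cells(res * res * res);
  auto cellOf = [&](float c, float l, float h) {
    int i = int((c - l) / (h - l) * res);
    return std::min(std::max(i, 0), res - 1); // clamp border triangles
  };
  for (int t = 0; t < int(tris.size()); ++t) {
    const Tri &tr = tris[t];
    const Vec3 c = { (tr.v0.x + tr.v1.x + tr.v2.x) / 3.f,
                     (tr.v0.y + tr.v1.y + tr.v2.y) / 3.f,
                     (tr.v0.z + tr.v1.z + tr.v2.z) / 3.f };
    const int ix = cellOf(c.x, lo.x, hi.x);
    const int iy = cellOf(c.y, lo.y, hi.y);
    const int iz = cellOf(c.z, lo.z, hi.z);
    cells[(iz * res + iy) * res + ix].push_back(t);
  }
  return cells;
}
```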

Ray packets are super fast
This is the key. Once you have a hierarchy over your occluders, you just create SIMD packets of rays and traverse the BVH. This is damn fast. point-frag includes a complete ray packet traversal (and also a single-ray traversal for future lighting ideas I want to play with). For the Fairy Forest model (see here), I easily achieve 95 million rays/sec on a 2600K and even 35 million rays/sec on a 4-core AMD Fusion laptop. Once again, ray tracing in that case is really simple. There is no perspective transform, no clipping. Just purely simple 3D computations. No corner cases.

To be honest, I also implemented a highly optimized tile-based rasterizer on the CPU in the past. For Fairy Forest (about 180,000 triangles) at 1024x1024, it was only 30% faster than the highly optimized ray packet traversal. However, at 256x256, the ray packet tracer was 80% faster than the rasterizer, with much, much simpler code. If you are interested, the DynBVH article by Ingo (Wald) is here. It is a beautiful and simple algorithm. Really cool.

Ray traced occlusion culling is the only valid solution when you do NOT have artists
Yep. This is my case right now. I only have myself and Google SketchUp :-). So I basically need to compute occlusion on the initial data. And once again, triangle setup and clipping are too expensive if you have 800,000 occluders (i.e. triangles). Ray tracing does not really care, and this is really good.

Ray tracing acceleration structures can be small
This is a usual problem with ray tracing: BVHs + triangle soups are huge. Instead of a triangle soup, you can use an indexed face set. This is better, but you still duplicate your data. Fortunately, at least for static scenes, you can use the bounding volume hierarchy (BVH) itself as a compression scheme. Manfred (Ernst) and I, when I was still at Intel Labs, demonstrated a simple quantization + compression technique that uses the BVH itself to hierarchically encode the complete geometry. The very good things are:
  • The decompression algorithm is so fast that you do not need a cache. You just decompress the structure on the fly as you traverse it.
  • The compression rates are very good: about 5x compared to an indexed face set (index buffer + vertex buffer + BVH) and, if I remember properly, about 8x compared to a triangle soup + BVH
The article is here.

For all these reasons and limitations, I therefore implemented a ray traced z-buffer in point-frag. You can actually see the code in the repository and even try it if you want (be aware, I only tested the code on a very small set of machines, and it requires OpenGL 3.3).

Timings are pretty good. Computing a 256x128 z-buffer over 400,000 triangles and culling 1,500 boxes takes 1ms on my Nehalem machine. The ray tracing part is multi-threaded but the culling part is not (yet). Both are however decently optimized. Finally, the overall rendering structure is simplistic today, but I have to start somewhere.

That's all folks :-)

EDIT: note that if you use SPUs and you only have 1,000-ish triangles, using ray tracing is really appealing. You asynchronously upload the BVH (possibly compressed) entirely into local storage and you traverse it at the speed of light :-)

Tuesday, November 15, 2011

Exponential Propagation while using Distributed Job Systems

This is basically about a common idiom in work-stealing-like tasking systems. The problem with them is that you only have a local view of the complete system.

The typical opposite approach is a shared FIFO of tasks. You possibly have a lot of contention and the memory behaviour is really bad (compared to work-stealing, which is really good), but you have an exact view of the running system at all times. One advantage, for example, is the way priorities are handled: you just make several shared priority queues, and picking up the highest-priority task is quite easy.

To (partially) overcome these kinds of problems, a usual idiom with work-stealing-like systems is to propagate information exponentially. People usually call that a "nuclear reaction" :-)

Some examples:

Task sets with an atomic counter
This is the approach I chose for yaTS. A task set is just a task to run n times, possibly with as many threads as you need. All the threads share the task set's atomic counter. The problem is that you need to make it run on the other threads as quickly as possible. The trick is to reschedule the task twice in its own queue so that it gets stolen by other threads. Since all the other threads will also push the task back twice into their queues, you are sure the task set will propagate exponentially across the queues.
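The shared atomic counter can be sketched as below (a hypothetical `TaskSet` type, not the yaTS API). The exponential re-push into the stealing queues is left out; only the counter draining is shown:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Every worker that ends up with a copy of the task set drains the same
// atomic counter, so each of the n items is processed exactly once, by
// whichever thread claims it first.
struct TaskSet {
  std::atomic<int> remaining;
  explicit TaskSet(int n) : remaining(n) {}
  // run() may be called concurrently by any number of threads.
  template <typename F> void run(F &&item) {
    for (;;) {
      const int i = remaining.fetch_sub(1) - 1; // claim one element
      if (i < 0) return;                        // nothing left: done
      item(i);                                  // process element i
    }
  }
};
```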

Task sets with ranges
This is the parallel_for approach used by TBB. Here you recursively subdivide the range [0,n-1] by splitting it into two parts. This creates a tree of tasks which, once again, propagates exponentially across the threads.

(Note that the TBB approach is a bit nicer than the atomic version, since:
  • you do not have a lot of contention on one counter
  • since you push many tasks, you go back to the scheduler more often, which is better when a higher-priority task has just appeared in your local queues
  • you can load-balance automatically by choosing when to stop recursing)
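The recursive range splitting can be sketched like this (plain `std::thread` stands in for stolen tasks; a real scheduler would push the halves into work-stealing queues instead of spawning threads):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Recursively halve [lo, hi) until a chunk is at most `grain` elements,
// then run that chunk serially. Each split hands one half to another
// "worker" (a thread here), so the work spreads exponentially.
template <typename F>
void parallelFor(int lo, int hi, int grain, F &&f) {
  if (hi - lo <= grain) {          // small enough: run serially
    for (int i = lo; i < hi; ++i) f(i);
    return;
  }
  const int mid = lo + (hi - lo) / 2;
  std::thread left([&] { parallelFor(lo, mid, grain, f); }); // "stolen" half
  parallelFor(mid, hi, grain, f);                            // local half
  left.join();
}
```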
Highly reactive and fast thread yields and wakeups
This is another issue with distributed queues: when does a thread need to sleep and when does it need to wake up? Once again, propagate the information exponentially:
  • A thread goes to sleep when it does not find anything to do after n successive failed tries. Just use a condition wait for that.
  • A thread that creates a new task randomly tries to wake up two threads. Just use a condition signal for that. Either many tasks are going to be created, and since every awakened thread possibly wakes two more, very quickly everyone is working; or no tasks (or very few) are created, and the awake threads quickly go back to sleep.
That is it. Ignoring some technical details needed to kill the threads and avoid deadlocks, you can very quickly turn your threads off and on. Pretty cool to limit power consumption, to play nice with hyper-threading (it is really bad to spin for nothing while the other threads of your core are doing real work) and to avoid any useless contention.
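The sleep/wake protocol can be sketched with a condition variable (illustrative names; the real system signals two *random* threads, here we simply notify twice):

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// A worker parks here after n failed steal attempts; a producer wakes up
// to two sleepers per pushed task, so wakeups cascade exponentially when
// work floods in.
struct Parking {
  std::mutex m;
  std::condition_variable cv;
  int pending = 0; // tasks pushed but not yet noticed by a worker

  void workerWait() { // called after n successive failed steal tries
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [this] { return pending > 0; });
    --pending;
  }
  void onTaskPushed() { // called by whoever spawns a new task
    { std::lock_guard<std::mutex> lk(m); ++pending; }
    cv.notify_one(); // the real system signals *two* random threads;
    cv.notify_one(); // notifying twice gives the same cascade effect
  }
};
```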

Sunday, November 6, 2011

Point-frag: a distributed game engine

These days and weeks, I am developing a small game engine in my spare time. I personally spent 18 months in the video game industry. It was pretty nice, and overall video games are very interesting since the amount of technical challenge is huge: rendering of course, streaming, networking, AI, a lot of systems programming and so on...

So, after thinking about cool personal projects I might do, I decided to play with a video game engine. The current code is here:
(do not try to run it right now, it is changing quickly and I am the only developer so it is unstable)

One of the first goals is to make it completely distributed: there are no heavy threads, no main thread and no renderer thread. Basically, *everything* executes uniformly through the same tasking system (in this case, yaTS).

In all the game engines I have seen, several heavy threads usually exist. They are typically:
- a main thread, which is basically a loop
- a renderer thread, which runs the OGL/DX/GCM/whatever draw calls
- streaming threads
- an audio thread
Usually many of them communicate with the main thread through a dedicated command buffer, connected in a unidirectional or bidirectional way.

Besides that, all other threads are worker threads and usually communicate through a task interface (like C functions or C++ classes or whatever). They are a resource that can be used at any time.

This makes the whole thing somewhat cumbersome, and we have to deal with several levels of parallelism in the system.

So, the main idea with point-frag is to remove all that and have only worker threads:
- there is no main thread
- there is no game loop

Of course, you may typically need to pin some tasks to some threads (like something that does OGL draw calls), but the system is flexible enough to run arbitrary tasks on arbitrary threads.

In the code you can see today, I simply display a model and use the tasking system to load the textures (asynchronously) and compress them into DXT1... Nothing spectacular, but I must start somewhere.

However, the cool thing you may already see is that the game loop is emulated by making frame_n spawn frame_{n+1}. This allows the system to continue in a continuation-passing-style way.
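A toy illustration of this loop-less idea: each frame task pushes the next frame into the scheduler before it returns, so the "game loop" is just a chain of continuations (the FIFO below stands in for the real task queue; all names are illustrative):

```cpp
#include <cassert>
#include <functional>
#include <queue>

// Stand-in for the real task queue of the tasking system.
static std::queue<std::function<void()>> scheduler;
static int framesRendered = 0;

void frame(int n, int last) {
  ++framesRendered;                              // this frame's actual work
  if (n < last)                                  // spawn frame n+1: this
    scheduler.push([=] { frame(n + 1, last); }); // replaces the game loop
}
```

Because every frame is just another task, nothing in the engine has to sit in a `while (true)` loop; the workers simply keep draining the queue.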

The good thing is that since there is no game loop, you can then design your complete game as a pipeline. The idea is to subdivide your frame into sub-tasks and see each sub-task as an element of a pipeline. This allows partial frame overlapping, and if you duplicate the pipeline (so you have two or three frames in flight), you can increase the parallelism even more (at the cost of more latency).

Since everything is a task, it becomes easier to reduce (for example) the burden of the "renderer thread" (basically the one that owns the OGL context). Just subdivide the work so that only the OGL calls live in specialized OGL tasks. Everything else can go somewhere else.

Unfortunately, nothing is perfect and IOs are a problem. By IO, I mean any asynchronous request handled by the system. When you read a file in a non-blocking way, you must do something to cover the latency. In yaTS, you can ask the tasking system to run another task in the meantime. It is convenient.

However, that other task may itself do an IO that takes much longer to complete than the first one. It then blocks the calling task that performed the shorter IO.

Fortunately, you can handle this problem with a tasking system à la yaTS / TBB. Basically, it is possible to emulate coroutines with tasks and yield capabilities.

This is for another post (and it does not require assembly).

So, the no-heavy-threads approach may sound good, but scheduling remains a damn hard problem. That is the good thing about thread over-subscription + blocking IOs: the OS really helps you make the system progress in a relatively fair, time-shared way. When you discard the OS scheduler and do everything yourself, you are on your own.

Anyway, point-frag is my laboratory for experiments regarding tasking, rendering, procedural generation... Let's have some fun.

Wednesday, November 2, 2011

To all HW vendors: do *not* thread your drivers

This is kind of crazy. Many drivers now spawn worker threads to do this or that task. Please write your driver with as little contention as you can, but *really* stop spawning threads behind the developer's back.

This just plays very badly with the application, and now the developer has to deal with even more discrepancies from HW to HW and from driver to driver, possibly dealing with horrible thread over-subscription.

The only one who knows whether the application needs a multi-threaded back-end (for resource creation, for instance) is the user of the API.

Tuesday, October 11, 2011

The effect of data sharing between CPU cores


I made several benchmarks to stress the tasking system (yaTS) I recently developed and to see if it handles continuation / completion correctly.

Two tests actually spawn a binary tree of tasks.
You can see them in utests.cpp (CascadeNodeTask and NodeTask)

There is only one difference between both:
  1. NodeTask. Here each node completes the root. This means that when a task finishes running, it decrements an atomic counter in the root task. When this counter reaches zero, the root is done
  2. CascadeNodeTask. Here each node completes its parent. This basically means that the tasks finish in a cascading way (this is the classical and efficient approach with work stealing, where tasks are processed in depth-first order)
In (2), there is much less contention than in (1), because in (1) the root cache line, which contains the atomic counter, travels from core to core during the whole process.
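The two completion schemes can be sketched with atomic counters (illustrative code, not the actual utests.cpp tasks):

```cpp
#include <atomic>
#include <cassert>

// Each node's counter starts at (number of children + 1 for its own run)
// and the node is "ended" when it reaches zero.
struct Node {
  std::atomic<int> toEnd{0};
  Node *parent = nullptr;
};

// (1) NodeTask style: every finished task decrements the *root* counter
//     directly, so all cores hammer the same cache line.
void finishFlat(Node &root) { root.toEnd.fetch_sub(1); }

// (2) CascadeNodeTask style: a finished node completes only its parent;
//     decrements cascade upward, mostly touching cache lines local to
//     each subtree.
void finishCascade(Node &node) {
  Node *cur = &node;
  while (cur && cur->toEnd.fetch_sub(1) == 1) // just reached zero: ended
    cur = cur->parent;                        // so complete the parent too
}
```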

This leads to interesting results on my i7 machine (4 cores / 8 threads)

  1. NodeTask
    1 thread == 237 ms
    8 threads == 213 ms
    Speed up == x1.1

  2. CascadeNodeTask
    1 thread == 237 ms
    8 threads == 54 ms
    Speed up == x4.4 (> 4 => Congratulations hyper-threading!)

This is basically the price of sharing.

(EDIT: also, do not see this as a performance "study". This is just a random but interesting performance difference I saw while writing functional tests for the yaTS code.)

Disk asynchronous IO mess on Linux and libaio

Damn, there is something really imperfect on Linux: asynchronous disk IO. Basically, you want to issue a read or a write and you want the function to return immediately.

Well, two classical solutions:
  1. You open the file with the O_NONBLOCK flag. No luck: the system does *not* have to honour it and, after a test, it actually blocks *anyway* when you open a disk file with O_NONBLOCK. See here for some documentation
  2. You use the portable POSIX AIO library as described here. Very bad thing: the implementation is done in user mode, using pthreads to emulate everything. Terrible for any high-performance, game-like application where you do not want any kind of thread over-subscription
Looking for the perfect answer, there is finally a non-portable way to do exactly what I want. It is called libaio and it is just a thin layer above some Linux-native system calls.

There is a quick and dirty example of libaio here.

  1. libaio does not spawn any threads behind your back
  2. libaio is really asynchronous
  3. libaio can batch a number of IO jobs in one request
So, this is asynchronous IOs done right :-)

EDIT: but libaio does not seem to buffer anything. I am wondering if there is an asynchronous buffered solution on Linux for disk IO, even if in my case (streaming textures / models) unbuffered IOs should be OK. Well, that may require manual handling of buffers for big data you may not want to have fully in RAM before processing it. We will see :-)

Sunday, October 9, 2011

yaTS - yet another Tasking System

Hello all,

some real posts after a *long* time. Well, basically, from now on I only want to post about pieces of software I write and publish.

So, I think I will publish a small series of posts related to a small library I wrote, yaTS, which is here:

As I am extremely lazy for this first post, I will mostly take what I wrote in tasking.hpp :)

The idea here is to present what the tasking system does.

Basically, a "tasking system" offers the possibility to schedule functions and run them asynchronously in shared-memory "system threads". This is basically a thread pool.

However, yaTS tries to offer more in this API by letting the user:
  1. Define *dependencies* between tasks
  2. Set up priorities for each task, i.e. a higher-priority task will be more likely to be executed than a lower-priority one
  3. Set up affinities for each of them, i.e. a task can be "pinned" to a specific hardware thread (typically useful when something depends on a context, like an OpenGL context)
The core of this tasking system is the "Task". A task represents a function to call (to run) later. Each task can specify dependencies in two ways:
  1. "Start dependencies", specified (see below) by Task::starts. Basically, to be able to start, a task must have all its start dependencies *ended*
  2. "End dependencies", specified (see below) by Task::ends. In that case, to be able to finish, a task must have all its end dependencies *ended*
So, task1->starts(task2) means that task2 cannot start before task1 has ended.
Also, task3->ends(task4) means that task4 cannot end before task3 has ended.
Note that each task can only start one task and can only end one task.
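A minimal *sequential* sketch of these two dependency kinds (Task::starts / Task::ends are the yaTS names; the tiny scheduler below is an illustration, not the real implementation):

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

struct Task {
  std::function<void()> fn;
  int toStart = 1;          // own scheduling + one per start dependency
  int toEnd   = 1;          // own run + one per end dependency
  Task *startDep = nullptr; // task that waits for us before *starting*
  Task *endDep   = nullptr; // task that waits for us before *ending*
  void starts(Task &t) { startDep = &t; ++t.toStart; }
  void ends(Task &t)   { endDep   = &t; ++t.toEnd;   }
};

struct Scheduler {
  std::vector<Task *> ready;
  void schedule(Task &t) { if (--t.toStart == 0) ready.push_back(&t); }
  void finish(Task &t) {  // one end dependency (or the run itself) is done
    if (--t.toEnd != 0) return;
    // t has *ended*: release the task it starts and the task it ends.
    if (t.startDep && --t.startDep->toStart == 0) ready.push_back(t.startDep);
    if (t.endDep) finish(*t.endDep);
  }
  void run() {
    while (!ready.empty()) {
      Task *t = ready.back(); ready.pop_back();
      t->fn();
      finish(*t); // the task's own run has just completed
    }
  }
};
```

The real system does the same bookkeeping with atomic counters across worker threads; the counter-based release logic is the interesting part.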

Specifying dependencies this way allows the user to *dynamically* (i.e. during task execution) create a directed acyclic graph (DAG) of tasks. One may look at the unit tests to see how it basically works.

yaTS also classically implements a TaskSet, which is a function that can be run n times (concurrently on any number of threads). Task sets are a particularly efficient way to logically create n tasks in one chunk.

So, yaTS is somewhat similar to TBB and other tasking systems.
However, I tried hard to make it small, and I tried to have an API which is not too complicated yet reasonably powerful.

As you may see in utests.cpp, writing dynamic graphs of tasks is easy. You may use classical continuation-like tasking idioms or wait for a task's completion.

You may "pin" some tasks onto particular HW threads. This will be useful when dealing with graphics APIs or anything with a context attached to a thread.

In the next posts, I would like to present the current implementation details a bit. Later, and as soon as the code is started, I would like to present how we can use this library to design a "game-loop-less" game engine.


Tuesday, September 6, 2011

Blog Resurrection Attempt

Hello all,

This blog has been dead for a long time. I am trying to resurrect it :-)

I recently decided to work on a new personal project, a small video game engine. I basically extracted some code from the very nice Intel open-source project "Embree" (from Manfred Ernst and Sven Woop, outstanding guys I had the chance to work with in the past), which you may find here:

Then, I quickly wrote some very basic pieces of code I threw here:

Nothing spectacular actually.

Anyway, the first thing I would like to try is to build a "symmetrical" multi-threading system. Basically, for several reasons (historical, or because some hardware contexts are bound to one particular thread), games still use "heavy" threads like the rendering thread or the main thread. The idea I would like to try is to build a tasking system where only worker threads exist (i.e. the main thread becomes a regular worker) and to replace every game / render loop with job pipelines.

Stay tuned :-)