Tuesday, October 11, 2011

Effect of data sharing between CPU cores

Hello,

I wrote several benchmarks to stress the tasking system (yaTS) I recently developed and to see whether it handles continuation / completion correctly.

Two of the tests spawn a binary tree of tasks. You can see them in utests.cpp (CascadeNodeTask and NodeTask).

There is only one difference between the two:
  1. NodeTask. Each node completes the root: when a task finishes running, it decrements an atomic counter stored in the root task. When this counter reaches zero, the root is done
  2. CascadeNodeTask. Each node completes its parent, so the tasks finish in a cascade (this is the classical and efficient way to do it with a work-stealing approach, where the tasks are processed in depth-first order)
In (2), there is much less contention than in (1), because in (1) the root cache line which contains the atomic counter travels from core to core during the whole run.
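
To make the difference concrete, here is a minimal sketch of the two completion patterns. This is *not* the actual yaTS code, just the idea:

  #include <atomic>

  struct Node {
    Node *parent;            // NULL for the root
    std::atomic<int> toEnd;  // number of tasks that must still end
  };

  // (1) NodeTask style: every finishing task decrements the *same*
  // counter in the root, so the cache line holding root->toEnd keeps
  // bouncing from core to core
  void completeRoot(Node *root) {
    if (root->toEnd.fetch_sub(1) == 1) { /* the root is done */ }
  }

  // (2) CascadeNodeTask style: each finishing task only touches its
  // parent's counter, so most of the traffic stays inside the sub-tree
  // a worker is processing in depth-first order
  void completeParent(Node *node) {
    Node *parent = node->parent;
    if (parent && parent->toEnd.fetch_sub(1) == 1)
      completeParent(parent); // the parent just ended -> cascade upward
  }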

This leads to interesting results on my i7 machine (4 cores / 8 threads):

  1. NodeTask
    1 thread == 237 ms
    8 threads == 213 ms
    Speed up == x1.1

  2. CascadeNodeTask
    1 thread == 237 ms
    8 threads == 54 ms
    Speed up == x4.4 (> 4 => Congratulations hyper-threading!)

This is basically the price of sharing.

(EDIT: also, do not see this as a performance "study". It is just a random but interesting performance difference I noticed while writing functional tests for the yaTS code)

The asynchronous disk IO mess on Linux, and libaio

Damn, there is something really imperfect on Linux: asynchronous disk IO. Basically, you want to issue a read or a write and have the call return immediately.

Well, there are two classical solutions:
  1. You open the file with the O_NONBLOCK flag. No luck: the system is actually *not* required to honor non-blocking mode for a disk file, and after a quick test, it does block *anyway* when you open one with O_NONBLOCK. See here for documentation
  2. You use the portable POSIX AIO library as described here. Very bad: the implementation is done in user space, using pthreads to emulate everything. That is terrible for any high-performance, game-like application where you do not want any kind of thread oversubscription
Looking for a better answer, I finally found a non-portable way to do exactly what I want. It is called libaio, and it is just a thin layer above some Linux-native system calls.

There is a quick and dirty example of libaio here.
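
Since I only link to it, here is roughly what the flow looks like. This is a stripped-down sketch with no error handling; the file name and sizes are made up, and you need to link with -laio:

  #include <libaio.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <cstdlib>

  int main(void) {
    // O_DIRECT bypasses the page cache; this is also why the buffer
    // must be aligned below
    int fd = open("textures.bin", O_RDONLY | O_DIRECT);
    void *buf = NULL;
    posix_memalign(&buf, 4096, 4096);

    io_context_t ctx = 0;
    io_setup(32, &ctx);                   // up to 32 in-flight IOs

    struct iocb cb, *cbs[1] = { &cb };
    io_prep_pread(&cb, fd, buf, 4096, 0); // read 4KB at offset 0
    io_submit(ctx, 1, cbs);               // returns immediately; note
                                          // that several iocbs can be
                                          // submitted in one call

    struct io_event event;                // do useful work, then reap
    io_getevents(ctx, 1, 1, &event, NULL);

    io_destroy(ctx);
    close(fd);
    free(buf);
    return 0;
  }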

So,
  1. libaio does not spawn any thread behind your back
  2. libaio is really asynchronous
  3. libaio can batch several IO jobs in one request
So, this is asynchronous IO done right :-)

EDIT: but libaio does not seem to buffer anything. I wonder whether there is an asynchronous *buffered* solution on Linux for disk IO, even if in my case (streaming textures / models) unbuffered IO should be OK. Well, that may require manually handling buffers for big data you may not want to load fully into RAM before processing it. We will see :-)

Sunday, October 9, 2011

yaTS - yet another Tasking System

Hello all,

Some real posts after a *long* time. Basically, I now only want to post about pieces of software I actually write and publish.

So, I think I will publish a small series of posts about a small library I wrote: yaTS, which is here:

http://code.google.com/p/yats/

As I am extremely lazy for this first post, I will mostly reuse what I wrote in tasking.hpp :)

The idea here is to present what the tasking system does.

Basically, a "tasking system" offers the possibility to schedule and asynchronously run functions in shared memory "system threads". This is basically a thread pool.

However, yaTS tries to offer more in its API by letting the user:
  1. Define *dependencies* between tasks
  2. Set up priorities for each task, i.e. a higher-priority task is more likely to be executed than a lower-priority one
  3. Set up affinities for each of them, i.e. a task can be "pinned" on some specific hardware thread (typically useful when something depends on a context, like an OpenGL context)
The core of this tasking system is the "Task". A task represents a function to call (to run) later. Each task can specify dependencies in two ways:
  1. "Start dependencies", specified (see below) by Task::starts. Basically, to be able to start, a task must have all its start dependencies *ended*
  2. "End dependencies", specified (see below) by Task::ends. In that case, to be able to finish, a task must have all its end dependencies *ended*
So, task1->starts(task2) means that task2 cannot start before task1 has ended.
Also, task3->ends(task4) means that task4 cannot end before task3 has ended.
Note that each task can only start one task and can only end one task.
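
In (hypothetical) code, this gives something like the sketch below. Apart from Task::starts and Task::ends, every name here is invented for the example and is not necessarily the real yaTS API:

  class MyTask : public Task {
  public:
    virtual void run(void) { /* do some work */ }
  };

  Task *task1 = new MyTask, *task2 = new MyTask;
  Task *task3 = new MyTask, *task4 = new MyTask;

  task1->starts(task2); // task2 cannot start before task1 has ended
  task3->ends(task4);   // task4 cannot end before task3 has ended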

Specifying dependencies this way allows the user to *dynamically* (i.e. during task execution) create a directed acyclic graph (DAG) of tasks. One may look at the unit tests to see how it basically works.

yaTS also classically implements a TaskSet, which is a function that can be run n times (concurrently, on any number of threads). A TaskSet is a particularly efficient way to logically create n tasks in one chunk.
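
A hypothetical sketch of the idea (again, the real TaskSet interface may differ):

  // One logical task whose run function is invoked n times, possibly
  // concurrently, each call getting a different index
  class ClearBuffers : public TaskSet {
  public:
    ClearBuffers(size_t n) : TaskSet(n) {}
    virtual void run(size_t elemID) { /* clear buffer number elemID */ }
  };

  Task *clear = new ClearBuffers(64); // stands for 64 tasks in one chunk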

So, yaTS is somewhat similar to TBB or other tasking systems.
However, I tried hard to make it small, with an API that is not too complicated but still reasonably powerful.

As you may see in utests.cpp, writing dynamic graphs of tasks is easy. You may use a classical continuation-style approach or simply wait for a task to complete.

You may "pin" some task onto a particular HW threads. This will be useful when dealing with graphics API or anything with a context attached to a thread.

In the next posts, I would like to present some of the current implementation details. Later, once the code is started, I would like to present how this library can be used to design a "game-loop-less" game engine.

Ben