Thursday, December 27, 2012

Playing with oprofile on Linux

I just spent some time using oprofile on Linux. oprofile lets you profile basically everything running on your system with rather low overhead.
Lots of details here:

A quick overview:

1. Point oprofile at your kernel image (root). Skip this step if you do not care about kernel symbols
$ opcontrol --vmlinux=/usr/src/linux-3.2.13-1-ARCH/vmlinux

2. Make oprofile attribute the time spent in libraries to the application that uses them (root)
$ opcontrol --separate=lib

3. Start oprofile (root)
$ opcontrol --start

4. Run your program, then report the time spent in each function of "cube_client" :-)
$ opreport --demangle=smart  --symbols ~/src/cube/src/cube_client

5. You get this:
CPU: AMD64 family12h, speed 1497.22 MHz (estimated)
Counted CPU_CLK_UNHALTED events (CPU Clocks not Halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        image name               symbol name
68078004 72.8798             /usr/lib/dri/
3984600   4.2657  cube_client              world::render_seg_new(float, float, float, int, int, int, int, int)
3060858   3.2768  cube_client              world::isoccluded(float, float, float, float, float)
2838442   3.0386  cube_client              rdr::render_flat(int, int, int, int, int, sqr*, sqr*, sqr*, sqr*, bool)
2696379   2.8866             __mcount_internal
1777893   1.9033  cube_client              world::render_wall(sqr*, sqr*, int, int, int, int, int, sqr*, sqr*, bool)
1664943   1.7824             mcount
1450401   1.5527             /lib/
794027    0.8500             _wordcopy_fwd_aligned
787522    0.8431  cube_client              world::computeraytable(float, float)
687461    0.7360  cube_client              rdr::render_square(int, float, float, float, float, int, int, int, int, int, sqr*, sqr*, bool)
669011    0.7162  cube_client              rdr::ogl::lookuptex(int, int&, int&)
640268    0.6854       /usr/lib/fglrx/
603660    0.6462  cube_client              rdr::render_flatdelta(int, int, int, int, float, float, float, float, sqr*, sqr*, sqr*, sqr*, bool)
486056    0.5203  cube_client              rdr::ogl::drawframe(int, int, float)
441795    0.4730  cube_client              rdr::ogl::addstrip(int, int, int)
164852    0.1765             __memmove_sse2
160559    0.1719  cube_client              _ZN7physics7collideEP6dynentbff.constprop.6


You will find a lot of information on the net, such as how to capture other performance counters. To list the events available on your machine:
$ opcontrol --list-events

Thursday, August 2, 2012

IvyBridge GPU documentation and code on the web

Hello all,

Just a reminder that the IVB (IvyBridge) spec is online. That means:
  • The complete state setting is documented
  • The complete ISA for the "shader cores" (we call them Execution Units or EUs) is also here
  • The documentation for the interesting shared functions (sampler, loads/stores) is also here
The documentation is here:

It may be a bit rough to start with but fortunately, we also have a complete MIT licensed open source OpenGL stack called "Mesa". It is here:

Mesa is a big piece of code that supports many targets but you may see the Intel GPU specific part here:

Friday, July 20, 2012

Various code bases and cube (the game) pushed on github

I decided to follow the hype and pushed everything to github:

Note that I also cleaned up cube (the first cube game, the one before Sauerbraten) to make it compile with no complaint on gcc 4.6 and VS2010.

Did I already say that cube is amazing? The complete engine (cube itself + its network layer aka the 2005 version of enet) takes 10,000 LoC.

I may write a post reviewing the code. In the meantime, just for fun, take a good look at command.cpp, which implements an insanely powerful mini-scripting language in 300 LoC. Really cool.

Obviously, for more features, Sauerbraten and its next incarnation Tesseract are also really impressive.

Cube however remains unique by its size.

Thursday, May 10, 2012

Euphoria after finding a hard-core bug

Just a rant about debugging sessions.
I have been writing some low-level code related to GPU driver and compiler stuff on Linux. Well, at some point, I ran some of my tests and bam, a bug.

Here starts the usual story, the one I think all programmers experience the hard way.

The bug does not appear on the HW simulator. Do not expect to use valgrind or anything like it. This is happening on the GPU. Youhou.... You are on your own: the hardware against you.

This is the story everybody who has to work on low-level stuff already knows:

Day 1: You found a bug before leaving. Well, let's see that tomorrow.

Day 2: OK, you think it is easy. You read the code, trying to figure out what is happening. You think deeply about every root cause you can imagine. So, you play with some of the parameters. Sometimes it crashes, sometimes it does not.
Then, you start to see that a dozen parameters can activate / deactivate the crash.
The horror really begins when every hypothesis you come up with is discarded by the next experiment.
Then, you know you are going to suffer. This is not an algorithm bug. This is a damn random corruption due to some buffer overrun or uninitialized value somewhere in the code.

Day 3: There is no way around. You are not clever enough to consider all the symptoms together, find the disease and cure it. Well, you are not Dr House and your intellect is limited. So, you decide to go the brutal way: broad spectrum antibiotics, chemotherapy, radiation therapy and amputation. You do not know what is going on so you start butchering the code. One by one you deactivate the sub-systems.
1/ Do I do a buffer overrun on the compiler? OK, compile the code offline, dump it to file and reload it instead.
2/ Did I do something bad in the C++ pre-main functions? OK, remove the cpp files you have and do without them.
3/ Is this structure buggy? OK, just do not use it and directly use the low-level pieces instead
4/ Cool! You find an alternative way to make the code work. Now, progressively merge both source codes (the sick one and the healthy one)
532/ You now have a test case made of 10 C files that either works or fails.
1023/ After 10 hours of debugging sessions, you find the bug... Just here. Plain stupid, so obvious but fucking well hidden.

This is just so good. You do not even need to implement the fix. You just *know* it works. All symptoms make sense. One bug that just explains everything.

Finding a bug is so much like making a disease diagnosis.

Tuesday, April 17, 2012

Tesseract is live!

Lee Salzman (aka eihrul), the maintainer of Sauerbraten, and I spent some time working on a revamped version of the Sauerbraten engine.

Well, Lee did almost everything, even if I did some parts of the snapped cascaded shadow map implementation and we had some interesting discussions about what to do.

Tesseract aims to be a modern version of the Sauerbraten engine, with dynamic lighting and shadow maps for everything, based on a kind-of-standard deferred shading pipeline.

This includes omni-shadow maps using either cube maps or tetra shadow maps, and sunlight shadows with cascaded shadow maps. Everything actually uses a giant shadow map atlas.

Good thing is that Lee made a two-pass approach to handle and shade transparent objects (front-most layer) and as you may see here, it gives some pretty neat results on transparent / semi transparent objects.

Code is finally here:

Wednesday, March 14, 2012

Some yaTS updates

Hello all,
I spent some time updating yaTS.
Some of the big items:

Reactive way to yield / wake up threads
Initially I wanted something purely distributed, like randomly waking up threads when you push a task. It turned out to be pretty bad. You do not really want anything random here. You want to wake up a thread that is definitely sleeping, and you want to give it something to do. So the solution is to maintain a *global* bitfield that identifies which threads are sleeping. When you schedule a new task:
  • if the task has no affinity (it is stealable), you push it on your own queue, pick a thread that is sleeping, wake it up and mailbox it the ID of your queue, so that the thread that just woke up knows exactly where to pick up the job.
  • if the task has an affinity, you just push the task to the target thread (the one that matches the affinity ID) and, if that thread sleeps, you wake it up.
Of course, there are two or three details to handle to avoid deadlocks, and the go-to-sleep and wake-up operations are now more or less serialized. So going to sleep and waking up are no longer distributed, but I do not think this matters, since any application that spends its time yielding and waking up threads already has a problem.
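
The sleeping-thread bitfield described above could look roughly like the sketch below. This is not the yaTS code: the class name SleepMask and its methods are made up for illustration, it assumes at most 64 worker threads, and the actual blocking / waking (condition variable, futex, ...) and the mailbox are left out.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper (not taken from yaTS): one bit per worker thread,
// set while that thread is asleep. Assumes at most 64 threads.
class SleepMask {
public:
  // Called by a worker just before it goes to sleep.
  void markSleeping(int tid) {
    bits.fetch_or(bit(tid), std::memory_order_acq_rel);
  }

  // Atomically claim any sleeping thread. Returns its ID, or -1 if
  // nobody sleeps. The caller is then the only one allowed to wake it.
  int claimSleeper() {
    std::uint64_t cur = bits.load(std::memory_order_acquire);
    while (cur != 0) {
      const std::uint64_t lowest = cur & (~cur + 1); // lowest set bit
      if (bits.compare_exchange_weak(cur, cur & ~lowest,
                                     std::memory_order_acq_rel,
                                     std::memory_order_acquire))
        return indexOf(lowest);
      // On failure, cur has been reloaded: retry with the new value.
    }
    return -1;
  }

  // Claim one specific thread (the affinity case). Returns true only
  // if that thread was sleeping and the caller must wake it up.
  bool claim(int tid) {
    const std::uint64_t old =
        bits.fetch_and(~bit(tid), std::memory_order_acq_rel);
    return (old & bit(tid)) != 0;
  }

private:
  static std::uint64_t bit(int tid) { return std::uint64_t(1) << tid; }
  static int indexOf(std::uint64_t oneBit) {
    int i = 0;
    while ((oneBit >>= 1) != 0) ++i;
    return i;
  }
  std::atomic<std::uint64_t> bits{0};
};
```

Clearing the bit at claim time means two producers can never pick the same sleeper, which is the property you want before mailboxing a queue ID to it.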

A Task profiler
Just a C++ interface with user-provided callbacks that are triggered on a bunch of internal events (before the run function, after the run function, when the task is done and so on...)
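
To give an idea of the shape of such an interface, here is a minimal sketch. Every name in it (TaskProfilerSketch, onRunStart, onRunEnd, onDone, runWithProfiler) is an assumption made for the example; the real yaTS hooks may be named and organized differently.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical profiler interface: the scheduler calls these hooks,
// the user overrides the ones he cares about.
class TaskProfilerSketch {
public:
  virtual ~TaskProfilerSketch() = default;
  virtual void onRunStart(const char *) {} // before the run function
  virtual void onRunEnd(const char *) {}   // after the run function
  virtual void onDone(const char *) {}     // when the task is done
};

// Example user profiler: counts runs and completions per task name.
class CountingProfiler : public TaskProfilerSketch {
public:
  void onRunStart(const char *name) override { ++runs[name]; }
  void onDone(const char *name) override { ++completed[name]; }
  std::unordered_map<std::string, int> runs, completed;
};

// Stand-in for the scheduler side: fires the hooks around a task body.
template <typename Fn>
void runWithProfiler(TaskProfilerSketch &profiler, const char *name, Fn &&body) {
  profiler.onRunStart(name);
  body();
  profiler.onRunEnd(name);
  profiler.onDone(name);
}
```

Default empty hooks keep the cost of an unused event down to one virtual call.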

A Bunch of C++ policy-like classes
I added many helper task classes built on top of tasking.hpp. The really cool one is the one that extends the tasks to make them support any number of start and end dependencies. Rather convenient. Good thing is that you can even be a start dependency of another task even if you are already running or even if you are actually done.

More restricted waitForCompletion
Waiting for a task from a task is actually really hard. So hard that my previous implementation was just totally wrong. My idea was to do something like this:

void Task::waitForCompletion(Task &other) {
  while (other.isNotDone()) runAnyOtherTaskThatIsReady();
}

The big problem is that the other task you run from "runAnyOtherTaskThatIsReady()" can itself wait for you. So, by recursing, you introduce a cycle in your DAG and the system deadlocks. One way to get something that works and is easy to use may be to have yieldable tasks and to replace the stack recursion with coroutines that are scheduled properly. Well, I am lazy right now :-)

I therefore removed waitForCompletion. Now, you can wait for a task but only from *outside* the tasking system.
Fortunately, with the relaxed multiple dependency tasks, you can have the same thing as waitForCompletion with no deadlock and with a continuation passing style approach.
Basically, this:

Task *MyTask::run(void) {
  run some code A;
  run some code B;
  return NULL;
}

becomes, with some C++ lambda horrors:
Task *MyTask::run(void) {
  run some code A;
  Task *cont = spawn([]() { run some code B; return NULL; });
  otherTask->multiStarts(cont); // new flexible end dependencies here
  return cont;
}

Not too bad with our favorite I-am-so-evil-that-you-will-shoot-yourself-in-the-head-with-my-pseudo-closure-with-no-garbage-collection-that-makes-you-use-reference-counted-pointers-that-introduce-memory-leaks-with-cycling-references C++ language :-)
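
To make the idea concrete without the real library, here is a toy, single-threaded sketch of the continuation passing pattern. It is not yaTS: ToyTask, ToyScheduler, startAfter and demoOrder are names invented for the example, and as a simplification the tasks waiting on the first task are not handed over to its continuation.

```cpp
#include <deque>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct ToyTask;
using TaskPtr = std::shared_ptr<ToyTask>;

// Toy task: a body that may return a continuation, plus start dependencies.
struct ToyTask {
  std::function<TaskPtr()> body;
  int pendingStarts = 0;        // start dependencies not yet satisfied
  bool scheduled = false, done = false;
  std::vector<TaskPtr> waiters; // tasks that start after this one
};

struct ToyScheduler {
  std::deque<TaskPtr> ready;

  TaskPtr spawn(std::function<TaskPtr()> body) {
    auto t = std::make_shared<ToyTask>();
    t->body = std::move(body);
    return t;
  }
  // 'after' will not run before 'before' is done. Harmless if
  // 'before' has already finished.
  void startAfter(const TaskPtr &before, const TaskPtr &after) {
    if (before->done) return;
    ++after->pendingStarts;
    before->waiters.push_back(after);
  }
  void schedule(const TaskPtr &t) {
    t->scheduled = true;
    if (t->pendingStarts == 0) ready.push_back(t);
  }
  void run() {
    while (!ready.empty()) {
      TaskPtr t = ready.front();
      ready.pop_front();
      TaskPtr cont = t->body();
      t->done = true;
      for (auto &w : t->waiters)  // release whoever waited for t
        if (--w->pendingStarts == 0 && w->scheduled) ready.push_back(w);
      t->waiters.clear();
      if (cont) schedule(cont);   // nobody ever blocks in a loop
    }
  }
};

// Runs "A", then the other task, then "B", and returns the order seen.
std::vector<std::string> demoOrder() {
  ToyScheduler s;
  std::vector<std::string> log;
  TaskPtr other = s.spawn([&]() -> TaskPtr { log.push_back("other"); return nullptr; });
  TaskPtr mine = s.spawn([&]() -> TaskPtr {
    log.push_back("A");
    TaskPtr cont = s.spawn([&]() -> TaskPtr { log.push_back("B"); return nullptr; });
    s.startAfter(other, cont);    // B waits for the other task
    return cont;
  });
  s.schedule(mine);
  s.schedule(other);
  s.run();
  return log;
}
```

The point of the sketch is that code B is never reached by spinning inside code A: it is a separate task that only becomes ready once the other task is done, so no cycle can appear through the call stack.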


Sunday, February 26, 2012

Mutt and auto-complete

Hello all,
I recently switched to imap gmail + mutt to handle my emails. One thing missing is address auto-completion. Mutt has a powerful and simple way to do that by setting up query_command in your .muttrc.

Add set query_command = "my_script %s" to your .muttrc. Pressing Ctrl-T while filling in an email address field then calls my_script with the prefix you just entered as its argument.

Largely inspired by this thread:
I wrote this quick and very dirty script that processes the imap cache and extracts addresses from it:

Here it is:

cat ~/.mutt/cache/bodies/imaps\:segovia.benjamin\\\:993/INBOX/* |\
grep --regex "<$1.*@.*\?>" |\
perl -pe "s|.*<$1([^@]*?)@(.*?)>.*|$1\1@\2|g"

The only thing you need to change is the path to the cache: replace it with your own.

Next step would be to combine that with abook to have both an address book and an imap cache parser to get emails.