Tuesday, October 11, 2011

Effect on data sharing between CPU cores


I wrote several benchmarks to stress the tasking system (yaTS) I recently developed and to check that it handles continuation / completion correctly.

Two of these tests spawn a binary tree of tasks.
You can see them in utests.cpp (CascadeNodeTask and NodeTask).

There is only one difference between the two:
  1. NodeTask. Each node completes the root: when a task finishes running, it decrements an atomic counter in the root task. When this counter reaches zero, the root is done.
  2. CascadeNodeTask. Each node completes its parent, so the tasks finish in a cascade. (This is the classical and efficient way to do it with a work-stealing approach, where tasks are processed in depth-first order.)
In (2), there is much less contention than in (1): in (1), the root's cache line, which contains the atomic counter, keeps traveling from core to core during the run.
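To make the two schemes concrete, here is a minimal sketch of what cascading completion might look like. The names (`Task`, `toFinish`, `finish()`) are mine for illustration, not the actual yaTS API:

```cpp
#include <atomic>
#include <cassert>

// Hypothetical sketch of cascading completion; illustrative only,
// not the actual yaTS interface.
struct Task {
  explicit Task(Task *parent_ = nullptr) : toFinish(1), parent(parent_) {
    if (parent) parent->toFinish.fetch_add(1); // register with the parent
  }
  std::atomic<int> toFinish; // the task itself plus its registered children
  Task *parent;

  // CascadeNodeTask style: when a task is done, it completes its parent,
  // which may in turn complete *its* parent, and so on up to the root.
  // NodeTask style would instead make every node decrement the root's
  // counter directly, so the root's cache line bounces between cores.
  void finish() {
    if (toFinish.fetch_sub(1) == 1 && parent)
      parent->finish();
  }
};
```

With this layout, most decrements in the cascade scheme hit a counter that was last touched by the same core, while the NodeTask scheme funnels every decrement through one hot cache line.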

This leads to interesting results on my i7 machine (4 cores / 8 threads):

  1. NodeTask
    1 thread == 237 ms
    8 threads == 213 ms
    Speed up == x1.1

  2. CascadeNodeTask
    1 thread == 237 ms
    8 threads == 54 ms
    Speed up == x4.4 (> 4 => Congratulations hyper-threading!)

This is basically the price of sharing.

(EDIT: also, do not read this as a performance "study". It is just a random but interesting performance difference I noticed while writing functional tests for the yaTS code.)


Jiao Lu said...

Cool. Any chance to test this code on an AMD 6-core CPU?

bouliiii said...

Take the code from here: