<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-4111162136844726038</id><updated>2012-01-07T13:26:36.382-08:00</updated><title type='text'>bouliiii's blog</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://bouliiii.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4111162136844726038/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://bouliiii.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>bouliiii</name><uri>http://www.blogger.com/profile/05579514597684961397</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/_LCZOtCM95u4/SKeDch2IuRI/AAAAAAAAABA/HPFyAVikdGU/S220/talon.jpg'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>15</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-4111162136844726038.post-6174199435111319626</id><published>2011-12-18T02:16:00.000-08:00</published><updated>2011-12-18T03:57:16.464-08:00</updated><title type='text'>Sauerbraten: a game and an engine that rock</title><content type='html'>Quick post about a fantastic game and engine: Sauerbraten (you may find it &lt;a href="http://sauerbraten.org/"&gt;here&lt;/a&gt;)&lt;br /&gt;If you are tired by design pattern BS, static type introspection with &lt;a href="http://en.wikipedia.org/wiki/Substitution_failure_is_not_an_error"&gt;SFINAE&lt;/a&gt; joke, over-design of everything just to print a string... Sauerbraten is for you.&lt;br /&gt;&lt;br /&gt;This is one of the code bases that recently impress me the most. Just straightforward and fantastic code. Everything is packed in 70,000 lines of code. Very nice custom (and really fast) UDP network code (&lt;a href="http://enet.bespin.org/"&gt;ENet&lt;/a&gt;), insanely small scripting engine, straight-to-point rendering code, very cool AI, superb UI (3d is even supported), simple but powerful shader system...&lt;br /&gt;&lt;br /&gt;Look at the code really and you will see what brutal and brilliant coding style means.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4111162136844726038-6174199435111319626?l=bouliiii.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://bouliiii.blogspot.com/feeds/6174199435111319626/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4111162136844726038&amp;postID=6174199435111319626' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4111162136844726038/posts/default/6174199435111319626'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4111162136844726038/posts/default/6174199435111319626'/><link rel='alternate' type='text/html' href='http://bouliiii.blogspot.com/2011/12/sauerbraten-game-and-engine-that-kick.html' title='Sauerbraten: a game and an engine that rock'/><author><name>bouliiii</name><uri>http://www.blogger.com/profile/05579514597684961397</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/_LCZOtCM95u4/SKeDch2IuRI/AAAAAAAAABA/HPFyAVikdGU/S220/talon.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4111162136844726038.post-2697790402623983660</id><published>2011-12-05T22:30:00.000-08:00</published><updated>2011-12-06T10:45:01.800-08:00</updated><title type='text'>Global variables and cvar in tasking systems...</title><content type='html'>Usually it is just better to avoid global variables but there is something I &lt;span style="font-style: italic;"&gt;really&lt;/span&gt; want in my small video game engine is a-la-quake console variables (cvar). They are handy and you can tweak your system in a very easy way.&lt;br /&gt;&lt;br /&gt;Problem if you have a fully distributed system changing a cvar from the console is just terrible. If you do some frame-overlapping (for example), you may completely screw your system (saying you decrease / increase the maximum number of particles on screen...). Well, cvars simply become nasty race conditions.&lt;br /&gt;&lt;br /&gt;Initially, I wanted to do something complicated ie some copy-on-write stuff. You have a fully reference counted copy of your cvars and when you modify them you copy them into a new reference counted state. Then, instead of reading your variables from the global state, you fetch them from your copy. As long as the copy is valid (ie no modification is done), you can keep it. So it is not that expensive.&lt;br /&gt;&lt;br /&gt;Problem is that it is just uber-complicated and your cvar are not global anymore and therefore not that handy. Too bad. The idea I finally decided to implement (not done yet) is to &lt;span style="font-style: italic;"&gt;lock&lt;/span&gt; the tasking system when you change a global cvar. Basically, you stop all the other threads: you simply need to make them sleep after they run one given task and once everyone except the locking thread is sleeping, you modify the variable.&lt;br /&gt;&lt;br /&gt;So it is something like:&lt;br /&gt;TaskingSystemLock();&lt;br /&gt;cvar = ...&lt;br /&gt;TaskingSystemUnlock();&lt;br /&gt;&lt;br /&gt;It is brutal and super expensive, but you don't care since this is really rare anyway. From the other thread perspectives (the ones that go to sleep), you absolutely know that nothing is going to happen while you run a task since the lock does &lt;span style="font-style: italic;"&gt;not&lt;/span&gt; cross the task boundary.&lt;br /&gt;&lt;br /&gt;So, for the usual path (ie you just read the cvar), you do not need any mutual exclusion since you know nothing is going to happen inside the task itself.&lt;br /&gt;&lt;br /&gt;It is going to take some time to have a real life test case in point-frag but the on-going implementation in yaTS seems to be straightforward.&lt;br /&gt;&lt;br /&gt;EDIT: Note also that several threads can request a lock simultaneously. This is not a problem and this is also properly serialized (like any other lock). Only thing is that all variables that may be modified globally have to be reloaded after the lock.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4111162136844726038-2697790402623983660?l=bouliiii.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://bouliiii.blogspot.com/feeds/2697790402623983660/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4111162136844726038&amp;postID=2697790402623983660' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4111162136844726038/posts/default/2697790402623983660'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4111162136844726038/posts/default/2697790402623983660'/><link rel='alternate' type='text/html' href='http://bouliiii.blogspot.com/2011/12/global-variables-and-car-in-tasking.html' title='Global variables and cvar in tasking systems...'/><author><name>bouliiii</name><uri>http://www.blogger.com/profile/05579514597684961397</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/_LCZOtCM95u4/SKeDch2IuRI/AAAAAAAAABA/HPFyAVikdGU/S220/talon.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4111162136844726038.post-5002188506569723485</id><published>2011-12-02T01:43:00.000-08:00</published><updated>2011-12-02T02:36:04.463-08:00</updated><title type='text'>Quick and dirty compiler tests</title><content type='html'>With point-frag, I am trying to support as many compilers and as many platform (well, Windows, Linux and soon Mac...) as I can. I hope to be able to do a benchmark from it at some point. I anyway quickly compared GCC, ICC and Clang on Linux with my small ray tracing test.&lt;br /&gt;Here the numbers for my nehalem i7 (4 cores / 8 threads, triple channel memory, basically the high end chip 3 years ago). Code is compiled for SSE2 and does not use anything more recent.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;GCC&lt;/span&gt; (4.6.2)&lt;br /&gt;ray packet: 90 million rays/s&lt;br /&gt;single rays: 9,9 million rays.s&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;ICC &lt;/span&gt;&lt;span style="font-size:100%;"&gt;(12.1.0)&lt;/span&gt;&lt;br /&gt;ray packet: 95.6 million rays/s&lt;br /&gt;single rays: 10.6 million rays.s&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&lt;span style="font-size:130%;"&gt;CLANG&lt;/span&gt; &lt;/span&gt;&lt;span style="font-size:100%;"&gt;(3.1 compiled from svn&lt;/span&gt;&lt;span style="font-size:100%;"&gt;)&lt;/span&gt;&lt;br /&gt;ray packet: 94.2 million rays/s&lt;br /&gt;single rays: 9.9 million rays.s&lt;br /&gt;&lt;br /&gt;With no surprise, the Intel compiler is slightly faster on a Intel processor than the other compilers. More surprisingly, the speed difference is quite &lt;span style="font-style: italic;"&gt;small&lt;/span&gt;. Two years ago, GCC / ICC difference was about 15% on the same machine. Latest clang also outputs some fast code.&lt;br /&gt;&lt;br /&gt;Good job open source compiler guys!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4111162136844726038-5002188506569723485?l=bouliiii.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://bouliiii.blogspot.com/feeds/5002188506569723485/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4111162136844726038&amp;postID=5002188506569723485' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4111162136844726038/posts/default/5002188506569723485'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4111162136844726038/posts/default/5002188506569723485'/><link rel='alternate' type='text/html' href='http://bouliiii.blogspot.com/2011/12/quick-and-dirty-compiler-tests.html' title='Quick and dirty compiler tests'/><author><name>bouliiii</name><uri>http://www.blogger.com/profile/05579514597684961397</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/_LCZOtCM95u4/SKeDch2IuRI/AAAAAAAAABA/HPFyAVikdGU/S220/talon.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4111162136844726038.post-5452428676977734328</id><published>2011-11-30T23:17:00.000-08:00</published><updated>2011-11-30T23:19:03.688-08:00</updated><title type='text'>ompf.org is dead. Long live ompf!</title><content type='html'>The best ray tracing / graphics related forum in the world ompf.org is down.&lt;br /&gt;Fortunately, Jacco Bikker sets up a new one here:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://igad2.nhtv.nl/ompf2" target="_blank"&gt;http://igad2.nhtv.nl/ompf2&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Cheers,&lt;br /&gt;Ben&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4111162136844726038-5452428676977734328?l=bouliiii.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://bouliiii.blogspot.com/feeds/5452428676977734328/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4111162136844726038&amp;postID=5452428676977734328' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4111162136844726038/posts/default/5452428676977734328'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4111162136844726038/posts/default/5452428676977734328'/><link rel='alternate' type='text/html' href='http://bouliiii.blogspot.com/2011/11/ompforg-is-dead-long-live-ompf.html' title='ompf.org is dead. Long live ompf!'/><author><name>bouliiii</name><uri>http://www.blogger.com/profile/05579514597684961397</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/_LCZOtCM95u4/SKeDch2IuRI/AAAAAAAAABA/HPFyAVikdGU/S220/talon.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4111162136844726038.post-2395926758725205708</id><published>2011-11-28T23:26:00.000-08:00</published><updated>2011-11-29T04:11:28.442-08:00</updated><title type='text'>Ray traced occlusion culling in point-frag</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/-ZtQePlG39S0/TtSgc1unYNI/AAAAAAAAAZM/E8eN1DSP67c/s1600/hiz.jpg"&gt;&lt;br /&gt;&lt;/a&gt;&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/-QjSnCzRxui0/TtSfCGEZq2I/AAAAAAAAAZA/_6ZpzLV_p90/s1600/arabic_city.jpg"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 400px;" src="http://4.bp.blogspot.com/-QjSnCzRxui0/TtSfCGEZq2I/AAAAAAAAAZA/_6ZpzLV_p90/s400/arabic_city.jpg" alt="" id="BLOGGER_PHOTO_ID_5680339888358665058" border="0" /&gt;&lt;/a&gt;&lt;span style="font-size:85%;"&gt;Arabic city (culled by the technique described here. The model can be found &lt;a href="http://sketchup.google.com/3dwarehouse/details?mid=809c028285022519b6b5161efde62bf9"&gt;here&lt;/a&gt;)&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;Hello all,&lt;br /&gt;I spent a decent amount of time implementing an occlusion culling technique for the video game engine I am developing: &lt;a href="http://code.google.com/p/point-frag/"&gt;point-frag&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:180%;"&gt;Various techniques&lt;/span&gt;&lt;br /&gt;Initially, I thought hard about the different techniques to go beyond the good old frustum technique. First idea that came up was to use occlusion queries and implement &lt;a href="http://www.cg.tuwien.ac.at/research/publications/2004/Bittner-2004-CHC/"&gt;CHC&lt;/a&gt; or &lt;a href="http://www.cg.tuwien.ac.at/research/publications/2008/mattausch-2008-CHC/"&gt;CHC++&lt;/a&gt; algorithms. Both sound good but they are pretty complicated. The first one does a bad job at reordering draw calls to decrease state changes and still badly suffers from stalls while the other one (the better version) is a combination of so many techniques (tighter bounding volumes, multi-queries, batching of already processed nodes, use of heuristics) that it already dissuades me to implement it. Both articles also show that this is not that easy to achieve something close to the optimal solution.&lt;br /&gt;&lt;br /&gt;Other idea is the classical precomputed potential visible sets (see the work done by Seth Teller &lt;a href="http://people.csail.mit.edu/seth/pubs/pubs.html"&gt;here&lt;/a&gt;). This was the technical solution used for Quake (BSP+PVS) and it was the direct implementation of Seth Teller's PhD. The solution is nice but I did not want anything that was too much time consuming like a preprocess since I would like to have a rendering pipeline as stupid as possible (i.e. rendering something which is as much as possible done at run-time and does not depend on anything precomputed (it will be also true for lighting BTW))&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:180%;"&gt;Software z-buffer&lt;/span&gt;&lt;br /&gt;This leads me to look for something simpler: basically the solution implemented in Ps3 Warhawk, ie a software occlusion technique using a CPU-computed z buffer. This is basically the solution implemented in Battlefields games. You have a lot of nice details &lt;a href="http://repi.blogspot.com/2011/03/dice-at-gdc11.html"&gt;here&lt;/a&gt;. Note that Killzone 3 also implements a similar solution. All PDF from Guerilla are &lt;a href="http://www.guerrilla-games.com/publications/"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;All these games basically consider a set of occluders usually provided by the artists. They implement a software tiled based rasterizer. Basically:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;You bin the triangles per tile. Many threads simply traverse each a sub-set of all occluders and find the set of tiles touched by each of them. You then maintain the list of occluders per tile. Note that it seems that KZ3 does not multi-thread this part&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Then, a bunch of tasks rasterizes the occluders in each tile. Since tiles do not influence each other, this is pretty straightforward&lt;/li&gt;&lt;/ul&gt;&lt;div style="text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-ZtQePlG39S0/TtSgc1unYNI/AAAAAAAAAZM/E8eN1DSP67c/s1600/hiz.jpg"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 256px; height: 128px;" src="http://1.bp.blogspot.com/-ZtQePlG39S0/TtSgc1unYNI/AAAAAAAAAZM/E8eN1DSP67c/s400/hiz.jpg" alt="" id="BLOGGER_PHOTO_ID_5680341447340417234" border="0" /&gt;&lt;/a&gt;&lt;span style="font-size:85%;"&gt;HiZ buffer (highest level) as computed&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;Once done, you have a z buffer that you can query to know if one given object is visible or not. Usually (this is what I did at least), you take its bounding box, clamp it onto the screen and consider the zmin value (ie the distance to the closest point in the bounding box).&lt;br /&gt;&lt;br /&gt;The technique is cool, simple and reasonably fast. There is no GPU dependency, you don't have to struggle with your drivers, everything is perfect.&lt;br /&gt;&lt;br /&gt;Except that I don't think rasterization is the perfect tool here :-), at least for me&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:180%;"&gt;Ray tracing to compute the z-buffer&lt;/span&gt;&lt;br /&gt;This is not supposed to be a rasterization vs ray tracing debate. The thing is that we speak here about &lt;span style="font-style: italic;"&gt;software&lt;/span&gt; rasterization and &lt;span style="font-style: italic;"&gt;software&lt;/span&gt; ray tracing. There is no debate at least right now for me with hardware rasterization. GPU rasterizers are fantastic and beautifully complex machines and GPU themselves are fantastic throughput processors.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Small buffers are better with ray tracing&lt;/span&gt;&lt;br /&gt;Software rasterizers are another story. If you look at KZ3 numbers, they are kind of bad. They need 2.5ms to complete the rasterization with 1500 triangles for 640x360 resolution. Battlefield uses even smaller z-buffer (256x114).&lt;br /&gt;(Note that a good old G80 is able to 200,000 triangles in 1ms in 1Kx1K)&lt;br /&gt;This is my first problem. For such a small buffer, rasterization is already bad. Rasterization plays nicer with big resolution in particular to offset the initial triangle setup cost. The smaller the screen, the more the setup steps are costly.&lt;br /&gt;With ray tracing, there is no such overhead&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;br /&gt;Dynamic occluders are not that a problem with few triangles&lt;/span&gt;&lt;br /&gt;For dynamic scenes, ray tracing is not really different from tiled based rasterizer. Indeed, a 3D spatial binning of the triangle before building your bounding volume hierarchy (BVH) is not really different from a 2D binning for tiled based rasterization. Actually, 3D binning is even simpler since there is &lt;span style="font-style: italic;"&gt;no&lt;/span&gt; perspective transformation at all to do on the vertices. You just bin the triangles in a grid. Once done, the BVH build is super fast. See Ingo's paper &lt;a href="http://www.sci.utah.edu/%7Ewald/Publications/2007///FastBuild/download//fastbuild.pdf"&gt;here&lt;/a&gt; for example. It is &lt;span style="font-style: italic;"&gt;really &lt;/span&gt;fast.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Ray packets are super fast&lt;/span&gt;&lt;br /&gt;This is the key. Once you  have a hierarchy over your occluders, you just create SIMD packets of rays and you traverse the BVH. This is &lt;span style="font-style: italic;"&gt;damn&lt;/span&gt; fast. point-frag includes a complete ray packet traversal (and also single ray traversal for future lighting ideas I want to play with). This is just super fast. For Fairy Forest model (see &lt;a href="http://www.sci.utah.edu/%7Ewald/animrep/"&gt;here&lt;/a&gt;), I easily achieve 95 millions rays/sec on a 2600K and even 35 millions rays/sec on a 4 cores Fusion AMD laptop. Once again, ray tracing in that case is really simple. There is no perspective transform, no clipping. Just purely simple 3D computations. No corner case.&lt;br /&gt;&lt;br /&gt;To be honest, I also implemented in the past a highly optimized tile based rasterizer on CPU. For Fairy Forest (about 180,000 triangles), on 1024x1024 it was only 30% faster that the highly optimized ray packet traversal. However, on 256x256, the ray packet tracer was 80% faster than the rasterizer with a much much simpler code. If you are interested, the DynBVH article by Ingo (Wald) is &lt;a href="http://www.sci.utah.edu/%7Ewald/Publications/2007///BVH/download//togbvh.pdf"&gt;here&lt;/a&gt;. It is a beautiful and simple algorithm. Really cool.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Ray traced occlusion culling is the only valid solution when you do NOT have artists&lt;/span&gt;&lt;br /&gt;Yep. This is my case right now. I only have me and google sketchup :-). So, I basically need to compute occlusions on the initial data. And once again, triangle setups and clipping is too expensive if you have 800,000 occluders (ie triangles). Ray tracing does not really care and this is really good.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Ray tracing acceleration structures can be small&lt;/span&gt;&lt;br /&gt;This is a usual problem with ray tracing. BVH + triangle soups are huge. Instead of triangle soup, you can use indexed face sets. This is better but still you duplicate your data. Fortunately, at least for static scenes, you can use a bounding volume hierarchy (BVH) as a compression scheme. Manfred (Ernst) and I, when I was still at Intel Labs, indeed demonstrated a simple quantization + compression technique that uses the BVH itself to hierarchically encode a complete geometry. The very good things are:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The decompression algorithm is so fast that you do not need to have a cache. You just decompress the structure on the fly at the same time you traverse it.&lt;/li&gt;&lt;li&gt;The compression rates are very good. About 5x compared to an indexed face set (index buffer + vertex buffer + BVH) and, if I remember properly, about 8x compared to a triangle soup + BVH&lt;/li&gt;&lt;/ul&gt;The article is &lt;a href="http://visual-computing.intel-research.net/publications/papers/2010/hmq/hmq.pdf"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;For all these reasons and limitations, I therefore implemented a ray traced z-buffer in point frag. You can actually see the code in the repository and even try it if you want to (be aware, I only tested the code on a very small set of machines and this requires OGL 3.3).&lt;br /&gt;&lt;br /&gt;Timing are pretty good. Computing a 256x128 z buffer over 400,000 triangles and culling 1,500 boxes takes 1ms on my nehalem machine. The ray tracing part is multi-threaded but not the culling part (yet). Both are however decently optimized. Finally, the overall rendering structure is simplistic today but I have to start somewhere.&lt;br /&gt;&lt;br /&gt;That's all folks :-)&lt;br /&gt;&lt;br /&gt;EDIT: note that if you use SPUs and you only have 1,000ish triangles, using ray tracing is really appealing. You asynchronously upload the BVH (possibly compressed) &lt;span style="font-style: italic;"&gt;entirely&lt;/span&gt; in the local storage and you traverse it at the speed of light :-)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4111162136844726038-2395926758725205708?l=bouliiii.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://bouliiii.blogspot.com/feeds/2395926758725205708/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4111162136844726038&amp;postID=2395926758725205708' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4111162136844726038/posts/default/2395926758725205708'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4111162136844726038/posts/default/2395926758725205708'/><link rel='alternate' type='text/html' href='http://bouliiii.blogspot.com/2011/11/ray-traced-occlusion-culling-in-point.html' title='Ray traced occlusion culling in point-frag'/><author><name>bouliiii</name><uri>http://www.blogger.com/profile/05579514597684961397</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/_LCZOtCM95u4/SKeDch2IuRI/AAAAAAAAABA/HPFyAVikdGU/S220/talon.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-QjSnCzRxui0/TtSfCGEZq2I/AAAAAAAAAZA/_6ZpzLV_p90/s72-c/arabic_city.jpg' height='72' width='72'/><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4111162136844726038.post-6294273750763880352</id><published>2011-11-15T01:12:00.000-08:00</published><updated>2011-11-15T03:47:24.663-08:00</updated><title type='text'>Exponential Propagation while using Distributed Job Systems</title><content type='html'>This is basically a common idiom with work-stealing like tasking systems. Problem with them is that you only have a local view of the complete system.&lt;br /&gt;&lt;br /&gt;Typical opposite approach is a shared FIFO of tasks. You possibly have a lot of contention, the memory behaviour is really bad (compared to work-stealing which is really good) but you have an exact view of the running system at each time. The advantage for example is the way to handle priorities. You just make several &lt;span style="font-style:italic;"&gt;shared&lt;/span&gt; priority queues and picking up the highest priority task is quite easy.&lt;br /&gt;&lt;br /&gt;To overcome (partially) these kinds of problems, a usual idiom with work-stealing-like systems is to exponentially create information. People usually call that "nuclear reaction" :-)&lt;br /&gt;&lt;br /&gt;Some examples:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Task sets with an atomic counter&lt;/span&gt;&lt;br /&gt;This is the approach I chose for yaTS. A task set is just a task to run &lt;span style="font-style: italic;"&gt;n &lt;/span&gt;&lt;span&gt;times&lt;/span&gt; possibly with as many threads as you need. All the threads share the task set atomic counter. The problem is that you need to make it run as quickly as possible on the other threads. The technique is just to reschedule the task &lt;span style="font-style: italic;"&gt;twice&lt;/span&gt; in its own queue to have this task stolen by other threads. Since all the other threads will also push the task back twice in their queues, you are sure that the task set will propagate exponentially across the queues&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Task sets with ranges&lt;/span&gt;&lt;br /&gt;This is the parallel_for approach used by TBB. Here you recursively subdivide the range &lt;span style="font-style: italic;"&gt;[0,n-1]&lt;/span&gt; by splitting it into two parts. This creates a tree of tasks which once again propagate exponentially across the threads&lt;br /&gt;&lt;br /&gt;(Note that the TBB approach is a bit nicer that the atomic version, since:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;You do not have a lot of contention on one counter&lt;/li&gt;&lt;li&gt;since you push many tasks, you go back to the scheduler more often. This is better when a higher priority task just appears on your local queues&lt;/li&gt;&lt;li&gt;you can load balance in an automatic way by choosing when you stop to recurse)&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-size:130%;"&gt;Highly reactive and fast thread yields and wakeups&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size:100%;"&gt;This is another issue with distributed queues: when does the thread need &lt;/span&gt;to sleep and when does it need to wake up? Once again, propagate information exponentially&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The threads go to sleep when they do not find anything to do after &lt;span style="font-style: italic;"&gt;n &lt;/span&gt;successive failed tries. Just use a condition wait for that.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;A thread that creates a new task randomly tries to wake up two threads. Just use a condition signal for that. Either many tasks are going to be created, and since all the threads wake up possibly two other threads, very quickly, every one is working. Or, no tasks or very few tasks are created, and awake threads quickly return to sleep.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;That is it. Ignoring some technical details to kill the threads and avoid dead locks, you can very quickly turn off and turn on your threads. Pretty cool to limit power consumption, to play nice with hyper-threading (it is really bad to spin for nothing while the other threads of your core are working for real) and to avoid any useless contention.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4111162136844726038-6294273750763880352?l=bouliiii.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://bouliiii.blogspot.com/feeds/6294273750763880352/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4111162136844726038&amp;postID=6294273750763880352' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4111162136844726038/posts/default/6294273750763880352'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4111162136844726038/posts/default/6294273750763880352'/><link rel='alternate' type='text/html' href='http://bouliiii.blogspot.com/2011/11/exponentially-propagate-everything-with.html' title='Exponential Propagation while using Distributed Job Systems'/><author><name>bouliiii</name><uri>http://www.blogger.com/profile/05579514597684961397</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/_LCZOtCM95u4/SKeDch2IuRI/AAAAAAAAABA/HPFyAVikdGU/S220/talon.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4111162136844726038.post-7843511790519361934</id><published>2011-11-06T00:12:00.000-07:00</published><updated>2011-11-06T04:04:16.542-08:00</updated><title type='text'>Point-frag: a distributed game engine</title><content type='html'>These days and weeks, I am developing a small game engine in my spare time. I personally spent 18 months in the video game industry. It was pretty nice and overall video games are very interesting since the amount of technical challenges is huge: rendering of course, streaming, network, AI, a lot of system programming and so on...&lt;br /&gt;&lt;br /&gt;So, after thinking about cool personnal projects I may do, I decided to play with a video game engine. The current code is here:&lt;br /&gt;&lt;a href="http://code.google.com/p/point-frag/"&gt;http://code.google.com/p/point-frag/&lt;/a&gt;&lt;br /&gt;(do not try to run it right now, it is changing quickly and I am the only developer so it is unstable)&lt;br /&gt;&lt;br /&gt;One of the first goals is to make it completely distributed: there is no heavy threads, no main thread or no renderer threads. Basically, *everything* is executing uniformly through the same tasking system (in that case, &lt;a href="http://code.google.com/p/yats/"&gt;yaTS&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;In all game engines I saw, several heavy threads usually exist. They are typically:&lt;br /&gt;- main thread which is basically a loop&lt;br /&gt;- renderer thread which run the OGL/DX/Gcm/whatever draw calls&lt;br /&gt;- streaming threads&lt;br /&gt;- audio thread&lt;br /&gt;Usually many of them communicate with a dedicated command buffer with the main thread which is usually connected to them in a unidirectional or bidirectional way. &lt;br /&gt;&lt;br /&gt;&lt;span style="font-style:italic;"&gt;Beside&lt;/span&gt; that, all other threads are worker threads and usually communicate with a task interface (like C functions or C++ class or whatever). They are a &lt;span style="font-style:italic;"&gt;resource&lt;/span&gt; that can be used at any time.&lt;br /&gt;&lt;br /&gt;This basically makes the thing somehow cumbersome and we have to deal with several levels of parallelism in the system.&lt;br /&gt;&lt;br /&gt;So, the main idea with point frag is to remove that and to have only worker threads:&lt;br /&gt;- there is no main thread&lt;br /&gt;- there is &lt;span style="font-style:italic;"&gt;no game loop&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Of course, typically you may need to pin some tasks to some threads (like something that does some OGL draw call), the system is flexible enough to run arbitrary task on arbitrary threads.&lt;br /&gt;&lt;br /&gt;In the code you can see today, I simply display a model and use the tasking system to load the textures (asynchronously) and compress them into DXT1... Nothing spectacular but I must start somewhere.&lt;br /&gt;&lt;br /&gt;However, the cool thing you may already see is the fact that the game loop is emulated by making frame_n spawn frame_{n+1}. This allows the system to continue in a continuation passing style way.&lt;br /&gt;&lt;br /&gt;The good thing is that since there is no game loop, you can then design your complete game as a pipeline. The idea is to subdivide your frame into sub-tasks and see each sub-task as an element of a pipeline. This allows to have partial frame overlapping or if you duplicate the pipeline (so you have twice, three times a frame), you can even increase more the parallelism (with more latency).&lt;br /&gt;&lt;br /&gt;Since everything is a task, it becomes easier to decrease (for example) the "renderer thread" (basically the one that has the OGL context) burden. Just subdivide the job to do to only have OGL calls in specialized OGL tasks. Everything else can go somewhere else.&lt;br /&gt;&lt;br /&gt;Unfortunately, nothing is perfect and IOs are a problem. By IO, I mean any asynchronous requests handled by the system. When you read a file in a non-blocking way, you must do something to cover the latency. In yaTS, you can ask the tasking system to run anything. It is convenient. &lt;br /&gt;&lt;br /&gt;However, the other task you call can also do an IO that can be much longer to complete than the IO before. This therefore blocks the calling task that performs the shorter IO.&lt;br /&gt;&lt;br /&gt;Fortunately, you can handle this problem with a tasking system a la yaTS / TBB. Basically, it is possible to emulate co-routines with tasks and to yield capabilities&lt;br /&gt;&lt;br /&gt;This is for another post (and that does not require assembly)&lt;br /&gt;:-)&lt;br /&gt;&lt;br /&gt;So, the no-heavy-threads approach may sound good but scheduling remains a damn problem. This is the good thing with thread over-subscription + blocking IOs: the system really helps you making the system progress in a relatively fair and time-shared environment. When you discard the system and do everything yourself, you are on your own &lt;br /&gt;&lt;br /&gt;Anyway, point frag is my laboratory for experiments regarding tasking, rendering, procedural generation... Let's get some fun&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4111162136844726038-7843511790519361934?l=bouliiii.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://bouliiii.blogspot.com/feeds/7843511790519361934/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4111162136844726038&amp;postID=7843511790519361934' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4111162136844726038/posts/default/7843511790519361934'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4111162136844726038/posts/default/7843511790519361934'/><link rel='alternate' type='text/html' href='http://bouliiii.blogspot.com/2011/11/point-frag-distributed-game-engine.html' title='Point-frag: a distributed game engine'/><author><name>bouliiii</name><uri>http://www.blogger.com/profile/05579514597684961397</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/_LCZOtCM95u4/SKeDch2IuRI/AAAAAAAAABA/HPFyAVikdGU/S220/talon.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4111162136844726038.post-1498522842930652489</id><published>2011-11-02T12:53:00.000-07:00</published><updated>2011-11-02T13:03:01.323-07:00</updated><title type='text'>To all HW vendors: do *not* thread your drivers</title><content type='html'>This is kind of crazy. Many drivers are now spawning worker threads to do that or that task. Please make your driver with as few contention as you can but *really* stop spawning threads behind the developer back.&lt;br /&gt;&lt;br /&gt;This just plays very badly with the application and now the developer has to deal with even more discrepancies from HW to HW and drivers to drivers possibly dealing with horrible thread over-subscription.&lt;br /&gt;&lt;br /&gt;The only one that knows if the application needs to have a multi-threaded back-end (for creation resource for instance) is the user of the API.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4111162136844726038-1498522842930652489?l=bouliiii.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://bouliiii.blogspot.com/feeds/1498522842930652489/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4111162136844726038&amp;postID=1498522842930652489' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4111162136844726038/posts/default/1498522842930652489'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4111162136844726038/posts/default/1498522842930652489'/><link rel='alternate' type='text/html' href='http://bouliiii.blogspot.com/2011/11/to-all-gpu-vendors-do-not-thread-your.html' title='To all HW vendors: do *not* thread your drivers'/><author><name>bouliiii</name><uri>http://www.blogger.com/profile/05579514597684961397</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/_LCZOtCM95u4/SKeDch2IuRI/AAAAAAAAABA/HPFyAVikdGU/S220/talon.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4111162136844726038.post-3464562943776458220</id><published>2011-10-11T21:45:00.000-07:00</published><updated>2011-10-31T19:27:11.858-07:00</updated><title type='text'>Effect on data sharing between CPU cores</title><content type='html'>Hello,&lt;br /&gt;&lt;br /&gt;I made several benchmarks to stress the tasking system (&lt;a href="http://code.google.com/p/yats/"&gt;yaTS&lt;/a&gt;) I recently developed and to see if it handles continuation / completion correctly.&lt;br /&gt;&lt;br /&gt;Two tests actually spawn a binary tree of tasks.&lt;br /&gt;You can see them in utests.cpp (CascadeNodeTask and NodeTask)&lt;br /&gt;&lt;br /&gt;There is only one difference between both:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;NodeTask. Here each node completes the &lt;span style="font-weight: bold;"&gt;root&lt;/span&gt;. This means that when a task just finishes to run, it decrements an atomic counter in the root task. When this counter becomes zero, the root is done&lt;/li&gt;&lt;li&gt;CascadeNodeTask. Here each node completes its &lt;span style="font-weight: bold;"&gt;parent&lt;/span&gt;. This basically means that the tasks finish in a cascade way (This is the classical and efficient way to do with work-stealing approach where the tasks are processed in depth-first order)&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;In (2), there is much less contentation than in (1) because in (1) the root cache line which contains the atomic is going to travel from cores to cores during the process.&lt;br /&gt;&lt;br /&gt;This leads to interesting results on my i7 machine (4 cores / 8 threads)&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;NodeTask&lt;br /&gt;1 thread == 237 ms&lt;br /&gt;8 threads == 213 ms&lt;br /&gt;Speed up == x1.1&lt;/li&gt;&lt;br /&gt;&lt;li&gt;CascadeNodeTask&lt;br /&gt;1 thread == 237 ms&lt;br /&gt;8 threads == 54 ms&lt;br /&gt;Speed up == x4.4 (&amp;gt; 4 =&amp;gt; Congratulations hyper-threading!)&lt;/li&gt;&lt;br /&gt;&lt;/ol&gt;This is basically the price to share.&lt;br /&gt;&lt;br /&gt;(EDIT, also do not see that as a performance "study". This is just a random  but interesting performance difference I saw while writing functional tests for yaTS code)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4111162136844726038-3464562943776458220?l=bouliiii.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://bouliiii.blogspot.com/feeds/3464562943776458220/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4111162136844726038&amp;postID=3464562943776458220' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4111162136844726038/posts/default/3464562943776458220'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4111162136844726038/posts/default/3464562943776458220'/><link rel='alternate' type='text/html' href='http://bouliiii.blogspot.com/2011/10/effect-on-cache-line-sharing-between.html' title='Effect on data sharing between CPU cores'/><author><name>bouliiii</name><uri>http://www.blogger.com/profile/05579514597684961397</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/_LCZOtCM95u4/SKeDch2IuRI/AAAAAAAAABA/HPFyAVikdGU/S220/talon.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4111162136844726038.post-1660598458721092687</id><published>2011-10-11T02:31:00.000-07:00</published><updated>2011-10-30T06:14:24.358-07:00</updated><title type='text'>Disk asynchronous IO mess on Linux and libaio</title><content type='html'>Damn there is something which is really imperfect with Linux: Asynchronous Disk IO. Basically, you want to issue a read or write and you want the function to return immediately.&lt;br /&gt;&lt;br /&gt;Well, two classical solutions:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;You open a file with O_NONBLOCK flag. No luck, the system does *&lt;span style="font-weight: bold;"&gt;not&lt;/span&gt;* need to open it as non-blocking actually. And after a test, the system actually blocks *&lt;span style="font-weight: bold;"&gt;anyway&lt;/span&gt;* when you open a disk file with O_NONBLOCK file. See &lt;a href="http://linux.die.net/man/2/open"&gt;here&lt;/a&gt; for a documentation&lt;br /&gt;&lt;/li&gt;&lt;li&gt;You use the portable posix AIO library like described &lt;a href="http://www.bullopensource.org/posix/"&gt;here&lt;/a&gt;. Well, very bad thing, the implementation is done in user mode using pthread to emulate everything. Terrible for any high performing game like application where you do not want any kind of thread oversubscription&lt;/li&gt;&lt;/ol&gt;Looking for the perfect answer, there is finally a non-portable way to do &lt;span style="font-weight: bold;"&gt;exactly&lt;/span&gt; what I want. This is called libaio and this is just a thin layer above some Linux native system calls.&lt;br /&gt;&lt;br /&gt;There is a quick and dirty example of libaio &lt;a href="http://code.google.com/p/quick-and-dirty-libaio-example/source/browse/trunk/aio.cpp"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;So,&lt;br /&gt;&lt;ol&gt;&lt;li&gt;libaio does not spawn any thread behind your back&lt;/li&gt;&lt;li&gt;libaio is really asynchronous&lt;/li&gt;&lt;li&gt;libaio can batch some number IO jobs in one request&lt;/li&gt;&lt;/ol&gt;So, this is asynchronous IOs done right :-)&lt;br /&gt;&lt;br /&gt;EDIT: but libaio does not seem to bufferize anything. I am wondering if there is a asynchronous buffered solution on Linux for disk IO even if in my case (streaming textures / models) unbuffered IOs should be OK. Well, that may require manual handling of buffer for big data you may not want to have fully on RAM before processing it. We will see :-)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4111162136844726038-1660598458721092687?l=bouliiii.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://bouliiii.blogspot.com/feeds/1660598458721092687/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4111162136844726038&amp;postID=1660598458721092687' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4111162136844726038/posts/default/1660598458721092687'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4111162136844726038/posts/default/1660598458721092687'/><link rel='alternate' type='text/html' href='http://bouliiii.blogspot.com/2011/10/disk-asynchronous-io-mess-on-linux-and.html' title='Disk asynchronous IO mess on Linux and libaio'/><author><name>bouliiii</name><uri>http://www.blogger.com/profile/05579514597684961397</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/_LCZOtCM95u4/SKeDch2IuRI/AAAAAAAAABA/HPFyAVikdGU/S220/talon.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4111162136844726038.post-1482118787172638883</id><published>2011-10-09T00:06:00.000-07:00</published><updated>2011-10-09T00:20:04.018-07:00</updated><title type='text'>yaTS - yet another Tasking System</title><content type='html'>Hello all,&lt;br /&gt;&lt;br /&gt;some real posts after a *long* time. Well, basically, I want to only post anything related to some piece of software I write and publish.&lt;br /&gt;&lt;br /&gt;So, I think I will publish a small serie of posts related to a small library I wrote: yaTS which is here:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://code.google.com/p/yats/"&gt;http://code.google.com/p/yats/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;As I am extremely lazy for this first post, I will mostly take what I wrote in tasking.hpp :)&lt;br /&gt;&lt;br /&gt;The idea here is to present what the tasking system does.&lt;br /&gt;&lt;br /&gt;Basically, a "tasking system" offers the possibility to schedule and asynchronously run functions in shared memory "system threads". This is basically a thread pool.&lt;br /&gt;&lt;br /&gt;However, yaTS tries to propose more in this API by letting the user:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Define *dependencies* between tasks&lt;/li&gt;&lt;li&gt;Setup priorities for each task ie a higher priority task will be more likely executed than a lower priority one&lt;/li&gt;&lt;li&gt;Setup affinities for each of them ie a task can be "pinned" on some specific hardware thread (typically useful when something depends on a context like an OpenGL context)&lt;/li&gt;&lt;/ol&gt;The core of this tasking system is a "Task". A task represents a function to call (to run) later. Each task can specify dependencies in two ways:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;"Start dependencies" specified (see below) by Task::starts. Basically, to be able to start, a task must have all its start dependencies *ended*&lt;/li&gt;&lt;li&gt;"End dependencies" specified (see below) by Task::ends. In that case, tgo be able to finish, a task must have all its end dependencies *ended*&lt;/li&gt;&lt;/ol&gt;So,&lt;span style="font-family:lucida grande;"&gt; task1-&amp;gt;starts(task2)&lt;/span&gt; means that task2 cannot start before task1 is ended&lt;br /&gt;Also, task3-&amp;gt;ends(task4) means that task4 cannot end before task3 is ended&lt;br /&gt;Note that each task can only start one task and can only end one task&lt;br /&gt;&lt;br /&gt;Specifying dependencies in that way allows the user to *dynamically* (ie during the task execution) create a direct acyclic graph of tasks (DAG). One may look at the unit tests to see how it basically works.&lt;br /&gt;&lt;br /&gt;yaTS also classicaly implements a TaskSet which is a function which can be run n times (concurrently on any number of threads). TaskSet are a particularly efficient way to logically create n tasks in one chunk.&lt;br /&gt;&lt;br /&gt;So, yaTS is somehow similar in TBB or other tasking systems.&lt;br /&gt;However, I tried hard to make it small and I tried to have a API which is not too complicated and also reasonably powerful.&lt;br /&gt;&lt;br /&gt;As you may see in utests.cpp, writing dynamic graphs of tasks is easy. You may use classical continuation-like tasking systems or wait for a task completion.&lt;br /&gt;&lt;br /&gt;You may "pin" some task onto a particular HW threads. This will be useful when dealing with graphics API or anything with a context attached to a thread.&lt;br /&gt;&lt;br /&gt;In the next posts, I would like to present a bit the current implementation details. Later, and as soon as the code is started, I would like to present how we can use this library to design a "game-loop-less" game engine.&lt;br /&gt;&lt;br /&gt;Ben&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4111162136844726038-1482118787172638883?l=bouliiii.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://bouliiii.blogspot.com/feeds/1482118787172638883/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4111162136844726038&amp;postID=1482118787172638883' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4111162136844726038/posts/default/1482118787172638883'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4111162136844726038/posts/default/1482118787172638883'/><link rel='alternate' type='text/html' href='http://bouliiii.blogspot.com/2011/10/yats-yet-another-tasking-system.html' title='yaTS - yet another Tasking System'/><author><name>bouliiii</name><uri>http://www.blogger.com/profile/05579514597684961397</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/_LCZOtCM95u4/SKeDch2IuRI/AAAAAAAAABA/HPFyAVikdGU/S220/talon.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4111162136844726038.post-3609221884142221306</id><published>2011-09-06T03:07:00.000-07:00</published><updated>2011-09-06T03:19:22.276-07:00</updated><title type='text'>Blog Resurrection Attempt</title><content type='html'>Hello all,&lt;br /&gt;&lt;br /&gt;This blog has been dead for a long time. I am trying to resurrect it :-)&lt;br /&gt;&lt;br /&gt;I recently decided to work on a new personal project, a small video game engine. I basically extracted some code from the very nice Intel open source project, "Embree" (from Manfred Ernst and Sven Woop, outstanding guys I had the chance to work with in the past) you may find here:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://software.intel.com/en-us/articles/embree-photo-realistic-ray-tracing-kernels/"&gt;http://software.intel.com/en-us/articles/embree-photo-realistic-ray-tracing-kernels/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Then, I quickly wrote some very basic pieces of code I threw here:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://code.google.com/p/point-frag/"&gt;http://code.google.com/p/point-frag/&lt;br /&gt;&lt;/a&gt;&lt;br /&gt;Nothing spectular actually.&lt;br /&gt;&lt;br /&gt;Anyway, the first thing I would like to try is to build a "symetrical" multi-threading system. Basically, for several reasons (historical or because some hardware contexts are bound to one thread in particular) , games still use both "heavy" threads like the rendering thread or the main thread. The idea I would like to try is to build a tasking system where only worker threads exist (ie the main thread becomes a regular worker) and to replace every game / render loop by job pipelines.&lt;br /&gt;&lt;br /&gt;Stay tuned :-)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4111162136844726038-3609221884142221306?l=bouliiii.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://bouliiii.blogspot.com/feeds/3609221884142221306/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4111162136844726038&amp;postID=3609221884142221306' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4111162136844726038/posts/default/3609221884142221306'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4111162136844726038/posts/default/3609221884142221306'/><link rel='alternate' type='text/html' href='http://bouliiii.blogspot.com/2011/09/attempt-to-blog-resurrection.html' title='Blog Resurrection Attempt'/><author><name>bouliiii</name><uri>http://www.blogger.com/profile/05579514597684961397</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/_LCZOtCM95u4/SKeDch2IuRI/AAAAAAAAABA/HPFyAVikdGU/S220/talon.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4111162136844726038.post-9039263770861597351</id><published>2008-08-31T09:02:00.000-07:00</published><updated>2011-08-19T03:28:50.492-07:00</updated><title type='text'>Brute Force Tesselation and Quadratic Approximation of Subdivision Surfaces</title><content type='html'>&lt;div style="text-align: justify;"&gt;&lt;span style="font-size:100%;"&gt;**Update** This post is somehow outdated and I just lost this small demo. Sorry for that.&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;Hello there, next post!&lt;br /&gt;Here no relationship with ray tracing; on the contrary, it is all about rasterization and tesselation.&lt;br /&gt;First, some screen shots of the demo provided with this post:&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_LCZOtCM95u4/SLrndvnPoYI/AAAAAAAAABY/lmLkarXkl1Y/s1600-h/std.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://2.bp.blogspot.com/_LCZOtCM95u4/SLrndvnPoYI/AAAAAAAAABY/lmLkarXkl1Y/s400/std.jpg" alt="" id="BLOGGER_PHOTO_ID_5240755614582284674" border="0" /&gt;&lt;/a&gt;&lt;span style="font-size:85%;"&gt;The standard mesh wi&lt;/span&gt;&lt;span style="font-size:85%;"&gt;th no tesselation (2408 triangles)&lt;/span&gt;&lt;br /&gt;___&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_LCZOtCM95u4/SLrnr4kptjI/AAAAAAAAABg/dZaJXzknT4c/s1600-h/wireframe.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://1.bp.blogspot.com/_LCZOtCM95u4/SLrnr4kptjI/AAAAAAAAABg/dZaJXzknT4c/s400/wireframe.jpg" alt="" id="BLOGGER_PHOTO_ID_5240755857505498674" border="0" /&gt;&lt;/a&gt;&lt;span style="font-size:85%;"&gt;A close up on the tesselated surface in wireframe mode (10,000,000 triangles)&lt;/span&gt;&lt;span style="font-size:85%;"&gt;. About 200 millions triangles per second are displayed on my low-end 8400GS&lt;/span&gt;&lt;span style="font-size:85%;"&gt; (Here the frame rate is lower due to the wireframe mode slow on Nvidia)&lt;/span&gt;&lt;br /&gt;__&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_LCZOtCM95u4/SLroKf1XpiI/AAAAAAAAABo/CX5YKKFRPhI/s1600-h/dis1.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://1.bp.blogspot.com/_LCZOtCM95u4/SLroKf1XpiI/AAAAAAAAABo/CX5YKKFRPhI/s400/dis1.jpg" alt="" id="BLOGGER_PHOTO_ID_5240756383440676386" border="0" /&gt;&lt;/a&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_LCZOtCM95u4/SLrohhK_rfI/AAAAAAAAABw/T1WBXTmFDUc/s1600-h/dis2.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://2.bp.blogspot.com/_LCZOtCM95u4/SLrohhK_rfI/AAAAAAAAABw/T1WBXTmFDUc/s400/dis2.jpg" alt="" id="BLOGGER_PHOTO_ID_5240756778936806898" border="0" /&gt;&lt;/a&gt;&lt;span style="font-size:85%;"&gt;A simple procedural displacement is applied on the tesselated mesh&lt;/span&gt;&lt;br /&gt;__&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;This technique actually relies on several techniques developed by &lt;a href="http://user.cs.tu-berlin.de/%7Eboubek/index.html"&gt;Tamy Boubekeur&lt;/a&gt;.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-size:130%;"&gt;Patch Tesselation&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div style="text-align: justify;"&gt;The first thing to do is to tesselate the input mesh. To achieve this, we simply replace each triangle by a patch, i.e. a set of barycentric coordinate triangles. Then, instead of displaying each triangle, we replace each of them by an &lt;span style="font-style: italic;"&gt;instanciated &lt;/span&gt;patch. In the demo, we therefore create two vertex streams. The first one contains all the data for each triangle and another one simply contains the barycentric coordinates of the triangles of a patch. Then, the instanciation ability of Direct3D allows us to repeat the patch along the surface of the object. For more details about the technique, may one refer to the Graphics Hardware '05 article by Tamy &lt;a href="http://iparla.labri.fr/publications/2005/BS05/"&gt;here&lt;/a&gt;. However, in this article, no Direct3D instanciation is used.&lt;br /&gt;&lt;br /&gt;So, why is this method good? Actually, if the patches are large enough (i.e. if the tesselation rate is high), the pre-transform and post-transform caches used by the GPUs become more and more efficient so that we directly get the maximum theoretical performance of the rasterizers. For example, the frame rate does not change when you use 64 times more triangles. In fact, we can expect the same level of performance for the GPUs which will support tesselation units.without the need to use any hand-coded patches. In my 8400GS, I got 200 millions triangles per second but we may easily expect 1 billion triangles per second on DirectX11 GPUs.&lt;br /&gt;&lt;br /&gt;By the way, I use in the demo a simple optimization of the patch using the very good and simple indexation optimizer, "Tipsify" (you may find the reference &lt;a href="http://www.cse.ust.hk/%7Epsander/docs/tipsy.pdf"&gt;here&lt;/a&gt;). This reorders the patch indexation in linear time using only triangle / point adjacency while maintaining a very good level of performance (and this is much simpler and much faster that stripification algorithms like TriStrip or NvStrip).&lt;br /&gt;&lt;/div&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-size:130%;"&gt;Approximation of Subdivision Surfaces&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div style="text-align: justify;"&gt;The tesselation algorithm via the Direct3D instanciation is the core system to provide many triangles. However, the way triangles are generated (via tesselated patches) does not seem appropriate to the evaluation of subdivision surfaces since they are complex recursive algorithms. (By the way, one may find &lt;span style="font-weight: bold; font-style: italic;"&gt;excellent&lt;/span&gt; references and details on subdivision surfaces in the &lt;a href="http://en.wikipedia.org/wiki/AWESOM-O"&gt;awesomo&lt;/a&gt; tutorial located &lt;a href="http://www.mrl.nyu.edu/publications/subdiv-course2000/"&gt;here&lt;/a&gt;). In the article, QAS (located &lt;a href="http://iparla.labri.fr/publications/2007/BS07c/"&gt;here&lt;/a&gt;), Tamy proposes to replace the recursive evaluation of the subdivision surface by an analytical &lt;span style="font-style: italic;"&gt;approximation&lt;/span&gt;. The principle is pretty simple: first, project the triangles vertices and the middle of every edge on the limit surface (i.e. the subdivision surface) and compute a quadratic Bezier triangle which fits these 6 points (the vertices and the edge middle points). This leads to a pretty fast and nice approximation of subdivsion surfaces.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-size:130%;"&gt;About Adaptive Tesselation&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;In &lt;a href="http://iparla.labri.fr/publications/2008/BS08/"&gt;this&lt;/a&gt; article, Tamy proposes a very practical way to adaptively tesselate patches without cracks. This may lead for example to simple terrain rendering algorithms (if we ignore the possible memory problems which may require texture mip-mapping or page faulting systems). However, adaptive tesselation on &lt;span style="font-style: italic;"&gt;current&lt;/span&gt; hardware (or rather APIs since Radeon already have tesselation units) may be less efficient for some cases since it requires to change the patches and therefore to queue more draw calls. However, with nicer APIs (like DX11), this will lead to super efficient solutions.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-size:130%;"&gt;Some Details about Asset Production&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;What can we do with displacement? Actually, many tools already provide the assets. Using &lt;a href="http://www.pixologic.com/home.php"&gt;ZBrush&lt;/a&gt;, we already have the mesh + normal map + displacement map. Rendering this is simple: tesselate the mesh a subdivision surface approximation, displace it and light it correctly using the normal maps in the shader. Another super nice tool would to extend the mega-texture technique (the original article by Sylvain Lefebvre may be found &lt;a href="http://www.inria.fr/rrrt/rr-5210.html"&gt;here&lt;/a&gt;) to directly paint displacement on objects.  However, the QAS technique may suffer from serious drawback when the artists decide to use creasing. In a modeler like Maya, the artist may work in subdivsion surface mode. He can view the &lt;span style="font-style: italic;"&gt;tesselated &lt;/span&gt;mesh while edition the control mesh. He can then crease triangles, edges or vertices in &lt;span style="font-style: italic;"&gt;any&lt;/span&gt; level of tesselation and then see the resulting subdisvion surface. For this kind of input, the QAS algorithm requires deep modification. Nevertheless,  even if it seems necessary to provide ways to crease the geometry, none of the artists I work with produces the asset in this mode (A ZBrush approach + polygonal mode edition is generally the way they produce meshes).&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-size:130%;"&gt;The Demo&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-size:100%;"&gt;Here is the demo.&lt;/span&gt;&lt;span style="font-size:100%;"&gt; This demo does not include displacement maps but I quickly hacked a stupid procedural displacement in the shader (that you may change). Using a displacement map does not change the performance on a 8800GT (due to the large number of texture units on it). On the 8600GT I have in office, the application becomes texture bound. I do not provide the  complete source code yet since it is a bit in a dirty state. I will however try to post it ASAP.&lt;br /&gt;&lt;br /&gt;Enjoy!&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4111162136844726038-9039263770861597351?l=bouliiii.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://bouliiii.blogspot.com/feeds/9039263770861597351/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4111162136844726038&amp;postID=9039263770861597351' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4111162136844726038/posts/default/9039263770861597351'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4111162136844726038/posts/default/9039263770861597351'/><link rel='alternate' type='text/html' href='http://bouliiii.blogspot.com/2008/08/quadratic-approximation-of-subdivision.html' title='Brute Force Tesselation and Quadratic Approximation of Subdivision Surfaces'/><author><name>bouliiii</name><uri>http://www.blogger.com/profile/05579514597684961397</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/_LCZOtCM95u4/SKeDch2IuRI/AAAAAAAAABA/HPFyAVikdGU/S220/talon.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_LCZOtCM95u4/SLrndvnPoYI/AAAAAAAAABY/lmLkarXkl1Y/s72-c/std.jpg' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4111162136844726038.post-6139475199707399721</id><published>2008-08-16T16:49:00.000-07:00</published><updated>2011-11-15T02:15:00.586-08:00</updated><title type='text'>Cuda real time ray tracing - 100 millions ray/s?</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_LCZOtCM95u4/SKdvCmsMQlI/AAAAAAAAAAU/DZzHzhZ-L9M/s1600-h/shadows.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://4.bp.blogspot.com/_LCZOtCM95u4/SKdvCmsMQlI/AAAAAAAAAAU/DZzHzhZ-L9M/s320/shadows.jpg" alt="" id="BLOGGER_PHOTO_ID_5235275182377812562" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;&lt;span style="font-size:100%;"&gt;**Update** This post is somehow outdated. The code of Radius-Cuda may be found on the internet I think. Look at ompf.org or similar places. On my side, I lost the code. Sorry :-( &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Ray Tracing!&lt;br /&gt;There is a big buzz around these words nowadays and flamewars are everywhere. What I can however say is that I spent a long time writing ray tracers, Monte-Carlo renderers and: it is fun!&lt;br /&gt;&lt;br /&gt;I recently decided to write a ray tracer using &lt;a href="http://www.nvidia.com/object/cuda_home.html"&gt;Cuda&lt;/a&gt;. There were two reasons for this. The first one is that I never wrote one line of Cuda. The second one is that ray tracing was said to be slow on &lt;a href="http://www.blogger.com/en.wikipedia.org/wiki/GeForce_8"&gt;G80&lt;/a&gt; graphics cards due the heavy use of caches and stacks during the accelaration structure traversal generally required by ray tracers.&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_LCZOtCM95u4/SKdzBgRZmII/AAAAAAAAAAc/zFrvpFsdEcI/s1600-h/noshading.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://1.bp.blogspot.com/_LCZOtCM95u4/SKdzBgRZmII/AAAAAAAAAAc/zFrvpFsdEcI/s320/noshading.jpg" alt="" id="BLOGGER_PHOTO_ID_5235279561521469570" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;As usual, I first fetched all the articles I could find on the net, downloaded Cuda, installed it, runned some samples and began to code the thing. As I needed a minimalist ray tracing framework, I decided to drop my old ray tracer, yaCORT, and installed &lt;a href="https://gna.org/projects/radius/"&gt;Radius&lt;/a&gt;, a small but pretty fast ray tracer developed by Thierry Berger-Perrin. As I had moreover already developed a Bounding Volume Hierarchy on it, it was perfect to inject a Cuda ray tracer with it.&lt;br /&gt;&lt;br /&gt;I therefore decided to write a kd-tree traversal and first results were disappointing. Fortunately, with the help of Brian Budge (who posts as me, on the uber-excellent ray tracing forum &lt;a href="http://ompf.org/forum/"&gt;ompf.org&lt;/a&gt;), I managed to get quite good performance.&lt;br /&gt;&lt;/div&gt;&lt;ul style="text-align: justify;"&gt;&lt;li&gt;&lt;span style="font-size:130%;"&gt;Summary of the implementation&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ol style="text-align: justify;"&gt;&lt;li&gt;a stack per ray: no need to synchronize rays using a lot of __syncthreads(). The cost is that the stack will be necessary on local memory (&lt;span style="font-style: italic;"&gt;slow&lt;/span&gt;) but we do not need anymore to handle ray packets or things like that as done with shared stacks or SIMD ray tracing. The Cuda compiler handle itself the SIMD packing and the branches in the code.&lt;/li&gt;&lt;li&gt;Everything is in textures to cache the access to the kd-tree nodes, the triangles and so on. This is much simpler and faster than coalesing accesses to global memory since no synchro are needed.&lt;/li&gt;&lt;li&gt;KILL ALL UNIONS. Unions are bad even in C (and forbidden in C++ as soon as there is a : constructor). This command goes with this one: KILL ALL ARRAYS unless you put it on shared memory. The cuda compiler seems to put stack allocated array (stg like float my_array[4]) on local memory and accesses will just kill your app. It may work with indirection known at compile time but otherwise, this is just BAD.&lt;/li&gt;&lt;li&gt;Use dirty switches. In my implemention, I need to perform circular permutations on the coordinates (due to the kd-tree traversal and the Baduel / Wald ray-tri intersection). I tested two approaches: an shared array strategy with a simple modulo computation where all the rays are put in shared memory,  and a coding horror switch permutation (see the code). The second one was faster. The reason seems to be that in this case, a register allocation (allocating rays on the stack requires 6 more registers) is more efficient that a shared array allocation.&lt;/li&gt;&lt;/ol&gt;&lt;ul style="text-align: justify;"&gt;&lt;li&gt;&lt;span style="font-size:130%;"&gt;Performance&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ol style="text-align: justify;"&gt;&lt;li&gt;&lt;span style="font-size:100%;"&gt;The performance are quite good. I am currently developing on a laptop with the low end GeForce 8400GS but made some preliminary tests on a 8800 GT. With the last optimizations I made, I think that between 12 millions rays / s and 40 millions rays/s may be computed on a GeForce 8800GT on the demo I give in the code. I will test soon the code on a GTX280 and without being too optimistic, I think that 100 millions rays / sec may be generated on it. This may be one of the fastest ray tracers wrote on a single processor.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-size:100%;"&gt;The other good point is that the memory footprint is much smaller that a CPU ray tracer. Since the GPU is better at computing than at traversing large structures, a small kd-tree has the best performance. For 180,000 triangles, only 5MB are required for the kd-tree.&lt;/span&gt;&lt;/li&gt;&lt;/ol&gt;&lt;ul style="text-align: justify;"&gt;&lt;li&gt;&lt;span style="font-size:130%;"&gt;About Cuda&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div style="text-align: justify;"&gt;After the first deceptions due to my poor knowledge of the API and the compiler behaviour, I think that Cuda is a nice API, easy to use. G80 architectures are quite impressive and for a non-friendly application like ray tracing, speed is pretty good with no need to write one line of SIMD code. The bad point is that there is no way to have a real low-level access. After this nice experience, I am really eager to see what the upcoming &lt;a href="http://en.wikipedia.org/wiki/Larrabee_%28GPU%29"&gt;Larrabee&lt;/a&gt; will be able to do with ray tracing. The good point is that Intel will certaintly provide both a high level API for fast prototyping (a la Cuda or the promising &lt;a href="http://en.wikipedia.org/wiki/OpenCL"&gt;OpenCL&lt;/a&gt;) and a standard compiler / assembler since Larrabee cores are just x86s.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;span style="font-size:130%;"&gt;**Update** Source code is ... no more available on the previously posted link. I think you may find somewhere on ompf.org&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4111162136844726038-6139475199707399721?l=bouliiii.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://bouliiii.blogspot.com/feeds/6139475199707399721/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4111162136844726038&amp;postID=6139475199707399721' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4111162136844726038/posts/default/6139475199707399721'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4111162136844726038/posts/default/6139475199707399721'/><link rel='alternate' type='text/html' href='http://bouliiii.blogspot.com/2008/08/real-time-ray-tracing-with-cuda-100.html' title='Cuda real time ray tracing - 100 millions ray/s?'/><author><name>bouliiii</name><uri>http://www.blogger.com/profile/05579514597684961397</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/_LCZOtCM95u4/SKeDch2IuRI/AAAAAAAAABA/HPFyAVikdGU/S220/talon.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_LCZOtCM95u4/SKdvCmsMQlI/AAAAAAAAAAU/DZzHzhZ-L9M/s72-c/shadows.jpg' height='72' width='72'/><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4111162136844726038.post-4796368320525313493</id><published>2008-07-29T15:37:00.000-07:00</published><updated>2008-08-16T19:14:45.931-07:00</updated><title type='text'>Hello world</title><content type='html'>&lt;div style="text-align: justify;"&gt;Hello boys and girls.&lt;br /&gt;I decided to open a Blog to keep on coding personal small projects, reading new stuffs and learning many things on computer science. This blog is therefore a pratical way for me to present small source codes or demos I am working on. I really hope you will appreciate it. By the way, you may still find my old home page from my PhD: there are some old stuffs on it which may be a bit interesting. The URL:&lt;br /&gt;&lt;a href="http://www710.univ-lyon1.fr/%7Ebsegovia/"&gt;http://www710.univ-lyon1.fr/~bsegovia/&lt;/a&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4111162136844726038-4796368320525313493?l=bouliiii.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://bouliiii.blogspot.com/feeds/4796368320525313493/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4111162136844726038&amp;postID=4796368320525313493' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4111162136844726038/posts/default/4796368320525313493'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4111162136844726038/posts/default/4796368320525313493'/><link rel='alternate' type='text/html' href='http://bouliiii.blogspot.com/2008/07/hello-world.html' title='Hello world'/><author><name>bouliiii</name><uri>http://www.blogger.com/profile/05579514597684961397</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/_LCZOtCM95u4/SKeDch2IuRI/AAAAAAAAABA/HPFyAVikdGU/S220/talon.jpg'/></author><thr:total>0</thr:total></entry></feed>
