Saturday, May 2, 2009

#pragma omp

I have reorganized RibRender to run multi-threaded, using OpenMP.

Once data conflicts were out of the way and the bucket system was in place (buckets are the screen areas that serve as units of work), all it took to get a 2.5x performance improvement on a 4-core machine was a well-placed #pragma omp parallel for.

Something like this:

#pragma omp parallel for
for (int bi=0; bi < bucketsN; ++bi)
    renderBucket_s( mHider, mHider.mpBuckets[ bi ] );

It was important, however, to make sure that primitives are assigned to buckets outside the loop.
This means that the simplify and split phases are outside the OpenMP-threaded loop.
Simplify and split could be multi-threaded themselves, but so far the results show that the actual bucket rendering (currently only dicing and shading) is the slowest phase by far.
We'll have to see later whether that still holds with complex scenes instead of basic test files.
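
A minimal sketch of that ordering, with the serial assignment first and the threaded rendering second. Prim, Bucket, assignPrims, renderBucket and renderAll are hypothetical stand-ins, not RibRender's actual classes, and the "rendering" is just a counter. It also assumes each primitive lands in exactly one bucket, which a real renderer would not guarantee.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the renderer's types (not RibRender's real ones).
struct Prim { int x, y; };               // screen position of a primitive

struct Bucket {
    int x0, y0, x1, y1;                  // screen rectangle [x0,x1) x [y0,y1)
    std::vector<const Prim*> prims;      // filled serially, only read while rendering
    int shaded;                          // written only by the thread rendering this bucket

    Bucket(int x0_, int y0_, int x1_, int y1_)
        : x0(x0_), y0(y0_), x1(x1_), y1(y1_), shaded(0)
    {
        assert(x0 < x1 && y0 < y1);      // reject empty rectangles
    }
};

// Serial phase: assign every primitive to the bucket that contains it.
// This runs before any thread is started, so the push_back calls cannot race.
void assignPrims(std::vector<Bucket>& buckets, const std::vector<Prim>& prims)
{
    for (const Prim& p : prims)
        for (Bucket& b : buckets)
            if (p.x >= b.x0 && p.x < b.x1 && p.y >= b.y0 && p.y < b.y1)
                b.prims.push_back(&p);
}

// Placeholder for dicing and shading: it only counts the bucket's primitives.
void renderBucket(Bucket& b)
{
    b.shaded = static_cast<int>(b.prims.size());
}

// Parallel phase: each iteration touches one bucket only, so no locking is needed.
void renderAll(std::vector<Bucket>& buckets)
{
    const int bucketsN = static_cast<int>(buckets.size());
    #pragma omp parallel for
    for (int bi = 0; bi < bucketsN; ++bi)
        renderBucket(buckets[bi]);
}
```

Call assignPrims(buckets, prims) first, then renderAll(buckets). Without OpenMP enabled in the compiler the pragma is ignored and the loop simply runs serially, with the same results.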

There is probably still an unhealthy amount of extra work that every bucket does while shading samples that belong to its neighbors, but so far it's nice to know that I have a system that will automatically scale to a larger number of cores. Now I can't wait for hardware makers to come up with 16, 32, 64 cores!

Another thing that I did was remove some abstraction. I started writing code based on the suggestions in Production Rendering, but as I understand more and diverge from the book's approach, abstract interface classes only complicate refactoring and code navigation (even Visual Assist X has problems finding references to a specific implementation of a virtual function call).

Anyhow, now I have to decide the next step: implement light shaders, try to render more complex scenes, or tackle micro-polygon sampling (and ditch the z-buffer for order-independent translucency).

There is also the matter of speeding up shading, which is currently SIMD-ready but doesn't yet use any sort of hardware SIMD. For that I could try SSE2 or the publicly available Larrabee Prototype Library, which tries to be fast by using SSE2 but probably ends up slower than using SSE2 directly.
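
As a rough idea of what using the hardware directly could look like, here is a sketch of one multiply-add over an array of samples, four floats at a time. mulAdd and SKETCH_USE_SSE are made-up names, not RibRender code. The float intrinsics used here belong to the original SSE set, which any SSE2-capable CPU also supports, and the function falls back to a plain scalar loop when the compiler does not target SSE2.

```cpp
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64)
  #include <emmintrin.h>   // SSE2 header; also brings in the SSE float intrinsics
  #define SKETCH_USE_SSE 1
#else
  #define SKETCH_USE_SSE 0
#endif

// Hypothetical shading step: out[i] = a[i] * b[i] + c[i] for n samples.
void mulAdd(const float* a, const float* b, const float* c,
            float* out, std::size_t n)
{
    std::size_t i = 0;

#if SKETCH_USE_SSE
    // Process 4 samples per iteration; unaligned loads and stores keep the
    // sketch safe for any pointer, at some cost in speed.
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);
        const __m128 vb = _mm_loadu_ps(b + i);
        const __m128 vc = _mm_loadu_ps(c + i);
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_mul_ps(va, vb), vc));
    }
#endif

    // Scalar loop: handles the leftover samples, or everything without SSE.
    for (; i < n; ++i)
        out[i] = a[i] * b[i] + c[i];
}
```

With the data already laid out as flat arrays of floats, the SIMD path is a drop-in replacement for the scalar loop, which is roughly what "SIMD-ready" buys.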
I could also look into OpenCL, provided that I can get the alpha from NVidia, which would require me to ask as a registered developer.. but then it wouldn't be for me to play with at home ;) not to mention that my current PC has an ATI card 8)

There are also the DirectX Compute Shaders, but the samples run dead slow on my card and I usually prefer to stay away from Microsoft's APIs anyway.. using DirectX to do a software renderer somehow doesn't sound right 8)

Generally, I'd rather stay on the CPU, at least until I have a clear idea of what exactly a RenderMan shader needs to do.