Saturday, May 2, 2009

#pragma omp

I have reorganized RibRender to go multi-threaded, using OpenMP.

Once data conflicts were out of the way, with the bucket system in place (buckets are screen areas of work), all it took to get a 2.5x performance improvement on a 4-core machine was a well-placed #pragma omp parallel for.

Something like this:

#pragma omp parallel for
for (int bi=0; bi < bucketsN; ++bi)
    renderBucket_s( mHider, mHider.mpBuckets[ bi ] );

It was important however to make sure that primitives are assigned to buckets outside the loop.
This means that the simplify and split phases are outside the OpenMP-threaded loop.
Simplify and split could be multi-threaded themselves, but so far the results show that the actual bucket rendering (currently only dicing and shading) is the slowest by far.
We'll have to see how that holds up later with complex scenes.. instead of basic test files.
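
The split between a serial assignment phase and a parallel rendering phase could look roughly like the sketch below. This is not RibRender's actual code: Prim, Bucket, assignPrims and renderBuckets are hypothetical names, the "rendering" is a placeholder, and the schedule(dynamic) clause is an assumption on my part (it can help when buckets have uneven cost, but the loop works without it).

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the renderer's types (not RibRender's real classes).
struct Prim   { float x, y; };   // a primitive reduced to a screen position
struct Bucket
{
    int x0, y0, x1, y1;              // screen area covered, [x0,x1) x [y0,y1)
    std::vector<const Prim*> prims;  // primitives assigned to this bucket
    int shaded;                      // placeholder for the bucket's output
};

// Serial phase: assign every primitive to the bucket that contains it.
// This appends to bucket lists that any primitive may hit, so it stays
// outside the parallel loop.
void assignPrims(std::vector<Bucket>& buckets, const std::vector<Prim>& prims)
{
    for (const Prim& p : prims)
    {
        for (Bucket& b : buckets)
        {
            assert(b.x0 < b.x1 && b.y0 < b.y1);  // sanity check on the bucket extents
            if (p.x >= b.x0 && p.x < b.x1 && p.y >= b.y0 && p.y < b.y1)
            {
                b.prims.push_back(&p);
                break;  // primitives outside all buckets are simply skipped
            }
        }
    }
}

// Parallel phase: each iteration reads and writes only its own bucket,
// so no locking is needed inside the loop.
void renderBuckets(std::vector<Bucket>& buckets)
{
    const int bucketsN = static_cast<int>(buckets.size());

    #pragma omp parallel for schedule(dynamic)
    for (int bi = 0; bi < bucketsN; ++bi)
    {
        Bucket& b = buckets[bi];
        b.shaded = static_cast<int>(b.prims.size());  // stands in for dicing and shading
    }
}
```

Calling assignPrims() once and then renderBuckets() gives the same result with or without OpenMP enabled, since without it the pragma is simply ignored by the compiler.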

There is still probably an unhealthy amount of redundant work, since every bucket also ends up shading samples that belong to its neighbors.. but so far it's nice to know that I have a system that will automatically scale to a larger number of cores.. now I can't wait for hardware makers to come up with 16, 32, 64 cores!

Another thing that I did was remove some abstraction. I started writing code based on Production Rendering's suggestions... but as I'm starting to understand more and diverge from the book's approach, abstract interface classes only complicate refactoring and code navigation (even Visual Assist X has problems finding references to a specific implementation of a virtual function call).

Anyhow, now I have to decide the next step: implement light shaders, try to render more complex scenes or tackle the micro-polygon sampling (and ditch z-buffer for order-independent translucency).

There is also the matter of speeding up shading.. which is currently SIMD-ready but doesn't yet use any sort of hardware SIMD... for that I could try SSE2 or the publicly available Larrabee Prototype Library.. which tries to be fast using SSE2, but probably ends up slower than using SSE2 directly.
I could also try and see about OpenCL.. provided that I can get the alpha from NVidia, which would require me to ask as a registered developer.. but then it wouldn't be for me to play at home.. ;) not to mention that my current PC has an ATI card 8)
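
To give an idea of what "SIMD-ready" without hardware SIMD can mean, here is a minimal sketch, assuming the shading code is written against a 4-wide float type whose operators are plain scalar loops for now. Float4 and shadeDiffuse are hypothetical names for illustration, not RibRender's actual code; the idea is that the operator bodies could later be replaced with SSE intrinsics such as _mm_add_ps and _mm_mul_ps without touching the code that uses them.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 4-wide float type: four samples processed together.
// The scalar loops below are the part that SSE intrinsics
// (e.g. _mm_add_ps / _mm_mul_ps on __m128) could replace later.
struct Float4
{
    float v[4];
};

inline Float4 operator+(const Float4& a, const Float4& b)
{
    Float4 r;
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
}

inline Float4 operator*(const Float4& a, const Float4& b)
{
    Float4 r;
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] * b.v[i];
    return r;
}

// Toy "shader": out = kd * ndotl + ka, evaluated four samples at a time.
// The caller is expected to size 'out' to match 'ndotl'.
void shadeDiffuse(const std::vector<Float4>& ndotl,
                  const Float4& kd,
                  const Float4& ka,
                  std::vector<Float4>& out)
{
    assert(out.size() == ndotl.size());
    for (std::size_t i = 0; i < ndotl.size(); ++i)
        out[i] = kd * ndotl[i] + ka;
}
```

The shader code only ever sees Float4, so switching the backend (scalar, SSE, or something else) stays a local change.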

There are also DirectX Compute Shaders.. but the samples run dead slow on my card and I also usually prefer to stay away from Microsoft's APIs.. using DirectX to do a software renderer somehow doesn't sound right 8)

Generally, I'd rather stay on the CPU.. at least until I have a clear idea of what exactly a RenderMan shader needs to do.



  1. Just a bit longer before Larrabee. I'd say just code it for that and to hell with Nvidia and ATI

  2. I couldn't agree more, switch to LRBni and forget old GPUs.

  3. woooo

    To me it's more like "to hell with everything". I want no crippling limitations from any external hardware.
    Larrabee sounds good on paper, but the instruction set prototype doesn't tell the whole story.. I assume it's still a separate piece of hardware and it will require data transfer, resource management through the Windows Driver Model.. etc.

    At work I have the privilege to play with some unreleased or even nonexistent hardware, but at home I have to settle for what's on the market.. if I didn't, it would be work and then I couldn't be talking about what I'm doing 8)

    It's also important to consider the installed base. Although I'm not trying to sell anything, I still prefer to be sharing something that will run efficiently on many machines.
    Even after Larrabee comes out, it's not like NVidia and ATI are going to disappear.. so, at a practical level, much GPGPU usage is going to have to be through OpenCL or DX Compute Shaders.

    ..but if any HW maker wants to sponsor my efforts, then I'll be happy to favor a proprietary implementation 8)