Monday, March 2, 2009

Speeding up Ambient Occlusion with voxel clustering and OpenMP

Saturday I went to work for a few hours.
I wanted to finish the implementation of some major acceleration for ambient occlusion calculation.

As I mentioned before (the post below !), I'm using an hemicube rendering approach. Doing it with software rendering but taking a lot of shortcuts. For example, when rendering from the point of view of a polygon, all I care about is the visibility area (multiplied by the proper form factors).. basically how much of the "sky" or background color a polygon gets to see, and that background is really just all white. I don't really care exactly which polygon is occluding which, because they are all black (non emissive) anyway.
Because of this, I can avoid using z-buffer. It's also pretty safe to just reject polygons that would otherwise require near plane clipping (rendering from the center of a polygon, that polygon itself will be automatically discarded, which is OK because self-occlusion seems like a bad idea).

Anyhow, the biggest optimization was to settle for local occlusion (how AO is meant to be anyway. See the nice article about AO from Wikipedia.).
This works because ambient occlusion is really just meant as a quick way to get a darkening around edges. Something that is almost always true even when lights are moving in real-time (not if the object itself deforms though).

I now take a poly's bounding box, double its size in all directions and see which other polygons are in that range, then I render only those.
Quality-wise, this seems to work pretty well and it's so much faster than rendering a whole scene for every polygon.

To select neighboring polygons I built a voxel structure where each cell has a list of polygons that touch that cell.
I work directly in "world-space".. this makes this simpler when dealing with models composed of many different sub-models that have their own local transformations.
The voxel class has some sort of iterator which takes a bounding box in input and returns a list of all cells touched by that bounding box.

Something like this:

bbox = poly.GetBBox();
fatBBox = bbox.ExpandScale( 2.0f );

Voxel::Iterator it( &voxel, fatBBox );
Voxel::Cell *pCell;
while ( pCell = it.GetNextCell() )
    // render all polys in pCell

I also used again OpenMP very successfully. After a few missteps, things went OK and I got a x3.5 speed boost on 4 cores !!!
I still very much like OpenMP.. when it comes to embarrassingly parallel algorithms, it makes it so easy to drop-in multi-threading without having to split code into tasks, etc etc.
Nevertheless, I always try to keep the parallel loops as minimal as possible. Sometimes I wrap most of the complex code into a routine and then parallelize the for-loop that calls that routine.

Something along these lines worked for me..

int maxThreads = omp_get_max_threads();
RendContext *pRendCtxList = new RendContext[ maxThreads ];

#pragma omp parallel for
for (int i=0; i < polysN; ++i)
     int threadID = omp_get_thread_num();
             pRendCtxList[ threadID ],
             voxel );

delete [] pRendCtxList;

..stuff like where the renderer writes (the "render context" in this case) needs to be allocated individually for every possible thread, or else threads will be overriding each other's work !
Notice that allocation only needs to be done outside the loop.. a small complication for a potentially very ugly performance bottleneck.

There is still plenty of room for optimization, but so far, a fairly complex model (model meant for movie production) gets his vertices colored with ambient occlusion in about 2.5 minutes.
A coworker did a version of the occlusion calculation using Monte-Carlo integration (a cool way to say: "shooting rays in statistically sounds random directions") and a kd-tree structure.. and that baking takes 1 hour or more.
My AO was also clocking in that region before I implemented the voxel structure speedup.

In the end however, I'm just happy that I get to play with voxel acceleration structures and with software rendering.
The whole thing could be implemented using Direct3D or OpenGL.. but I don't like the restriction that those pose from the start.. I just want to write code for a CPU first, no extra complications, no APIs or shaders interface hoops.



  1. It's just a simple structure that keeps track of the indices into the 3D voxel.

    otherwise i'd have to use:

    for (x=xmin; x < xmax; ++x)
    ..for (y=ymin; y < ymax; ++y)
    ....for (z=zmin; z < zmax; ++z)
    ......getCell( x, y, z);

    ..and to 2 integer muls per acces I'd have to keep track of the index like

    int index = getCellIndex( xmin, ymin, zmin );
    for (x=xmin; x < xmax; ++x)
    ..for (y=ymin; y < ymax; ++y)
    ....for (z=zmin; z < zmax; ++z)
    ......getCell( index );
    ......index += 1;
    ....index += mody;
    ..index += modx;

    ..lots of work that I instead wrote once in form of Iterator (it's just a subclass) and that hopefully isn't that slow 8)

    (painful to write code in the comments !)

  2. Ah ok, I thought it was some sort of stl iterator. It looks funny!

  3. the concept is the same... only that in the case of an stl vector it's more ugly than is worth (the iterator is a pointer being incremented in 1 dimension.. no strides of sort (though i think the stl vector iterator can probably cope with itema being erased in the loop (?))

  4. HHmmm dunno, dont use stl too much, erasing items in the loop sounds inefficent since you would have to do some sort of compacting after the item was removed?

  5. yeah.. last time I did that it was terrible.
    In a loop that converted malformed quads to triangles.. it was taking 55 seconds for some large object. Once i removed the erase() inside the loop, the process wasn't even taking a second 8P

  6. This comment has been removed by a blog administrator.