Saturday, May 9, 2009

Simdy Teapot Time

It's teapot time again..

As some may have noticed, this blog also shows the RSS feed of the RibTools project.. commit comments say little but are certainly more frequent than blog posts.
Twitter also seems to me a lot like version control commit comments. I sort of use it that way, and it makes sense. I also like the fact that, unlike instant messengers, it doesn't have to be instantaneous, so it's not as distracting (though it's public by default 8)

Anyway, as of commit #123, RibRender supports Larrabee native instructions using the public native instructions prototype include file.
I use the LRB native instructions (LRBni) very much like SSE instructions.. in fact they are often interchangeable !
The key to a simple port was first writing this VecN class template that defines a statically-sized vector.
When compiling for SSE, VecN has a template specialization for size 4 (4 floats) and when compiling for LRBni, VecN has a specialization for 16 floats.
The math library also defines a VecSIMDf to be either 4-floats or 16-floats depending on what is "native".
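
To make the idea concrete, here is a minimal sketch of a statically-sized vector template with a 4-float SSE specialization. It is not the actual RibTools code: the member names (`v`, `splat`) and the choice of operators are assumptions for illustration, and the 16-float case simply falls back to the generic scalar loop because the LRBni prototype header is not used here.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__)
#include <xmmintrin.h>  // SSE intrinsics (__m128, _mm_add_ps, ...)
#endif

// Generic version: N floats processed one at a time (hypothetical layout).
template <std::size_t N>
struct VecN {
    float v[N];

    // Fill every lane with the same value.
    static VecN splat(float x) {
        VecN r;
        for (std::size_t i = 0; i < N; ++i) r.v[i] = x;
        return r;
    }
    float operator[](std::size_t i) const { assert(i < N); return v[i]; }

    friend VecN operator+(const VecN& a, const VecN& b) {
        VecN r;
        for (std::size_t i = 0; i < N; ++i) r.v[i] = a.v[i] + b.v[i];
        return r;
    }
    friend VecN operator*(const VecN& a, const VecN& b) {
        VecN r;
        for (std::size_t i = 0; i < N; ++i) r.v[i] = a.v[i] * b.v[i];
        return r;
    }
};

#if defined(__SSE__)
// Specialization for 4 floats, backed by one 128-bit SSE register.
template <>
struct VecN<4> {
    __m128 v;

    static VecN splat(float x) { VecN r; r.v = _mm_set1_ps(x); return r; }
    float operator[](std::size_t i) const {
        assert(i < 4);
        float tmp[4];
        _mm_storeu_ps(tmp, v);  // copy the register out to read a single lane
        return tmp[i];
    }

    friend VecN operator+(const VecN& a, const VecN& b) {
        VecN r; r.v = _mm_add_ps(a.v, b.v); return r;
    }
    friend VecN operator*(const VecN& a, const VecN& b) {
        VecN r; r.v = _mm_mul_ps(a.v, b.v); return r;
    }
};
#endif

// The "native" width for this build; 4 is assumed here (the SSE case).
typedef VecN<4> VecSIMDf;
```

Code written against `VecSIMDf` only uses `splat`, `+`, `*` and `[]`, so it compiles unchanged whether the specialization or the generic version is picked.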

The renderer/shader portion of the code then will simply use VecSIMDf, and process data in chunks of appropriate size.
A VecSIMDf is actually seen as a scalar, and a Vec3xSIMDf is what one would consider a 3D vector.
In fact, the shader has its types defined as such:

typedef VecSIMDf SlScalar;
typedef Vec3xSIMDf SlColor;
typedef Vec2xSIMDf SlVec2;
typedef Vec3xSIMDf SlVec3;
typedef Vec4xSIMDf SlVec4;
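
The "SIMD value as a scalar" view can be sketched as follows. This is only an illustration under assumptions: the width is fixed at 4, `VecSIMDf` is a plain float array rather than a register type, and the `lane` and `dot` helpers are hypothetical names that do not come from RibTools. The point is that one `SlVec3` holds the x, y and z components of several shading samples at once, and a dot product yields one `SlScalar` with a result per sample.

```cpp
#include <cassert>
#include <cstddef>

const std::size_t kSimdWidth = 4;  // assumed width for this sketch

// One float per shading sample; treated as a "scalar" by the shader.
struct VecSIMDf {
    float v[kSimdWidth];
};

inline VecSIMDf operator+(const VecSIMDf& a, const VecSIMDf& b) {
    VecSIMDf r;
    for (std::size_t i = 0; i < kSimdWidth; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
}

inline VecSIMDf operator*(const VecSIMDf& a, const VecSIMDf& b) {
    VecSIMDf r;
    for (std::size_t i = 0; i < kSimdWidth; ++i) r.v[i] = a.v[i] * b.v[i];
    return r;
}

// Read the value belonging to one sample (hypothetical helper).
inline float lane(const VecSIMDf& a, std::size_t i) {
    assert(i < kSimdWidth);
    return a.v[i];
}

// Structure-of-arrays 3D vector: all x's together, all y's, all z's.
struct Vec3xSIMDf {
    VecSIMDf x, y, z;
};

typedef VecSIMDf   SlScalar;
typedef Vec3xSIMDf SlVec3;

// Dot product of kSimdWidth vector pairs at once; one result per sample.
inline SlScalar dot(const SlVec3& a, const SlVec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}
```

With this layout the shader code never shuffles data inside a register: every operation works on the same component of different samples.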

Right now I can just change a define and decide to compile either plain C++ (though the compiler may add SSE instructions of its own), SSE or LRBni.
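
A switch of that kind could look roughly like the sketch below. The macro names (`SIMD_MODE`, `SIMD_MODE_SSE`, ...) and the helper functions are made up for this example, since the post does not name the real define; only the widths (1, 4 and 16 floats) come from the text above. The inner loop is plain scalar code standing in for whatever the selected backend would do per chunk.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical build switches; the actual define used by RibTools is not shown.
#define SIMD_MODE_PLAIN 0
#define SIMD_MODE_SSE   1
#define SIMD_MODE_LRBNI 2

#ifndef SIMD_MODE
#define SIMD_MODE SIMD_MODE_SSE  // default assumed for this sketch
#endif

#if SIMD_MODE == SIMD_MODE_LRBNI
const std::size_t kSimdWidth = 16;  // 512-bit register: 16 floats
#elif SIMD_MODE == SIMD_MODE_SSE
const std::size_t kSimdWidth = 4;   // 128-bit register: 4 floats
#else
const std::size_t kSimdWidth = 1;   // plain float
#endif

// Number of SIMD-sized chunks needed to cover `count` samples, rounding up.
inline std::size_t simd_blocks(std::size_t count) {
    return (count + kSimdWidth - 1) / kSimdWidth;
}

// Walk the data one chunk at a time; the last chunk may be partial.
inline void scale_in_chunks(float* data, std::size_t count, float k) {
    assert(data != nullptr || count == 0);
    for (std::size_t base = 0; base < count; base += kSimdWidth) {
        for (std::size_t i = 0; i < kSimdWidth && base + i < count; ++i) {
            data[base + i] *= k;
        }
    }
}
```

Building with a different `SIMD_MODE` value changes only the chunk size; the calling code stays the same.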

I did a few tests and, unsurprisingly, SSE is the fastest by far.
The test is rendering a teapot at full screen (1600x1200 minus my right side task bar), and here are the results.

Test Type      Seconds   VecSIMDf Format
NO-SSE 1       8.51      plain float v[1]
NO-SSE 4       7.735     plain float v[4]
NO-SSE 16      11.845    plain float v[16]
SSE            1.355     128 bit register
LRBni-SSE      21.705    512 bit reg (simulated with SSE)
LRBni-C code   25.045    512 bit reg (simulated with plain C++)

..ummummm still more places can be optimized. But for now I should switch to work on light shaders.


  1. So for SSE, where are the bottlenecks?

    Is it the math or the memory latency?

  2. Not sure. I haven't run it through a proper profiler yet.
    But even then, profilers don't really tell much about that overall.. being all on CPU there is no concept of uploading data anywhere, so it's harder to track.
    I suppose I could look at an overall figure of stalls due to accesses to memory that is not in cache (I would still need some costly profiler like VTune.. I have it in the office, but for now I'd rather not mix work with hobby too much.. 8)

    For one thing, lighting functions are not vectorized. I didn't bother to SIMDfy lighting because I need to change a few things later on, namely removing some fixed lighting pipeline code in favor of actual light shaders.


  3. ..actually, it seems that diffuse() and ambient() shader calls are vectorized indeed (forgot !)

    ..but I still need to implement light shaders.