Saturday, May 9, 2009

Simdy Teapot Time

It's teapot time again..

As some may have noticed, this blog also shows the RSS feed of the RibTools project.. commit comments say little but are certainly more frequent than blog posts.
Twitter also seems to me a lot like version control commit comments. I sort of use it that way, and it makes sense. I also like the fact that, unlike instant messengers, it doesn't have to be instantaneous, so it's not as distracting (though it's public by default 8)

Anyway, as of commit #123, RibRender supports Larrabee native instructions using the public native instructions prototype include file.
I use the LRB native instructions (LRBni) very much like SSE instructions.. in fact they are often interchangeable!
The key to a simple port was first writing this VecN class template that defines a statically-sized vector.
When compiling for SSE, VecN has a template specialization for size 4 (4 floats) and when compiling for LRBni, VecN has a specialization for 16 floats.
The math library also defines a VecSIMDf to be either 4-floats or 16-floats depending on what is "native".
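
To make the idea concrete, here is a minimal sketch of what such a template could look like. It is not the actual RibTools code: the `USE_SSE` define, the member names and the set of operators are assumptions made for illustration, and an LRBni build would add a similar specialization for 16 floats.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical switch; uncomment to build the SSE specialization.
// #define USE_SSE

#if defined(USE_SSE)
#include <xmmintrin.h>
#endif

// Generic statically-sized vector: plain C++ fallback for any N.
template <std::size_t N>
struct VecN {
    float v[N];

    VecN() { for (std::size_t i = 0; i < N; ++i) v[i] = 0.0f; }
    explicit VecN(float s) { for (std::size_t i = 0; i < N; ++i) v[i] = s; }

    VecN operator+(const VecN& r) const {
        VecN o;
        for (std::size_t i = 0; i < N; ++i) o.v[i] = v[i] + r.v[i];
        return o;
    }
    VecN operator*(const VecN& r) const {
        VecN o;
        for (std::size_t i = 0; i < N; ++i) o.v[i] = v[i] * r.v[i];
        return o;
    }
    float operator[](std::size_t i) const { assert(i < N); return v[i]; }
};

#if defined(USE_SSE)
// Specialization for 4 floats held in a single 128 bit SSE register.
template <>
struct VecN<4> {
    __m128 v;

    VecN() : v(_mm_setzero_ps()) {}
    explicit VecN(float s) : v(_mm_set1_ps(s)) {}
    explicit VecN(__m128 r) : v(r) {}

    VecN operator+(const VecN& r) const { return VecN(_mm_add_ps(v, r.v)); }
    VecN operator*(const VecN& r) const { return VecN(_mm_mul_ps(v, r.v)); }
    float operator[](std::size_t i) const {
        assert(i < 4);
        float tmp[4];
        _mm_storeu_ps(tmp, v);  // spill the register to read one lane
        return tmp[i];
    }
};
#endif

// The "native" width; the same typedef picks the generic or the SSE version.
typedef VecN<4> VecSIMDf;
```

Code written against `VecSIMDf` (for example `VecSIMDf c = a * b + a;`) compiles unchanged whether or not the specialization is enabled, which is what makes swapping instruction sets cheap.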

The renderer/shader portion of the code then will simply use VecSIMDf, and process data in chunks of appropriate size.
A VecSIMDf is actually seen as a scalar, and a Vec3xSIMDf is what one would consider a 3D vector.
In fact, the shader has its types defined as such:

typedef VecSIMDf SlScalar;
typedef Vec3xSIMDf SlColor;
typedef Vec2xSIMDf SlVec2;
typedef Vec3xSIMDf SlVec3;
typedef Vec4xSIMDf SlVec4;
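
As a rough illustration of why a VecSIMDf counts as a "scalar", here is a small sketch of a dot product over these types. The `VecN` stand-in, the `Vec3xSIMDf` layout (one x, y and z per lane) and the `dot` function are assumptions for the example, not the real RibTools definitions.

```cpp
#include <cassert>
#include <cstddef>

// Minimal stand-in for the math library's SIMD "scalar": N floats at once.
template <std::size_t N>
struct VecN {
    float v[N];

    explicit VecN(float s = 0.0f) { for (std::size_t i = 0; i < N; ++i) v[i] = s; }

    VecN operator+(const VecN& r) const {
        VecN o;
        for (std::size_t i = 0; i < N; ++i) o.v[i] = v[i] + r.v[i];
        return o;
    }
    VecN operator*(const VecN& r) const {
        VecN o;
        for (std::size_t i = 0; i < N; ++i) o.v[i] = v[i] * r.v[i];
        return o;
    }
    float operator[](std::size_t i) const { assert(i < N); return v[i]; }
};

typedef VecN<4> VecSIMDf;  // assumption: 4 lanes, as in the SSE build

// A "3D vector" is really N 3D vectors: each component is a whole VecSIMDf.
struct Vec3xSIMDf {
    VecSIMDf x, y, z;
    Vec3xSIMDf(const VecSIMDf& x_, const VecSIMDf& y_, const VecSIMDf& z_)
        : x(x_), y(y_), z(z_) {}
};

typedef VecSIMDf   SlScalar;
typedef Vec3xSIMDf SlVec3;

// Dot product of N vector pairs at once; the result is one scalar per lane.
inline SlScalar dot(const SlVec3& a, const SlVec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}
```

The shader code reads like ordinary scalar math, while each operation actually works on 4 (or 16) samples at a time.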

Right now I can just change a define and decide to compile either plain C++ (though the compiler may add SSE instructions of its own), SSE or LRBni.

I did a few tests and, unsurprisingly, SSE is the fastest by far.
The test is rendering a teapot at full screen (1600x1200 minus my right side task bar), and here are the results.

Test Type       Seconds   VecSIMDf Format
NO-SSE 1        8.51      plain float v[1]
NO-SSE 4        7.735     plain float v[4]
NO-SSE 16       11.845    plain float v[16]
SSE             1.355     128 bit register
LRBni-SSE       21.705    512 bit reg (simulated with SSE)
LRBni-C code    25.045    512 bit reg (simulated with plain C++)

..ummummm still more places can be optimized. But for now I should switch to work on light shaders.