Wednesday, May 6, 2009

BS.. BSAO ??

I'm not sure how to call this.. let's say Blocked SOA, or BSOA ?

SOA stands for Structure Of Arrays:

struct SOAVertices
float *mpX, *mpY, *mpZ;
SOAVertices SOAVerts;
SOAVerts.mpX = new float[ n ];
SOAVerts.mpY = new float[ n ];
SOAVerts.mpZ = new float[ n ];

..and AOS stands for Array Of Structures:

struct Vertex
float x, y, z;
Vertex *pAOSVerts = new Vertex [ n ];

Neither one worked for me.
AOS is the most intuitive. One has a Vertex structure and then makes an array out of it.
The problem is that vertices are then packed in a way that it is difficult to get into a SIMD register (fast loads are only possible when data is properly aligned).
Using 128 bit SIMD registers from the SSE instruction set, one could force the vertex structure size to be 128 bit (or 4 floats) by adding a dummy w item.
This is wasteful, but still doable. Thinking more into the future, 16 or 32-way SIMD systems would end up requiring more padding than actual data !

Enter SOA. With SOA, a vertex is divided into 3 arrays (in this case). Now, when loading, one will load a chunk of xs, a chunk of ys and a chunk of zs:

load r0, SOAVerts.mpX[ fromIdx ]
load r1, SOAVerts.mpY[ fromIdx ]
load r2, SOAVerts.mpZ[ fromIdx ]

..on a 16-way SIMD like for Larrabee, this will load a total of 16 vertices, spread into 3 registers with only 3 fast (aligned) loads.
Of course, this is only useful if one has 16 or more vertices all grouped together.. which is usually the case !

Padding is still necessary, but minimal. The memory area needs to be multiple of 16 floats and, of course, properly aligned.

size_t nPadded = (n + 15) / 16;
size_t streamSize = nPadded * sizeof(float);
size_t alignBytes = 16 * sizeof(float);

Vertices SOAVerts;

SOAVerts.mpX =
  (float *)_aligned_malloc( streamSize, alignBytes );

SOAVerts.mpY =
  (float *)_aligned_malloc( streamSize, alignBytes );

SOAVerts.mpZ =
  (float *)_aligned_malloc( streamSize, alignBytes );

One could overload new [] and delete [] as well.. can be messy, but works for me 8)

Now, I have a couple of problems with this. The first is that one ends up with 3 pointers to carry around.
So, this works better:

float *p = 
(float *)_aligned_malloc( streamSize * 3alignBytes );

SOAVerts.mpX = p + 
nPadded * 0;
SOAVerts.mpY = p + nPadded * 1;
SOAVerts.mpZ = p + nPadded * 2;, one only needs the first pointer "p" and "nPadded" to rebuild the pointers.

This is however still more complicated than it should be and also potentially slower, because xs, ys and zs are located far away in memory, which could be problematic for cache coherence.

My solution is to group, or swizzle, chunk of 16 vertices together:

struct BSOAVertices
float mX[16];
    float mY[16];
    float mZ[16];

size_t blocksN = (n + sizeof(BSOAVertices)-1) / sizeof(BSOAVertices);

BSOAVertices *pBSOAVerts =
    _aligned_malloc( blocksN * sizeof(BSOAVertices)alignBytes );

...with this structure I always need to carry around just one pointer.
If I want to access a vertex singularly, I have to do the following:

size_t blockIdx = vertIdx / 16;  // '/ 16' optimizes as '>> 4'
size_t subIdx   = vertIdx & 15;

x = pBSOAVerts[ blockIdx ].mX[ subIdx ];
float y = pBSOAVerts[ blockIdx ].mY[ subIdx ];
float z = pBSOAVerts[ blockIdx ].mZ[ subIdx ];'s not pretty, but normally one really just wants to work the SIMD way, and load as such:

load r0, pBSOAVerts[ blockIdx ].mX
load r1, pBSOAVerts[ blockIdx ].mY
load r2, pBSOAVerts[ blockIdx ]

In essence, all I'm doing is working with 3x16 matrices !

These are the recent changes I made to RibRender.
The blocked vector classes are in and are used by the shading virtual machine.. but the conversion is not yet complete. Primitives dicing is still not "SIMDyfied".. but luckily I have enough abstraction to transfer all that scalar code to vectors (or vector code to matrices !).

I've been using templates quite heavily for the vector class, and I hope to be including as some point SSE or LRBni SIMD through template specialization.
At this point, I'm not even sure I care to keep around special Vector3 and Vector4 classes, as most of the heavy work is going to be translated to the new layout.. and I'm guessing that 128 bit SIMD doesn't have much of a future.. can't really think in terms of 4-floats or 3-floats+padding anymore.

I'm adding a cool image from a "partial result" during the recent changes 8)

..this post was written using Google Docs. It's really impossible to format text properly, or just to use a Courier font for code with the Blogger's editor (both the new and the old one). Hopefully the compatibility with Google Docs' created HTML is good.. hopefully !



  1. I skipped right to the picture and yes I agree, whatever you did - it was all worth it! :)

  2. Hey! This is *exactly* what I did on my second PS2 engine, all vertex buffers were encoded as 4 vertices mini packets stored as SOAs, because it was much more efficient to shade 4 'transposed' vertices at once :)

  3. Sebastian,
    No shader could ever do that ! 8)

    I see ! ..I guess it's same reason why one couldn't do a single dot product.
    I don't remember storing them that way on PS2, however I remember storing compressed verts using 16 bit fixed point and such.

    But in this case I want to reduce any sort of overhead because those samples are for a grid that needs to be shaded.. and every single op loads and stores samples (ouch ?).


  4. ..BTW formatting is of course broken.. lots of newlines show up in Blogger but aren't in Google Docs.

    Still, better than the evil Blogger editor !

  5. Hmmmm, I forgot exactly how we are doing this on the PS3.

    I think the vertex stream is SOA and vertex attributes get loaded into 128bit registers using 16byte aligned loads, but then get repacked into whatever format they need to be via register to register transfer using the shuffle instruction.

    So for xyz stream something like

    r0 = (x0, y0, z0, x1)
    r1 = (y1, z1, x2, y2)
    r2 = (z2, x3, y3, z3)

    then shuffle: (actual syntax is probably wrong)

    r3 = shuffle(r0, r0, {0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0xA, 0xB, 0x80, 0x80, 0x80, 0x80})

    r4 = shuffle(r0, r1, {0xC, 0xD, 0xE, 0xF, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x80, 0x80, 0x80, 0x80})

    Its been a long time since I looked at that code, but I will double check!!!!

  6. I see.. using the shuffles !

    In the meantime I looked into my PS2 code and I found out a "vftoi12" on the CPU side ..that's how I was compressing normals I guess 8)