*Blocked SOA*, or

*BSOA*?

*SOA*stands for Structure Of Arrays:

**struct**SOAVertices

{

**float***mpX, *mpY, *mpZ;

};

SOAVertices SOAVerts;

SOAVerts.mpX =

SOAVerts.mpY =

SOAVerts.mpZ =

**new float**[ n ];SOAVerts.mpY =

**new float**[ n ];SOAVerts.mpZ =

**new float**[ n ];..and

{

};

Vertex *pAOSVerts =

Neither one worked for me.

AOS is the most intuitive. One has a Vertex structure and then makes an array out of it.

The problem is that vertices are then packed in a way that it is difficult to get into a SIMD register (fast loads are only possible when data is properly aligned).

Using 128 bit SIMD registers from the SSE instruction set, one could force the vertex structure size to be 128 bit (or 4 floats) by adding a dummy

This is wasteful, but still doable. Thinking more into the future, 16 or 32-way SIMD systems would end up requiring more padding than actual data !

Enter SOA. With SOA, a vertex is divided into 3 arrays (in this case). Now, when loading, one will load a chunk of

..on a 16-way SIMD like for Larrabee, this will load a total of 16 vertices, spread into 3 registers with only 3 fast (aligned) loads.

Of course, this is only useful if one has 16 or more vertices all grouped together.. which is usually the case !

*AOS*stands for Array Of Structures:**struct**Vertex{

**float**x, y, z;};

Vertex *pAOSVerts =

**new**Vertex [ n ];Neither one worked for me.

AOS is the most intuitive. One has a Vertex structure and then makes an array out of it.

The problem is that vertices are then packed in a way that it is difficult to get into a SIMD register (fast loads are only possible when data is properly aligned).

Using 128 bit SIMD registers from the SSE instruction set, one could force the vertex structure size to be 128 bit (or 4 floats) by adding a dummy

*w*item.This is wasteful, but still doable. Thinking more into the future, 16 or 32-way SIMD systems would end up requiring more padding than actual data !

Enter SOA. With SOA, a vertex is divided into 3 arrays (in this case). Now, when loading, one will load a chunk of

**s, a chunk of***x***s and a chunk of***y***s:***z***load**r0, SOAVerts.mpX[ fromIdx ]**load**r1, SOAVerts.mpY[ fromIdx ]**load**r2, SOAVerts.mpZ[ fromIdx ]..on a 16-way SIMD like for Larrabee, this will load a total of 16 vertices, spread into 3 registers with only 3 fast (aligned) loads.

Of course, this is only useful if one has 16 or more vertices all grouped together.. which is usually the case !

Padding is still necessary, but minimal. The memory area needs to be multiple of 16 floats and, of course, properly aligned.

**size_t**nPadded = (n + 15) / 16;

**size_t**streamSize = nPadded * sizeof(

**float**);

**size_t**alignBytes = 16 * sizeof(

**float**);

Vertices SOAVerts;

SOAVerts.mpX =

(

**float***)_aligned_malloc( streamSize, alignBytes );SOAVerts.mpY =

(

**float***)_aligned_malloc( streamSize, alignBytes );SOAVerts.mpZ =

(

**float***)_aligned_malloc( streamSize, alignBytes );One could overload new [] and delete [] as well.. can be messy, but works for me 8)

Now, I have a couple of problems with this. The first is that one ends up with 3 pointers to carry around.

So, this works better:

float *p = (float *)_aligned_malloc( streamSize

SOAVerts.mpX = p + nPadded * 0;

float *p = (float *)_aligned_malloc( streamSize

*** 3**, alignBytes );SOAVerts.mpX = p + nPadded * 0;

SOAVerts.mpY = p + nPadded * 1;

SOAVerts.mpZ = p + nPadded * 2;

SOAVerts.mpZ = p + nPadded * 2;

This is however still more complicated than it should be and also potentially slower, because

**s,**

*x***s and**

*y***s are located far away in memory, which could be problematic for cache coherence.**

*z*My solution is to group, or swizzle, chunk of 16 vertices together:

**struct**BSOAVertices

{

**float**mX[16];**float**mY[16];

**float**mZ[16];

};

size_t blocksN = (n + sizeof(BSOAVertices)-1) / sizeof(BSOAVertices);

BSOAVertices *pBSOAVerts =

_aligned_malloc( blocksN * sizeof(BSOAVertices), alignBytes );

...with this structure I always need to carry around just one pointer.

If I want to access a vertex singularly, I have to do the following:

If I want to access a vertex singularly, I have to do the following:

**size_t**blockIdx = vertIdx / 16; // '/ 16' optimizes as '>> 4'

**size_t**subIdx = vertIdx & 15;

**x = pBSOAVerts[ blockIdx ].mX[ subIdx ];**

float

float

**float**y = pBSOAVerts[ blockIdx ].mY[ subIdx ];

**float**z = pBSOAVerts[ blockIdx ].mZ[ subIdx ];

**load**r0, pBSOAVerts[ blockIdx ].mX

**load**r1, pBSOAVerts[ blockIdx ].mY

**load**r2, pBSOAVerts[ blockIdx ].mZ

These are the recent changes I made to RibRender.

The blocked vector classes are in and are used by the shading virtual machine.. but the conversion is not yet complete. Primitives dicing is still not "SIMDyfied".. but luckily I have enough abstraction to transfer all that scalar code to vectors (or vector code to matrices !).

I've been using templates quite heavily for the vector class, and I hope to be including as some point SSE or LRBni SIMD through template specialization.

At this point, I'm not even sure I care to keep around special

*Vector3*and*Vector4*classes, as most of the heavy work is going to be translated to the new layout.. and I'm guessing that 128 bit SIMD doesn't have much of a future.. can't really think in terms of*4-floats*or*3-floats+padding*anymore.I'm adding a cool image from a "partial result" during the recent changes 8)

..this post was written using Google Docs. It's really impossible to format text properly, or just to use a Courier font for code with the Blogger's editor (both the new and the old one). Hopefully the compatibility with Google Docs' created HTML is good.. hopefully !

zzzzzzzzzzzzzzzzzzzzzzzz

I skipped right to the picture and yes I agree, whatever you did - it was all worth it! :)

ReplyDeleteHey! This is *exactly* what I did on my second PS2 engine, all vertex buffers were encoded as 4 vertices mini packets stored as SOAs, because it was much more efficient to shade 4 'transposed' vertices at once :)

ReplyDeleteSebastian,

ReplyDeleteNo shader could ever do that ! 8)

Marco,

I see ! ..I guess it's same reason why one couldn't do a single dot product.

I don't remember storing them that way on PS2, however I remember storing compressed verts using 16 bit fixed point and such.

But in this case I want to reduce any sort of overhead because those samples are for a grid that needs to be shaded.. and every single op loads and stores samples (ouch ?).

ummumm

..BTW formatting is of course broken.. lots of newlines show up in Blogger but aren't in Google Docs.

ReplyDeleteStill, better than the evil Blogger editor !

Hmmmm, I forgot exactly how we are doing this on the PS3.

ReplyDeleteI think the vertex stream is SOA and vertex attributes get loaded into 128bit registers using 16byte aligned loads, but then get repacked into whatever format they need to be via register to register transfer using the shuffle instruction.

So for xyz stream something like

load:

r0 = (x0, y0, z0, x1)

r1 = (y1, z1, x2, y2)

r2 = (z2, x3, y3, z3)

then shuffle: (actual syntax is probably wrong)

r3 = shuffle(r0, r0, {0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0xA, 0xB, 0x80, 0x80, 0x80, 0x80})

r4 = shuffle(r0, r1, {0xC, 0xD, 0xE, 0xF, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x80, 0x80, 0x80, 0x80})

Its been a long time since I looked at that code, but I will double check!!!!

I see.. using the shuffles !

ReplyDeleteIn the meantime I looked into my PS2 code and I found out a "vftoi12" on the CPU side ..that's how I was compressing normals I guess 8)