*Blocked SOA*, or *BSOA*?

*SOA* stands for Structure Of Arrays:

```cpp
struct SOAVertices
{
    float *mpX, *mpY, *mpZ;
};

SOAVertices SOAVerts;
SOAVerts.mpX = new float[ n ];
SOAVerts.mpY = new float[ n ];
SOAVerts.mpZ = new float[ n ];
```


*AOS* stands for Array Of Structures:

```cpp
struct Vertex
{
    float x, y, z;
};

Vertex *pAOSVerts = new Vertex[ n ];
```

Neither one worked for me.

AOS is the most intuitive. One has a Vertex structure and then makes an array out of it.

The problem is that vertices are then packed in a way that makes it difficult to get them into a SIMD register (fast loads are only possible when data is properly aligned).

Using 128-bit SIMD registers from the SSE instruction set, one could force the vertex structure size to be 128 bits (or 4 floats) by adding a dummy *w* item. This is wasteful, but still doable. Thinking more into the future, 16 or 32-way SIMD systems would end up requiring more padding than actual data!
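A sketch of that padded layout in modern C++ (`PaddedVertex` and the `alignas` are my own illustration, not code from the post):

```cpp
#include <cstddef>

// AOS vertex padded to 128 bits (4 floats) so that one vertex fits
// exactly in one SSE register; 'w' carries no data, it is pure
// padding, so 25% of the memory (and bandwidth) is wasted.
struct alignas(16) PaddedVertex
{
    float x, y, z;
    float w; // dummy item, padding only
};

static_assert( sizeof(PaddedVertex) == 4 * sizeof(float),
               "vertex must occupy exactly one 128-bit register" );
```

On a hypothetical 16-wide machine the same trick would mean 13 padding floats per vertex, hence the remark above.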

Enter SOA. With SOA, a vertex is divided into 3 arrays (in this case). Now, when loading, one will load a chunk of *x*s, a chunk of *y*s and a chunk of *z*s:

```
load r0, SOAVerts.mpX[ fromIdx ]
load r1, SOAVerts.mpY[ fromIdx ]
load r2, SOAVerts.mpZ[ fromIdx ]
```

..on a 16-way SIMD like for Larrabee, this will load a total of 16 vertices, spread into 3 registers with only 3 fast (aligned) loads.

Of course, this is only useful if one has 16 or more vertices all grouped together.. which is usually the case!
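For a concrete (if only 4-wide) illustration, the same three chunk loads can be written with SSE intrinsics. `loadSOAChunk` is a hypothetical helper of mine, and the struct is redeclared so the snippet stands on its own:

```cpp
#include <xmmintrin.h> // SSE intrinsics
#include <cstddef>

struct SOAVertices
{
    float *mpX, *mpY, *mpZ;
};

// Load 4 consecutive vertices with 3 aligned loads: one register gets
// 4 x's, one gets 4 y's, one gets 4 z's. The streams must be 16-byte
// aligned and fromIdx must be a multiple of 4.
inline void loadSOAChunk( const SOAVertices &v, size_t fromIdx,
                          __m128 &rx, __m128 &ry, __m128 &rz )
{
    rx = _mm_load_ps( v.mpX + fromIdx );
    ry = _mm_load_ps( v.mpY + fromIdx );
    rz = _mm_load_ps( v.mpZ + fromIdx );
}
```

On a 16-wide machine the pattern is identical, just with 16 floats per register instead of 4.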

Padding is still necessary, but minimal. The memory area needs to be a multiple of 16 floats and, of course, properly aligned.

```cpp
size_t nPadded    = ((n + 15) / 16) * 16; // round n up to a multiple of 16
size_t streamSize = nPadded * sizeof(float);
size_t alignBytes = 16 * sizeof(float);

SOAVertices SOAVerts;
SOAVerts.mpX = (float *)_aligned_malloc( streamSize, alignBytes );
SOAVerts.mpY = (float *)_aligned_malloc( streamSize, alignBytes );
SOAVerts.mpZ = (float *)_aligned_malloc( streamSize, alignBytes );
```

One could overload new [] and delete [] as well.. can be messy, but works for me 8)

Now, I have a couple of problems with this. The first is that one ends up with 3 pointers to carry around.

So, this works better:

```cpp
float *p = (float *)_aligned_malloc( streamSize * 3, alignBytes );

SOAVerts.mpX = p + nPadded * 0;
SOAVerts.mpY = p + nPadded * 1;
SOAVerts.mpZ = p + nPadded * 2;
```

This is however still more complicated than it should be, and also potentially slower, because the *x*s, *y*s and *z*s are located far apart in memory, which could be problematic for cache coherence.

My solution is to group, or swizzle, chunks of 16 vertices together:

```cpp
struct BSOAVertices
{
    float mX[16];
    float mY[16];
    float mZ[16];
};

size_t blocksN = (n + 15) / 16; // one block holds 16 vertices

BSOAVertices *pBSOAVerts =
    (BSOAVertices *)_aligned_malloc( blocksN * sizeof(BSOAVertices), alignBytes );
```

...with this structure I always need to carry around just one pointer.

If I want to access a vertex individually, I have to do the following:

```cpp
size_t blockIdx = vertIdx / 16; // '/ 16' optimizes as '>> 4'
size_t subIdx   = vertIdx & 15;

float x = pBSOAVerts[ blockIdx ].mX[ subIdx ];
float y = pBSOAVerts[ blockIdx ].mY[ subIdx ];
float z = pBSOAVerts[ blockIdx ].mZ[ subIdx ];
```

..while the SIMD loads stay just as simple:

```
load r0, pBSOAVerts[ blockIdx ].mX
load r1, pBSOAVerts[ blockIdx ].mY
load r2, pBSOAVerts[ blockIdx ].mZ
```
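As a sketch of the payoff, here is a hypothetical `translateBlock` helper that moves all 16 vertices of one block using 4-wide SSE (the text targets 16-wide LRBni, but the pattern is the same; unaligned loads are used so the snippet stays self-contained):

```cpp
#include <xmmintrin.h> // SSE intrinsics
#include <cstddef>

struct BSOAVertices
{
    float mX[16];
    float mY[16];
    float mZ[16];
};

// Translate every vertex of one 16-vertex block by (tx, ty, tz),
// processing each stream 4 floats at a time. No shuffling or
// swizzling is needed: the data is already laid out per-component.
inline void translateBlock( BSOAVertices &b, float tx, float ty, float tz )
{
    const __m128 vtx = _mm_set1_ps( tx );
    const __m128 vty = _mm_set1_ps( ty );
    const __m128 vtz = _mm_set1_ps( tz );

    for (size_t i = 0; i < 16; i += 4)
    {
        _mm_storeu_ps( b.mX + i, _mm_add_ps( _mm_loadu_ps( b.mX + i ), vtx ) );
        _mm_storeu_ps( b.mY + i, _mm_add_ps( _mm_loadu_ps( b.mY + i ), vty ) );
        _mm_storeu_ps( b.mZ + i, _mm_add_ps( _mm_loadu_ps( b.mZ + i ), vtz ) );
    }
}
```

With a 16-wide instruction set the loop disappears entirely: one load, one add and one store per component.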

These are the recent changes I made to RibRender.

The blocked vector classes are in and are used by the shading virtual machine.. but the conversion is not yet complete. Primitives dicing is still not "SIMDyfied".. but luckily I have enough abstraction to transfer all that scalar code to vectors (or vector code to matrices !).

I've been using templates quite heavily for the vector class, and I hope to be adding SSE or LRBni SIMD support at some point through template specialization.
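A minimal sketch of what such a templated blocked vector might look like (`VecN` and its design are my guess, not RibRender's actual classes): the generic version is a plain loop, and a specialization for a given width could swap in intrinsics without touching calling code:

```cpp
#include <cstddef>

// Generic N-wide vector of T; operations are plain loops that a good
// compiler can auto-vectorize. A specialization for a given width
// (e.g. built on __m128 or LRBni registers) could later replace the
// loops with intrinsics behind the same interface.
template <typename T, size_t N>
struct VecN
{
    T v[N];

    VecN operator +( const VecN &r ) const
    {
        VecN out;
        for (size_t i = 0; i < N; ++i)
            out.v[i] = v[i] + r.v[i];
        return out;
    }
};
```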

At this point, I'm not even sure I care to keep around special *Vector3* and *Vector4* classes, as most of the heavy work is going to be translated to the new layout.. and I'm guessing that 128-bit SIMD doesn't have much of a future.. one can't really think in terms of *4-floats* or *3-floats+padding* anymore.

I'm adding a cool image from a "partial result" during the recent changes 8)

..this post was written using Google Docs. It's really impossible to format text properly, or even just to use a Courier font for code, with Blogger's editor (both the new and the old one). Hopefully the compatibility with the HTML created by Google Docs is good.. hopefully!
