Sunday, June 26, 2011

Optimizing shaders on iPad/iPhone (iOS 4)

UPDATE: The shader compiler coming with iOS 5 seems a lot more reliable. From early tests, some flaky behavior has disappeared (e.g. missing light sources are there again).
What follows is related to iOS 4...
----------------------------------------------------
Over the past 2 days I spent some time optimizing the shaders in The Fractal.

I quite dislike the concept of "vertex shader", so I initially set lighting up to use only pixel (or fragment) shaders.

I eventually scrapped that and moved the lighting to the vertex shader level.. which gave an incredible speedup, especially on 1st gen iPad, but also on iPhone 4.
In fact, the WWDC 2011 talks on OpenGL ES (I won't link them, as you need a developer account to see them anyway) seem very focused on pushing per-pixel lighting on iPad 2, almost implying that you must have been nuts to do that on any earlier device.

Truth be told, I could generally afford per-pixel lighting, because most of the screen is usually covered by terrain with baked lighting. But when a huge monster filled the screen, the frame rate would suddenly drop.

At any rate, before switching to per-vertex lighting, I did some performance tests at the per-pixel level to get a better sense of what is slow and what is fast.
As a foreword, I have to say that conditionals and for-loops with non-constant values in shaders are considered to carry a big performance penalty, and should be avoided as much as possible. But some conditionals help to keep the code (and the coder) sane.. as opposed to generating hundreds of shader combinations for every possible expected case.

The following tests were made on an original iPad, running iOS 4.3.3 (the iOS version determines how a shader gets compiled, as I'm compiling shaders at run-time).
These are not exhaustive tests and it's quite possible that I got something wrong. In fact, normally it is the compiler that is right and the programmer that is wrong.

Looping through a list of light sources

A simple for-loop may seem the way to go:
// *SLOW*
for (int i=0; i < u_PntLightsN; ++i)
{
    [...]
}
..but in fact, it's about twice as slow as the alternative:
// *FAST*
for (int i=0; i < DE3_MAX_PNT_LIGHTS; ++i)
{
    if ( u_PntLights[i].isActive )
    {
        [...]
    }
}
Having an "active" flag per-light, while keeping the loop count fixed (to 5 lights), is faster and more reliable.
More reliable, that's right.. a loop with u_PntLightsN set to 2 would still only pick up 1 light source (?!)
I must be dreaming ? But there is a strange work-around. When declaring the uniform variable, I tried this:

uniform lowp int u_PntLightsN;
..and suddenly I had 2 lights affecting the model again 8)

But the fix can be undone by specifying "lowp" on the iterating variable too:

for (lowp int i=0; i < u_PntLightsN; ++i)
..and magically, it's only 1 light showing up, again.

So, practically, the for-loop with a uniform value, besides being twice as slow, is completely unreliable ...at least in my tests, and in the pixel shader.

Back to the "active" flag. There is an important difference that will kill performance in the shaders.
In C/C++ I prefer the following equivalent form, to avoid extra indenting:
// *SLOW*
if ( ! u_PntLights[i].isActive )
    continue;

// do something
..but that's noticeably slower than the original version:
// *FAST*
if ( u_PntLights[i].isActive )
{
    // do something
}
If one considers how shading engines work, this makes perfect sense. Of course, the compiler could pick up the equivalency of the operation, but it doesn't, and it's good to keep that in mind.

Conditionals that didn't help

This is the final version that I ended up using..
// *FAST* (..relatively)
for (int i=0; i < DE3_MAX_PNT_LIGHTS; ++i)
{
    if ( u_PntLights[i].isActive )
    {
        // NOTE: CS stands for Camera Space
        vec3    posToLightPos = u_PntLights[i].posCS - posCS;
        vec3    lightDirCS = normalize( posToLightPos );

        float    NdotL = dot( norCS, lightDirCS );
        // NOTE: ooRadiusSqr is 1 / radiusSqr
        float    att = 1.0 - clamp( dot( posToLightPos, posToLightPos ) * u_PntLights[i].ooRadiusSqr, 0.0, 1.0 );

        accLightCol += max( NdotL, 0.0 ) * att * u_PntLights[i].col;
    }
}

..but before that I tried this:
// *SLOW*
        float    NdotL = dot( norCS, lightDirCS );

        if ( NdotL > 0.0 )
        {
            float    att =
                        1.0 -
                        clamp(
                            dot( posToLightPos, posToLightPos ) *
                                u_PntLights[i].ooRadiusSqr,
                            0.0,
                            1.0 );

            accLightCol += NdotL * att * u_PntLights[i].col;
        }
The "max( NdotL, 0.0 )" at the end is gone, and "if ( NdotL > 0.0 )" is introduced to skip a bunch of calculations.
This was slower. Not twice as slow, more like 25% slower (I don't remember the exact figure).
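
For what it's worth, the two variants should compute the same thing, and that can be checked on the CPU. What follows is only a rough C++ sketch of the per-light math, not the actual shader or engine code: Vec3, clamp01, lightBranchFree and lightBranched are made-up names for illustration, and the light color is left out so that only the scalar intensity is compared.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Minimal stand-in for GLSL's vec3 (hypothetical, for illustration only)
struct Vec3 { float x, y, z; };

static Vec3  sub( Vec3 a, Vec3 b ) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static float dot( Vec3 a, Vec3 b ) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3  normalize( Vec3 v )
{
    float ooLen = 1.0f / std::sqrt( dot( v, v ) );
    return { v.x * ooLen, v.y * ooLen, v.z * ooLen };
}
static float clamp01( float v ) { return std::min( std::max( v, 0.0f ), 1.0f ); }

// Branch-free form: max() zeroes the contribution of back-facing lights
float lightBranchFree( Vec3 posCS, Vec3 norCS, Vec3 lightPosCS, float ooRadiusSqr )
{
    Vec3  posToLightPos = sub( lightPosCS, posCS );
    float NdotL = dot( norCS, normalize( posToLightPos ) );
    float att   = 1.0f - clamp01( dot( posToLightPos, posToLightPos ) * ooRadiusSqr );
    return std::max( NdotL, 0.0f ) * att;
}

// Branched form: skips the attenuation when the light is behind the surface
float lightBranched( Vec3 posCS, Vec3 norCS, Vec3 lightPosCS, float ooRadiusSqr )
{
    Vec3  posToLightPos = sub( lightPosCS, posCS );
    float NdotL = dot( norCS, normalize( posToLightPos ) );
    if ( NdotL > 0.0f )
    {
        float att = 1.0f - clamp01( dot( posToLightPos, posToLightPos ) * ooRadiusSqr );
        return NdotL * att;
    }
    return 0.0f;
}
```

So the branch buys nothing in terms of results: it only trades a max() for a conditional, which in the tests above was the slower choice.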

Conditionals on LSD

And now, for something spooky. Here are a few variations that will all give strange results:

// *BAD*
if ( u_PntLights[i].isActive == true )
{
}



..breaks the conditional and the second light is gone again !
Note that adding "== true" is theoretically equivalent to leaving it out, but in practice it behaves differently (as in, "it doesn't work" 8).

And this:

// *BAD*
if ( u_PntLights[i].isActive == false )
    continue;



..is, again, theoretically equivalent but it actually gives some funky shader blocking artifacts:

..which kind of reminds me of the early(er) stages of my REYES renderer:

...ahh ..sometimes I miss working on that..  but things aren't so different now, in a sense 8)

wooooo

ADDENDUM: By running the same shaders on iPad and on OpenGL on PC, I have verified that the "if" conditional is indeed not guaranteed to skip the "false" case, even in the vertex shader.
So, for my point lights loop, I must set up the inactive lights' colors to be null (zero).. because
if ( u_PntLights[i].isActive ) ..is only a hint !
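
On the application side, this could look something like the following. It's only a sketch under assumptions, not the actual engine code: the PntLight struct is a guessed CPU-side mirror of one u_PntLights[] element, packPntLights is a made-up helper, and the upload of the values to the shader uniforms is left out entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fixed loop bound, same value as used by the shader loop (5 lights)
const std::size_t DE3_MAX_PNT_LIGHTS = 5;

// Hypothetical CPU-side mirror of one u_PntLights[] element
struct PntLight
{
    float posCS[3];
    float col[3];
    float ooRadiusSqr;
    bool  isActive;
};

// Hypothetical helper: fills a fixed-size array from a variable list of lights.
// Unused slots get isActive = false AND a zero color, so that they contribute
// nothing even if the shader ends up evaluating the "false" case.
void packPntLights( const std::vector<PntLight> &src,
                    PntLight (&out)[DE3_MAX_PNT_LIGHTS] )
{
    for (std::size_t i=0; i < DE3_MAX_PNT_LIGHTS; ++i)
    {
        if ( i < src.size() )
        {
            out[i] = src[i];
            out[i].isActive = true;
        }
        else
        {
            out[i] = PntLight();    // value-initialization zeroes every field
            out[i].isActive = false;
        }
    }
}
```

Lights beyond DE3_MAX_PNT_LIGHTS are simply dropped here. Picking which ones to keep is a separate problem.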