
Need help - afraid of premature optimization in mobile uber shader writing...

Discussion in 'Shaders' started by colin299, Mar 29, 2017.

  1. colin299

    I found 2 PDFs about non-hardware-specific HLSL shader optimization tricks, and I wonder if the following are still true on mobile?

    First read:
    http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/Dark_Secrets_of_shader_Dev-Mojo.pdf

    Summary of its 15 shader optimization tricks:
    1-Use Intrinsic Functions (makes sense)
    2-Properly Use Data Types (seems to contradict #3?)
    3-Reduce Typecasting (seems to contradict #2?)
    4-Avoid Integer Calculations (makes sense)
    5-Use Integers For Indexing (makes sense)
    6-Pack Scalar Constants (makes sense; see the small packing sketch after this list)
    7-Pack Arrays of Constants (not sure)
    8-Properly Declare Constants (not sure)
    9-Vectorize Calculations (makes sense, but not sure)
    10-Vectorize Even More (makes sense)
    11-Vectorize Comparisons (makes sense)
    12-Careful With Matrix Transpose (I've only used square matrices before, so not sure)
    13-Use Swizzles Wisely (cannot understand)
    14-Use 1D Texture Fetches (makes sense)
    15-Use Signed Textures (do signed normal maps even exist on mobile platforms?)
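
    For example, my understanding of #6 (Pack Scalar Constants) is something like this rough sketch (the property names here are just placeholders, not from the PDF):
    Code (CSharp):
    // four separate scalar uniforms can each occupy a constant register:
    float _Speed;
    float _Scale;
    float _Bias;
    float _Strength;

    // packed into a single float4 they can share one register:
    float4 _Params; // x = speed, y = scale, z = bias, w = strength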

    I wonder whether I should apply all these optimizations to my mobile shaders (mostly for OpenGL ES 2/3)?
    ----------------------------------------------------------------------------------
    Second read:
    http://www.humus.name/Articles/Persson_LowLevelThinking.pdf (the difficulty of this PDF is beyond my shader knowledge; I'm also not sure I should apply it directly to mobile, because HLSL/CG is translated to GLSL by Unity. I looked at the translated GLSL code, and it generates very different code for ES2 and ES3.)

    Some optimization tricks mentioned in this PDF:
    -make calculations in MAD form = just 1 assembly-level instruction (confirmed by bgolus), which means
    Code (CSharp):
    x * a + b   // 1 MAD. good
    (x + a) * b // 1 ADD, 1 MUL. bad
    -separate scalar & vector calculations if possible (confirmed by bgolus), which means
    Code (CSharp):
    vector4 * scalar * vector4 * scalar;     // slow
    (vector4 * vector4) * (scalar * scalar); // better
    -abs()/negation on inputs and saturate() on outputs are free (confirmed by bgolus)
    Code (CSharp):
    float c = saturate(abs(a) + abs(b)); // 1 ADD, all abs & saturate are free
    float c = abs(a + b);                // 1 ADD, 1 MOV. bad
    float c = -a * b;                    // 1 MUL, free negate
    float c = -(a * b);                  // 1 MUL, 1 MOV. bad
    float c = saturate(1 - a);           // 1 ADD, free saturate
    float c = 1 - saturate(a);           // 1 MOV, 1 ADD. bad
    -conditional assignment is faster than sign() or step()
    Code (CSharp):
    (x >= 0) ? 1 : -1; // fast
    sign(x);           // so no reason to write this (note sign(0) returns 0, unlike the line above)

    (x > 0) ? 1 : 0;   // fast
    step(0, x);        // so no reason to write this
    Is the above "best practice" still correct when targeting ES2/ES3 (GLSL)?
    ----------------------------------------------------------------------------------
    Currently our fragment shader optimization sticks to the following rules:

    -calculate all UV data in the vertex shader and use it in the fragment shader directly (passing UVs via varyings to avoid dependent texture reads; maybe that is only useful for ES2 PowerVR TBDR GPUs, i.e. iPhone 4S, iPhone 5, iPad 2... only?), even though it ends up using more varyings (we currently use up all of ES2's 8 TEXCOORD varyings to pass UVs / packed scalars, plus a few COLORs; not sure if that will make us shader-interpolator bound...). A vertex-shader sketch of this follows the code below.

    See this for more info about dependent texture reads:
    https://developer.apple.com/library...cticesforShaders/BestPracticesforShaders.html
    Code (CSharp):
    struct v2f {
        half2 uv : TEXCOORDn;             // pass UVs directly
        half4 packed2uvs : TEXCOORDn;     // afraid of dependent texture reads, so avoid packing 2 UVs into 1 varying like this
        half4 packed4Scalars : TEXCOORDn; // for general-purpose scalar packing
    };

    tex2D(_Texture, i.uv);            // good, fastest
    tex2D(_Texture, i.packed2uvs.xy); // afraid of dependent texture read, depends on driver? avoid
    tex2D(_Texture, i.packed2uvs.zw); // afraid of dependent texture read. avoid
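    As a rough sketch of what I mean by doing the UV work in the vertex shader (TRANSFORM_TEX and UnityObjectToClipPos are the standard Unity helpers; _ScrollSpeed and the pos member are just made up for this example):
    Code (CSharp):
    v2f vert (appdata_full v)
    {
        v2f o; // assume v2f also has a float4 pos : SV_POSITION member
        o.pos = UnityObjectToClipPos(v.vertex);
        // tiling/offset applied per-vertex, so the fragment shader samples the varying untouched
        // (sampler2D _MainTex; float4 _MainTex_ST; declared as usual)
        o.uv = TRANSFORM_TEX(v.texcoord, _MainTex);
        // per-frame scalars packed here once instead of being computed per-pixel (_ScrollSpeed is hypothetical)
        o.packed4Scalars = half4(_Time.y * _ScrollSpeed, 0, 0, 0);
        return o;
    }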
    -combine multiple vertex lights' calculations in the vertex shader, just to make lighting cost independent of the number of pixels

    -if alpha testing (the clip function) is needed, execute it as early as possible, i.e. call clip() ASAP
    Code (CSharp):
    fixed4 mainTex = tex2D(_MainTex, i.uv);
    clip(mainTex.a - _AlphaTestThreshold); // call immediately once we know the alpha value
    ----------------------------------------------------------------------------------
    -separate the fragment shader into 3 parts, just to keep data dependencies to a minimum (a sketch of this layout follows below):

    First, read all textures into temp variables, so that while the GPU is waiting for the texture data it can switch to doing other calculations (not sure if this is true).

    Second, do all the math using the fetched textures above, also storing results in separate temp variables.

    Finally, use all the result temp variables to compute the final result color.
    *Not sure if using more variables to store temp results actually causes trouble (register count limits?)
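
    Roughly what I mean, as a sketch (the texture, property and interpolator names are just placeholders; whether the compiler/GPU actually overlaps things this way probably depends on the driver):
    Code (CSharp):
    fixed4 frag (v2f i) : SV_Target
    {
        // 1. issue all texture fetches up front
        fixed4 albedo = tex2D(_MainTex, i.uv);
        fixed4 detail = tex2D(_DetailTex, i.uv2);
        fixed4 mask   = tex2D(_MaskTex, i.uv3);

        // 2. ALU work that only depends on interpolators can overlap the fetches
        fixed3 tint = i.color.rgb * _TintColor.rgb;

        // 3. combine the fetched data and intermediate results into the final color
        fixed3 result = albedo.rgb * tint;
        result = lerp(result, detail.rgb, mask.r);
        return fixed4(result, 1);
    }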
    ----------------------------------------------------------------------------------
    (confirmed by bgolus)
    -use fixed for everything in the fragment shader to avoid unexpected precision conversions (make sure all calculations stay within the -2~2 range; values outside -2~2 are not always clamped to -2~2).
    (confirmed by bgolus)
    -use the smallest data type whenever possible (fixed3 instead of fixed4 if the alpha channel is not needed); for example, if the result doesn't care about alpha, do all calculations in fixed3, then return it as a fixed4 at the end
    Code (CSharp):
    fixed3 resultColor = ....;
    // many calculations, add or mul to resultColor
    return fixed4(resultColor, 1); // don't care about alpha output
    -use #if / #endif, shader_feature and multi_compile to turn on/off only the calculations needed; no if() in HLSL (see the sketch below)
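    For example, something like this (the _DETAIL_ON keyword and the textures are just made-up placeholders):
    Code (CSharp):
    #pragma shader_feature _DETAIL_ON

    fixed4 frag (v2f i) : SV_Target
    {
        fixed3 col = tex2D(_MainTex, i.uv).rgb;
    #if defined(_DETAIL_ON)
        // only compiled into the variants that need it, so no runtime if()
        col *= tex2D(_DetailTex, i.uv).rgb * 2;
    #endif
        return fixed4(col, 1);
    }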
    ----------------------------------------------------------------------------------
    I wonder if I am doing something wrong / not useful when optimizing my fragment shader?

    I use these PDFs to help me when writing shaders; I wonder if there are other useful resources?
    https://blogs.unity3d.com/wp-content/uploads/2011/08/FastMobileShaders_siggraph2011.pdf
    http://www.realtimerendering.com/downloads/MobileCrossPlatformChallenges_siggraph.pdf

    Because the whole game uses a shared uber shader for 90% of the rendering, the fragment part of this uber shader is really worth optimizing. I know that my current situation is fragment bound (not sure whether ALU bound or texture-fetch bound) because it runs faster if I lower the resolution, and there is no performance gain when all textures are replaced by 2x2 textures. So I wrote this post to ask for help before doing any premature optimization.

    currently testing on:
    -Mali400MP - GALAXY Note II LTE - Samsung ES2
    -Adreno305 - HTC Desire 816 ES3
    -Galaxy Tab A (2016) with S Pen - Mali T830 ES3.1
     
  2. bgolus

    A lot of the first few papers you linked to are desktop-hardware specific. That first one (AMD's Dark Secrets) predates the existence of current mobile GPU hardware, so not everything there is accurate for mobile, but a lot of it is probably still relevant.

    Not always true. Some people are finding that using older Nvidia CG implementations of functions is faster than using AMD's intrinsic functions, for example. For mobile, however, this is probably true, as most of the intrinsic functions are usually more heavily optimized for speed, whereas AMD's GCN intrinsics try to balance speed and accuracy.

    Still true for some mobile GPUs, but PowerVR (ie: Apple devices) since iPhone 5S have native integer hardware so this isn't true anymore. Desktop has had integer hardware for quite a while too. Some operations are still more expensive than the float equivalent, but conversion between int and float is also free on a lot of modern hardware, so (int)((float)myInt / 3) might be faster than myInt / 3. I would say it's still a good rule, but more complicated now.

    Unlikely to be true today, but it also probably doesn't hurt anything, mainly because of #9.

    Was true for a long time for mobile; on current-generation mobile hardware it appears only Mali GPUs still prefer vectorization. Modern Adreno, PowerVR, Nvidia, and AMD GPUs all appear to gain nothing from vectorizing your math, and it can hurt performance in some cases, especially on AMD.
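
    As a rough illustration of what "vectorizing your math" means here (not from the paper; the uv/offset names are just made up):
    Code (CSharp):
    // scalar-ish version: two separate float2 adds
    float2 uvA2 = uvA + _OffsetA.xy;
    float2 uvB2 = uvB + _OffsetB.xy;

    // "vectorized" version: one float4 add - can help on vector ALUs (e.g. older Mali),
    // gains nothing on the scalar ALUs most current GPUs use
    float4 uvPair = float4(uvA, uvB) + float4(_OffsetA.xy, _OffsetB.xy);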

    Ignore; no longer valid as far as I know. We're well beyond Pixel Shader 2.0, even with OpenGL ES 2.0.

    A 1D texture lookup doesn't exist in OpenGL ES, or really in any modern graphics API I know of; they're all just going to be remapped to tex2D. I don't even know if Unity will let you compile a shader that uses "tex1D".

    Some mobile platforms can do signed textures, like PowerVR ... but the bigger issue is Unity doesn't support defining signed textures, so it's moot.

    The second paper from Persson has a lot that can be ignored for mobile as that's a lot of very GCN specific stuff (AMD's current architecture from PS4/XBone/Radeon HD 7000 or later) and a lot doesn't translate to mobile hardware.

    However, the stuff about mad is true for all hardware. Grouping vector and scalar math separately is true for most if not all hardware. Using abs / saturate as described is good for all hardware. Most modern platforms are going to compile sign / step to exactly the same shader code as the lines above them, but some mobile platforms might not!

    The really hard part for figuring out how a mobile shader compiles depends on the specific device it's running on. Two phones from the same company running similar hardware could potentially run the shader completely differently. :(

    The short version for dependent texture reads is just don't modify the UVs in the pixel shader. Swizzling or packing is totally fine. Mali, Adreno, and PowerVR are all going to benefit from this. They probably do need to be coming from TEXCOORDn and not COLORn to avoid dependent texture reads, but I've never looked too deep into that.

    No! Never ever use clip() on mobile unless you absolutely have to! On desktop the above statement is true, but on mobile the entire shader is still computed regardless of where clip() occurs, and it will cause a stall. You're better off using alpha blending, even with a step(0.5, alpha), than clip(). A lot of this is due to how mobile z-buffers work; using clip() disables a ton of optimization and pre-z culling (one of the reasons why you always render alpha test after everything else).
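
    A rough sketch of that alternative (assuming the pass uses regular alpha blending; _Cutoff is just a threshold property):
    Code (CSharp):
    // Blend SrcAlpha OneMinusSrcAlpha is assumed to be set on the pass
    fixed4 frag (v2f i) : SV_Target
    {
        fixed4 col = tex2D(_MainTex, i.uv);
        // binarize alpha instead of calling clip(), so depth/hidden-surface optimizations stay enabled
        col.a = step(_Cutoff, col.a);
        return col;
    }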

    Only potentially true if you're causing dependent texture reads. Otherwise the data is already there.

    This is definitely a good practice, especially using fixed or half when you can, as on mobile these actually mean something. On desktop fixed / half / float are often actually the same thing. Using fixed3 instead of fixed4 is going to be good for any platform that doesn't care about vectorization (which, again, is most of them today); it's actually less of a concern for older hardware, since a fixed3 * fixed3 might actually be the same cost as a fixed4 * fixed4, even though the first document seems to suggest otherwise.
     
  3. colin299

  4. AcidArrow

    How sure are you about this?

    I recently started packing all my uv data, but I started looking into it and I don't think it was a good idea, at least for mobile.

    Apple says on their shader docs that: (from here : https://developer.apple.com/library...cticesforShaders/BestPracticesforShaders.html )

    They do say that on OpenGL ES 3.0-capable Apple devices, dependent texture reads have no overhead, though.

    I'm trying to find if that's true about OpenGL ES 3.0 in general, but I can't find anything concrete. So I'm not sure if the behaviour is the same on Android. Maybe it depends on gfx vendor/driver?
     
  5. bgolus

    Yep, it looks like swizzling and packing is out. Some old desktop GPUs could do it that way and I incorrectly assumed mobile could as well. It's possible some mobile GPUs are capable of it, but it certainly appears that OpenGLES 2.0 class PowerVR GPUs (on Apple devices) cannot.

    The "no overhead" from dependent texture reads for ES 3.0 is nice, but I've not had the chance to look into it too much since my last mobile project was for Daydream VR and I had very limited time to get stuff done for that so i stuck to single texture, vertex only manipulated UV shaders. Would be interesting on my next Daydream project to experiment a bit more there.

    However, as best I can tell, it really is a requirement of ES 3.0 to reduce or remove the overhead from dependent texture reads.
     
  6. Kumo-Kairo

    Regarding "free" dependent texture reads on OpenGL ES 3.0 - have you guys found any info in why is it true and where this performance generosity comes from? Best I could find is that "dependent texture reads are free on OpenGL ES 3.0" and nothing more.
    I'm going to set a few tests myself to see how much is it true on a few devices and will write back here anyway
     
  7. nat42

    I think the closest thing I recall seeing to an answer was an explanation of why they were more expensive on GLES2, and then, reading between the lines, that with improvements in bandwidth and increased shader complexity a hardware optimisation for independent reads is less useful.

    Will be interested to see the results ;)
     
  8. Kumo-Kairo

    Hi guys, got some info.
    I've talked to a graphics guru today and asked him about these dependent texture reads and such. Among other things, he said that there are parallel ALU and TEX blocks on the GPU. His insight was that on modern GPUs (at least on iOS, starting from the iPhone 4S) complete texture prefetches are not common at all. Calling tex2D in a fragment shader still incurs work that is done in that fragment shader and not before it (not in the tiler/vertex shader). This is why, if we check the Adreno profiler's shader analysis of simple calculation shaders, they usually don't have any no-ops at all, but adding one texture fetch adds 6 no-ops to the shader execution no matter what.
    This is because the GPU actually needs those 6 clock cycles to read a texture in the fragment shader (it depends heavily on hardware; my PowerVR SGX544MP tablet needs only 4), and if we don't have any arithmetic operations to fill those 6 clock cycles, or they can't run in parallel with the texture fetch, we get a stall.
    This made me curious about why we then see this performance difference between DTRs and non-DTRs. The reason is that the GPU can't just take a texture coordinate and fetch a texel value from a texture; it needs to calculate some gradients, derivatives and such, and this is what actually gets calculated before the fragment shader. DTRs, on the other hand, don't allow the GPU to make those precalculations, and when we try to fetch a texel value with a tex2D call it has to calculate those parameters first and can't always do so in parallel, so it incurs a whole lot of GPU stalling, waiting for some block to finish its calculations first. And this is the main reason DTRs are so slow on OpenGL ES 2.0 devices.
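
    A rough sketch of the difference being described (the texture and property names are placeholders):
    Code (CSharp):
    // independent read: the UV comes straight from an interpolator, so the
    // gradient/derivative work can be prepared before the fragment shader runs
    fixed4 baseCol = tex2D(_MainTex, i.uv);

    // dependent read: the UV is computed in the fragment shader, so that
    // preparation can only happen after the math - this is where ES2-class
    // hardware stalls
    float2 distortedUV = i.uv + tex2D(_NoiseTex, i.uv).rg * _Distortion;
    fixed4 distortedCol = tex2D(_MainTex, distortedUV);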

    There's also an interesting thing about OpenGL ES 3.0 and DTRs. If I had been a bit more attentive, I would have found that the most important part of the quote about DTRs from Apple's "Best Practices for Shaders" is that it's not about the API level itself, it's about the hardware.
    So it doesn't matter which API is used; these improvements are made at the hardware level, not at the OpenGL driver level. This is why my tests switching between API levels on the same device didn't show any noticeable performance differences.

    Unfortunately, I don't have any external resources to prove these things 100% right (except the Adreno profiler's shader analysis tool), but I'm interested in it myself and I'm still poking the guy for those external resources (articles, talks, whatever), and I'll post them here when I get some.
     
  9. Invertex

    One thing related to DTRs that I haven't had the chance to test yet is whether creating a local variable to store the UV value in the frag shader, without actually altering the UV value, will still cause it to become a DTR, or if the compiler is smart enough to see "hey, the value isn't being altered, so just replace the reference with the one it got the value from".

    I often use a reassignment like that to organize code, but it makes me worry it could be a big issue.
     
  10. bgolus

    Local values don’t exist in a compiled shader.
     
  11. Invertex

    Just to clarify in case there's confusion, I mean local to the function like this:


    Code (CSharp):
    void surf(Input IN, inout SurfaceOutputStandardSpecular o)
    {
        half2 mainUV = IN.uv_BlendMap; // vert shader value stored in a local variable; does this force a DTR to happen?

        float4 blendValues = UNITY_SAMPLE_TEX2D(_BlendMap, mainUV);
        //....
    }
     
  12. bgolus

    Yep, that never exists in the compiled shader.
     
  13. FranFndz

    Old thread, but about the swizzles:

    Checking in the Snapdragon profiler, it did give me some no-ops, e.g. for a.a = a.x;
    I'm always using half; if I see a strange image artifact I identify it and change only that value to float.
    I had many branching and build-time issues (plus in-game shader compilation) using the #if method.
    I deleted it all and mostly keep to blend/lerp instead. I got almost a 40% speed-up on compilation.
    Any property that isn't needed or that ends up with 0 as its value (at the end of the project nobody used that property), I remove and turn into an internal variable so the compiler can optimize it.

    Something that was unexpected but gave me almost a 50% speed-up in a character shader was replacing textures with GPU math for simple shapes. For example, we have a character that has stripes on the rim mask; those stripes were a 64x64 texture made in Photoshop, tiled and masked on the rim (the border of the character). I did the stripes in the shader using sin(), abs(), fmod(), all that stuff that is supposed to be super heavy, and the result was... the clock count went down to less than half of the texture version (Android, Snapdragon and Mali). The character shader was around 200,000 clocks with the texture; it went down to 98,000 (for example).
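
    Not the exact shader, but a rough sketch of that kind of procedural stripe (the frequency/sharpness properties are made up):
    Code (CSharp):
    // replaces a tiled 64x64 stripe texture with a bit of ALU
    fixed ProceduralStripes(float2 uv)
    {
        float wave = sin(uv.x * _StripeFrequency * UNITY_PI * 2.0); // repeating wave along U
        return saturate(abs(wave) * _StripeSharpness);              // fold and sharpen into a stripe mask
    }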
     
  14. bgolus

    That is one of the unfortunate parts of OpenGL, especially for mobile: runtime shader compilation is a real pain. I would still suggest using #if in places where enabling / disabling the use of textures is involved.

    Yep, GPUs are really fast at math these days, even mobile ones. A single sin() a few years ago would have been way slower than your texture sample, but today it's not a problem. Just like on desktop GPUs, memory bandwidth is the main limiting factor, not ALU.