How to measure shader performance?

Discussion in 'Shaders' started by kebrus, Dec 26, 2013.

  1. kebrus

    kebrus

    Joined:
    Oct 10, 2011
    Posts:
    415
    Hi everyone, I was trying to optimize a shader I was working on and then I hit a wall: to put it simply, I couldn't tell if the optimized code was actually better than the non-optimized version.

    For instance, is there any difference between using pow(base, 2) and base*base? For some reason I'm assuming there's additional overhead in using the pow function. How about two lerps versus a few arithmetic calculations including three divisions?
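    Just to make the comparison concrete, this is the kind of thing I mean (variable names are just for illustration):
    Code (CSharp):
    // Version A: library call
    half a = pow(base, 2.0);

    // Version B: plain multiplication
    half b = base * base;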

    How can I measure the performance of a shader across very small changes?
     
    bobbaluba likes this.
  2. imaginaryhuman

    imaginaryhuman

    Joined:
    Mar 21, 2010
    Posts:
    5,834
    Well, I guess I would set up a simple scene with a piece of geometry and the shader applied to it, then measure the framerate, and if there's too little rendering to tell, draw the same geometry lots of times per frame (duplicate the object). If you have Unity Pro I guess you can use GPU profiling?

    Also you could look at the compiled shader code and see how many instructions are being used.

    Generally divisions are slow/bad... try to convert them to multiplications where possible, e.g. 10/5 could become 10*(1.0/5), which you can then change to 10*0.2. Some of the built-in language functions do what they do as optimally as possible, so it's good to use them instead of trying to reproduce their functionality yourself... UNLESS you know better: if you're informed about what needs to happen, what result you want, and what's relevant and what isn't, you can take shortcuts or make optimizations the compiler wouldn't be smart enough to do. In your case I would use base*base instead of pow; I would be surprised if the pow version was faster. But testing it and seeing a real number to represent your changes is best.
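    For example, a divide by a uniform can usually be rewritten as a multiply by a precomputed reciprocal (the property and variable names here are just made up for illustration):
    Code (CSharp):
    // Instead of dividing per fragment...
    // col.rgb = col.rgb / _Scale;

    // ...multiply by the reciprocal, which is cheaper on most GPUs.
    // Ideally compute invScale once (per vertex, or on the CPU) rather than per fragment.
    float invScale = 1.0 / _Scale;
    col.rgb = col.rgb * invScale;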

    Usually, reading textures is the slowest part of most shaders; the more reads you do, the slower it gets. Other instructions tend to be fairly fast, provided you're not on a mobile-style GPU, which can get slow quite quickly if the shader is doing a lot.
     
  3. kurylo3d

    kurylo3d

    Joined:
    Nov 7, 2009
    Posts:
    1,123
    Curious about your response. You said reading textures is slow. I am going to test something on mobile, but maybe you might have an answer already anyway. I am thinking about using a vertex shader to blend textures, or select textures by vertex color... For example, instead of 3 materials for 3 textured objects for 3 draw calls: 1 material, with a vertex shader, on 1 combined textured object that uses vertex color to select the texture.

    Would reducing the draw calls this way be beneficial on mobile? Does having 3 textures in a shader for vertex blending hurt anything other than memory?
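    Roughly, I'm picturing a fragment shader along these lines (just a sketch, the texture names are made up), where the vertex color channels act as weights for the three textures:
    Code (CSharp):
    fixed4 frag (v2f i) : SV_Target
    {
        fixed4 t1 = tex2D(_Tex1, i.uv);
        fixed4 t2 = tex2D(_Tex2, i.uv);
        fixed4 t3 = tex2D(_Tex3, i.uv);
        // vertex color R/G/B select (or blend) the three textures
        return t1 * i.color.r + t2 * i.color.g + t3 * i.color.b;
    }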
     
  4. hippocoder

    hippocoder

    Digital Ape

    Joined:
    Apr 11, 2010
    Posts:
    29,723
    Your biggest gains come from doing as much as you can in vert; this is always fast - even division. Branches are slow and should be avoided wherever possible in both frag and vert. Frag should be as simple as you can possibly get it. Avoid pow() and other expensive math operations. Better to use a lookup texture if you really have to do stuff like that.
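    A rough sketch of what I mean by pushing work into vert - the division happens once per vertex and the result rides an interpolator into frag (names like _FadeHeight are just placeholders):
    Code (CSharp):
    struct v2f
    {
        float4 pos  : SV_POSITION;
        float2 uv   : TEXCOORD0;
        float  fade : TEXCOORD1;   // computed per vertex
    };

    v2f vert (appdata_full v)
    {
        v2f o;
        o.pos = UnityObjectToClipPos(v.vertex);
        o.uv = v.texcoord.xy;
        // division done per vertex instead of per fragment
        o.fade = saturate(v.vertex.y / _FadeHeight);
        return o;
    }

    fixed4 frag (v2f i) : SV_Target
    {
        // frag stays as simple as possible
        return tex2D(_MainTex, i.uv) * i.fade;
    }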
     
    jniac likes this.
  5. kebrus

    kebrus

    Joined:
    Oct 10, 2011
    Posts:
    415
    Thanks for the responses. I tried the whole multiple-objects test but it's really hard to tell from the fluctuations; maybe I don't have enough objects in the scene, or the test itself is irrelevant.

    I already avoid using branches; what I usually do instead is "lerping" or "stepping", but it seems like a waste of calculation to me.
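    For example, instead of an if I write something like this (made-up names):
    Code (CSharp):
    // branch version (avoided):
    // if (height > _Threshold) col = snowCol; else col = rockCol;

    // branch-free version with step + lerp:
    fixed mask = step(_Threshold, height);     // 0 or 1
    fixed4 col = lerp(rockCol, snowCol, mask);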

    From experience on iPad 1, working with multiple textures in the same shader is a big no-no; something like 4 textures already drops the fps significantly.

    Any other thoughts on the subject would be highly appreciated.
     
  6. Dolkar

    Dolkar

    Joined:
    Jun 8, 2013
    Posts:
    576
    You can use AMD's ShaderAnalyzer, which will provide you with some useful data about a shader's performance.
    Note that a simple float a = (b > c) ? d : e; is NOT branching and thus is as fast as, or maybe even faster than, "stepping".
    Texture reads are tricky. If you just sample a texture with the UVs you get from the vertex shader, then it's basically free, because it can be pre-fetched by the hardware. On the other hand, if you must do some computation in the fragment shader to get the UV coordinates, then it's a dependent texture read, which can halt the execution of the shader until the data arrives... the latency can even be >100 cycles, depending on the texture format and filtering.
    That means that these days it's much faster to just do a pow() than a dependent, cache-thrashing texture read, at least on desktop.
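    To illustrate the difference (the texture and property names are made up):
    Code (CSharp):
    // Non-dependent read: UVs come straight from the interpolator,
    // so the texels can be prefetched before the fragment shader runs.
    fixed4 a = tex2D(_MainTex, i.uv);

    // Dependent read: the UVs are computed in the fragment shader
    // (here from another texture), so the fetch can't start until that finishes.
    float2 distortedUV = i.uv + tex2D(_DistortTex, i.uv).rg * _Strength;
    fixed4 b = tex2D(_MainTex, distortedUV);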
     
  7. imaginaryhuman

    imaginaryhuman

    Joined:
    Mar 21, 2010
    Posts:
    5,834
    Depends how many texture units the device has, too. The old iPad 1 might've been limited to 2 or 4? ... Anyway, yes, reading textures is a slowdown... but you can read more textures overall if you read more than one within a single shader pass; it seems 30-50% more reading is possible that way. I would think the vertex blending of multiple textures should be faster than several draw calls.
     
  8. Dolkar

    Dolkar

    Joined:
    Jun 8, 2013
    Posts:
    576
    read + read + read + write
    is of course faster than
    read + write + read + write + read + write
     
  9. duke

    duke

    Joined:
    Jan 10, 2007
    Posts:
    763
    How does a pow() relate to or replace a texture read or uv manipulation? Or are you just comparing an expensive function with a texture read?
     
  10. hippocoder

    hippocoder

    Digital Ape

    Joined:
    Apr 11, 2010
    Posts:
    29,723
    Pretty sure that depends on the platform/gpu.

    It most likely is a branch on some hardware, or if you're lucky, it gets turned into a step.
     
  11. Dolkar

    Dolkar

    Joined:
    Jun 8, 2013
    Posts:
    576
    I'm comparing the cost of a once-expensive function with a cache-thrashing, bottleneck-introducing, evil read of a precomputed pow lookup texture.

    I'm 100% sure that on ALL hardware that even supports dynamic branching (and I'm fairly certain older cards were fine with it too), it gets turned into a super cheap conditional assignment in one way or another. Ironically enough, your step actually compiles into the same conditional assignment.
     
  12. frogsbo

    frogsbo

    Joined:
    Jan 16, 2014
    Posts:
    79
    For one platform, a solution:

    Empty scene, 1 mesh, the shader full screen, perhaps rotating; measure Unity's average framerate over 15 seconds. A really heavy shader simply slows down the framerate, and you can measure changes of 1 percent that way. It takes 2 minutes to construct a shader benchmark scene.
     
  13. Lex-DRL

    Lex-DRL

    Joined:
    Oct 10, 2011
    Posts:
    140
    Sorry for necroposting; I think it's better than opening a duplicate thread.

    Still haven't found an answer: how do you measure the actual performance of each statement? How do you decide, while writing a shader, which way of doing something would be faster?
    What's the final unit performance is measured in? Is it the number of instructions, or what?

    I've been a shader programmer for a while, but I still need to rely on testing to tell what is faster.

    How can I tell which would be faster? And how much faster, exactly?
    For example, is a function call faster, slower or the same as calling a macro?
    How much slower is pow() than two, three, four, five... multiplications?
    How much slower would a texture read (with "indirect" UVs) be compared to pow()? And compared to pow() plus a multiplication? pow() plus a divide?
    Are two "clr *= someVar;" statements the same speed as a single "clr *= var1 * var2;"?

    If I need to change only rgb components in my frag shader, which would be faster:
    Code (CSharp):
    // clr is fixed4
    clr.rgb *= i.color.rgb;
    return clr;
    or passing "color.rgb" and "color.a" as two separate fixed3 and fixed variables from vertex shader and doing stuff like this:
    Code (CSharp):
    // clr is fixed3
    // there's also fixed alpha
    clr *= i.color.rgb;
    return fixed4(clr, alpha);
    In short, I still don't get how you can estimate the performance of each separate piece of code before you've actually written the entire shader and tested it as a whole.
     
  14. Dolkar

    Dolkar

    Joined:
    Jun 8, 2013
    Posts:
    576
    Half of your questions can be answered by looking at the compiled shader. For DX11 platforms, Unity unfortunately does not provide it in a readable format. What I use, though, is the above-mentioned AMD Shader Analyzer, which lets you see the actual instructions behind the shader and even roughly how they perform on some (rather old) AMD cards.
    The rest is just experience and general knowledge. Both functions and macros perform identically because in the compiled shader, there are no functions or macros, just straight up code. Two multiplications are also the same no matter what syntax you use.
    In your last example, if you're not doing anything else with the values, I'd say passing it directly as a vector of 4 is a tiny bit faster because you avoid a mov instruction in the end. But if you really don't do anything else in the pixel shader, then it does not matter at all, because 99% of the time will be spent elsewhere in the pipeline... rasterizer, interpolators, ROPs.. or even just waiting for the memory. Bandwidth is often the most limiting factor on desktop GPUs after all.

    It is a lot more difficult to figure out the rest. If you look at the compiled code, you'll see pow(x, y) is compiled into log2, mul and exp2 instructions. pow(c, y), where c is a constant, only needs a mul + exp; pow(x, c) depends on the value of c and often gets turned into multiplications instead where possible.
    Now, if you're asking how much more expensive an exp2 instruction is than a multiplication, then there's no satisfactory answer. It's different between AMD, Nvidia, mobile GPUs, desktop GPUs, new cards, old cards... Even the compiler does not know, in the case of HLSL. It might have a vague idea though... like, it turns pow(x, 512) into 9 multiplications, but pow(x, 1024) into a log, mul and exp... Either way, you can see that it's often safer to rely on the compiler to do low-level optimizations for you.
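    In other words, these are roughly equivalent (a sketch, not exact compiler output):
    Code (CSharp):
    // pow with a variable exponent:
    float r1 = pow(x, y);           // compiles to roughly: log2, mul, exp2
    float r2 = exp2(y * log2(x));   // the same thing written out by hand

    // pow with a small constant exponent is often turned into multiplies:
    float r3 = pow(x, 4.0);         // likely becomes (x*x)*(x*x)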

    Texture reads are a whole other beast. When a shader core "sends a read request", it often does not just idly wait for the data to come back; it switches to work on other code that is ready to process instead, possibly even in a completely separate pixel shader invocation. Generally you should have a lot more arithmetic instructions than texture fetches for that reason. The delay until the data comes back further depends on the texture format and filtering used, and most importantly on whether it has been fetched before and is currently stored in cache. Spatially coherent reads are cache-friendly reads. That means, if neighboring pixels need to fetch texels that are also close to each other, it's much faster than if they are sampled all over the texture randomly. Since cache memory is limited, that also means reading from very small textures very often is relatively cheap - but most likely not cheap enough to make lookup textures for a single pow function worth it.

    My point with all this is that measuring every separate piece is not nearly enough. Hell, even testing the entire shader on some random data isn't accurate. You need to profile the shader in the actual, real scenario to get a good picture of the overall performance.

    As a disclaimer, I specialize in desktop graphics, so whatever I just said might or might not be drastically different on mobile.
     
    Last edited: Jun 19, 2015
  15. Jonny-Roy

    Jonny-Roy

    Joined:
    May 29, 2013
    Posts:
    666
    Do this: just repeat the command you want to test loads of times to get a decent benchmark of its speed.
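    Something along these lines (a rough sketch; the loop count and the operation under test are arbitrary):
    Code (CSharp):
    fixed4 frag (v2f i) : SV_Target
    {
        fixed4 col = tex2D(_MainTex, i.uv);
        // repeat the operation under test so its cost dominates the frame time,
        // then compare the average framerate of each variant
        for (int n = 0; n < 256; n++)
        {
            col.rgb = pow(col.rgb, 2.2);       // variant A
            // col.rgb = col.rgb * col.rgb;    // variant B: swap in to compare
        }
        return col;
    }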
     
  16. hippocoder

    hippocoder

    Digital Ape

    Joined:
    Apr 11, 2010
    Posts:
    29,723
    Someone liked this post recently so I decided to clarify this a little with things I learned from @bgolus not long ago, and that is: it's no longer cut and dried with GCN+ architectures, since there are a number of potential bottlenecks. So these days it's more - use the right tool for the job, unless it's a big job in which case hire the guy with the neverending story avatar and don't look back.
     
    AcidArrow and bgolus like this.
  17. bgolus

    bgolus

    Joined:
    Dec 7, 2012
    Posts:
    12,343
    To clarify a little, it's no longer "cut and dried" with basically any Shader Model 4.0 or better hardware, including OpenGL ES 3.0 mobile GPUs and almost all desktop GPUs for nearly the last decade. The raw ALU performance (how fast GPUs calculate math operations) has far outstripped memory bandwidth. 10 years ago, transferring a float4 from the vertex to the fragment had a greater cost than calculating that same data in the fragment shader from other data if it used ~8* or fewer instructions.
    * I honestly can't remember the actual number; it might have been as high as 12 instructions.

    Ten years ago!


    An Nvidia GTX 260 bought in 2009 had a performance of around 550 GFLOPs. A GTX 1060 is >4000 GFLOPs, so an almost 8x increase in ALU. The GTX 260's memory bandwidth was 111 GB/s; the GTX 1060's is 192 GB/s, so less than a 2x increase in memory bandwidth. The RTX 2060 and Vega 56 GPUs are only in the 400~500 GB/s range, so a 4x increase in bandwidth vs 12~20x the GFLOP numbers compared to the GTX 260.

    Also, don't try to hire me. There are plenty of talented individuals out there capable of writing shaders and I am already gainfully employed with little free time to devote to contract work ... counter to what my post frequency might imply.
     
    Last edited: May 14, 2019