gpu instancing optimization

Discussion in 'Shaders' started by geroppo, Mar 11, 2017.

  1. geroppo

    Joined: Dec 24, 2012 | Posts: 140
    Hi, I had never worked with instancing until a few days ago, and I have a few questions. Currently, instead of using Graphics.DrawMeshInstanced, I'm using Graphics.DrawProcedural, so my first question is:

    1) Is there a difference between them? Can the same result be achieved with both?

    What I'm doing is attaching a few ComputeBuffers to a material (indices and vertex attributes), and in the vertex shader I access them like this:
    Code (CSharp):
    // vertices and indices are the ComputeBuffers attached to the material.
    v2f vert (uint vertex_id : SV_VertexID, uint instance_id : SV_InstanceID)
    {
        v2f o;
        // Fetch the vertex manually through the index buffer.
        vertex_attributes v = vertices[indices[vertex_id]];
        o.vertex = UnityObjectToClipPos(v.vertex);

        return o;
    }
    But for some reason, I'm not able to render as many instances as I thought I would (having seen all the demos with a ton of asteroids and things like that). As in the example code above, I simplified everything as much as I could, but with 4000 instances I was getting 44 ms (on a Radeon HD 6870).
    The problem (as far as I can tell) seems to be the number of vertices in my model (around 1000, so around 3000 triangles...), because the only way I could find to reduce the time was to reduce the number of either instances or vertices in the model.
    Note that I also tried reducing the number of attributes that I send per vertex and per instance (per vertex I only send the vertex position, and per instance the world position, so it's just 6 floats, 24 bytes), but it didn't seem to matter. So my second question is:

    2) Is there anything I can do to optimize this aside from reducing the number of vertices? I'm aware that I can implement culling on the GPU as well (and I plan to do so), but first I want to see how many instances I can get on screen while keeping a good/average framerate.

    Any help is appreciated. Thanks!
     
  2. Fabian-Haquin

    Joined: Dec 3, 2012 | Posts: 231
    I'm not sure DrawProcedural supports instancing, since it doesn't even support shadows... Why are you using a ComputeBuffer?
     
  3. geroppo

    Because I'm using a compute shader to calculate the positions of the objects.
    And what do you mean by DrawProcedural not supporting instancing? I'm still trying to figure out the difference between DrawMeshInstanced and DrawProcedural, because as far as I can tell, they should end up doing the same thing, right?
    But in my tests, drawing 1000 objects takes 2.5 ms with DrawProcedural and 1.8 ms with DrawMeshInstanced, using the same bare-bones shader, so Unity must be doing some optimization that I'm not aware of...
     
  4. Fabian-Haquin

    I don't know the differences between these functions, except that DrawMesh/DrawMeshInstanced support shadows and DrawProcedural does not.

    So I wouldn't be surprised if there were other differences.

    This topic may give you some hints:
    https://forum.unity3d.com/threads/no-shadow-handling-when-calling-graphics-drawprocedural.260579/

    I've never mixed instancing with a ComputeBuffer, since I always use ComputeBuffers with a geometry shader (a mesh of single points from which I create the faces), so instancing is pointless in my case. (Instancing lets you store the mesh data only once on the GPU instead of once per instance.)
     
  5. geroppo

    Alright, I see now what's going on. I used Intel GPA to look at the calls that were issued, and it seems that DrawProcedural internally calls DirectX's DrawInstanced, while DrawMeshInstanced internally calls DirectX's DrawIndexedInstanced, which drastically reduces the number of vertex shader invocations.
    So, in conclusion, DrawProcedural doesn't seem to be a good fit if you are going to draw meshes with a large number of vertices. This is a shame, because I can't do the culling in a compute shader and then call DrawMeshInstanced, but oh well.
    For future reference, a mesh with 618 vertices and 738 triangles takes (on my GPU):
    - with DrawProcedural: 10.7 ms
    - with DrawMeshInstanced: 3.8 ms
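    As a rough sanity check on why the indexed path is cheaper, here is a minimal sketch in plain Python (not Unity code). It assumes that a non-indexed draw runs the vertex shader once per triangle corner, and that an indexed draw, in the best case, shades each unique vertex only once per instance. The function names and the instance count are hypothetical; only the mesh figures come from the post above.

```python
def invocations_non_indexed(triangle_count: int, instance_count: int) -> int:
    """Non-indexed draw: every triangle corner runs the vertex shader."""
    return triangle_count * 3 * instance_count


def invocations_indexed_best_case(unique_vertex_count: int, instance_count: int) -> int:
    """Indexed draw, best case: each unique vertex is shaded once per instance."""
    return unique_vertex_count * instance_count


# Mesh figures from the post; the instance count is an assumed example value.
UNIQUE_VERTICES = 618
TRIANGLES = 738
INSTANCES = 1000  # assumption, for illustration only

non_indexed = invocations_non_indexed(TRIANGLES, INSTANCES)
indexed = invocations_indexed_best_case(UNIQUE_VERTICES, INSTANCES)

print(non_indexed)                      # 2214000
print(indexed)                          # 618000
print(round(non_indexed / indexed, 2))  # 3.58
```

    Under these assumptions the non-indexed path does about 3.6 times as much vertex work, which is in the same ballpark as the measured 10.7 ms versus 3.8 ms. Real hardware will not always reach the best case, so treat this as an estimate only.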
     
    Last edited: Mar 13, 2017
  6. Fabian-Haquin

    That's confusing!
     
  7. geroppo

    I'm sorry. I've attached 2 images below that will hopefully make it clear. In both scenarios I'm rendering the same (or almost the same) number of instances.
    The first one uses Graphics.DrawProcedural:
    Screenshot_6.png
    The second one uses Graphics.DrawMeshInstanced:
    Screenshot_7.png

    Notice that Graphics.DrawMeshInstanced ends up calling DrawIndexedInstanced.
     
  8. Fabian-Haquin

    I understood what you meant; what's confusing is that DrawProcedural calls DrawInstanced, and DrawInstanced calls yet another DrawInstanced... :D
     
  9. geroppo

  10. Fabian-Haquin

    Interesting.

    What was this limit of 1023 you were talking about?

    Have you checked how many batches you get with your different setups?
    If I understand it correctly, you should have only one batch for all your meshes if they are instanced (unless you exceed the 65535-vertex limit).
     
  11. geroppo

  12. Fabian-Haquin

    Do you know why this limit was imposed on DrawMeshInstanced?

    Are you going to code your game entirely on the GPU, including collisions, effects, etc.?
    I guess you will never read the GPU-computed data back on the CPU side because of the bottleneck?
     
  13. geroppo

    I don't know why they set that limit, but I'm guessing it may be related to the "other tasks" that they perform with DrawMeshInstanced, for example the shadow caster pass. Maybe they wanted to keep it sane.

    I don't know whether the last 2 questions were directed at me or whether you were talking about Unity's implementation, but I'm currently working with a lot of particles and moving objects, so I try to offload the CPU as much as I can. In the case of the particles, I generate them with one compute shader and cull them with another, so I never have to read anything back to the CPU; it all stays on the GPU. Once the processed elements are in a compute buffer, I can attach it to a material and access them from the vertex/fragment shader.
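    To make the generate-then-cull flow concrete, here is a minimal CPU-side sketch in plain Python (not the actual compute shaders). All names are hypothetical, and the x-range check is only a stand-in for a real visibility test. The point it illustrates is that the cull step compacts the survivors into a second buffer, and that buffer's length is the instance count to draw.

```python
from dataclasses import dataclass


@dataclass
class Particle:
    x: float
    y: float
    z: float


def generate(count: int) -> list[Particle]:
    """Stand-in for the generating compute shader: particles along the x axis."""
    return [Particle(float(i), 0.0, 0.0) for i in range(count)]


def cull(particles: list[Particle], min_x: float, max_x: float) -> list[Particle]:
    """Stand-in for the culling compute shader: keep only particles in range."""
    visible = []
    for p in particles:
        if min_x <= p.x <= max_x:
            visible.append(p)  # analogous to appending to an output buffer
    return visible


particles = generate(10)
visible = cull(particles, 2.0, 5.0)
instance_count = len(visible)  # on the GPU this count would stay in a buffer

print(instance_count)            # 4
print([p.x for p in visible])    # [2.0, 3.0, 4.0, 5.0]
```

    On the GPU, neither the surviving elements nor their count would be copied back to the CPU; the sketch only mirrors the data flow.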
     
  14. Fabian-Haquin

    I was just curious about your work!

    I'm still wondering why Unity doesn't include a fully GPU-based Shuriken, at least without physics interaction, since it would allow far more particles to be shown and moved.

    Anyway, have fun.