
Experiments with Instancing and other methods to render massive numbers of skinned Meshes

Discussion in 'General Graphics' started by Noisecrime, Dec 23, 2016.

  1. Noisecrime

    Noisecrime

    Joined:
    Apr 7, 2010
    Posts:
    2,051
    Hi,

    For the past week I've been contemplating and experimenting with methods to render massive numbers of skinned meshes in Unity. Initially I was using DrawMeshInstanced, but quickly switched over to DrawMeshInstancedIndirect due to the potential performance benefits it offers. I figured it would be worth writing down my thoughts, findings, progress and bugs as a starting point for anyone else.

    Current progress is illustrated with the video below, which is best viewed fullscreen and using HD quality.



    It shows 10,000 Lerpz skinned meshes. Each instance has a unique colour tint (mostly for debugging, but in future to show that the appearance of each can easily be changed) and is running its own independent animation, with frame interpolation. The cool thing about DrawMeshInstancedIndirect is that the frustum culling and dynamic LOD selection, as well as the instancing itself, are all done on the GPU. This saves a huge chunk of CPU time, and in the video you can see framerates of between 160 and 250 fps depending on the number of instances visible.

    As for potential, I've had it running 40,000 instances and managed 30 fps on my GTX 970. I'll provide some more details and add my thoughts and findings to the thread later.
     
    TMPxyz, colin299, jay3sh and 10 others like this.
  2. Noisecrime

    Noisecrime

    Joined:
    Apr 7, 2010
    Posts:
    2,051
    So as far as I'm aware there are two main methods to feasibly achieve a high number of 'skinned' mesh animation instances in Unity.


    Baked Meshes
    This method involves creating a unique mesh for every frame, of every animation, of every model and every LOD in the game. Then you choose the correct frame and use DrawMesh to render it. Even though the technique sounds awful, from what I've seen it is perfectly capable of achieving a high number of rendered models on a good system.

    Drawbacks
    • Potentially heavy memory requirements depending upon the framerate of your animations, number of models, number of animations and number of LODs.
    • The animation is basically like a 'flip book', as you are simply presenting a static mesh each frame. Granted, we've had that in 2D games with sprites for decades, but I'm skeptical that it will feel good when mixed with other non-baked animations or physics-driven objects, or if the framerate is not a multiple of the baked animation rate.
    • Culling and LOD (generally) still have to be done on the CPU in order to use DrawMesh.

    Positives
    • Interacts as expected with all the rest of the Unity rendering systems with no additional effort.
    • Relatively straightforward to implement.
    • Doesn't tax the GPU.
    • Works for older GPU systems as long as there is enough memory.

    In fact it is entirely possible that the downsides could be addressed with some clever tricks and by pushing more work to the GPU. For example, providing the next frame's mesh data alongside the current mesh so you could interpolate between the two in the vertex shader, akin to the old MD2 Quake format, should be feasible.
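    A rough sketch of the C# side of that idea (all names here are my own, not from any actual implementation): stash the next baked frame's vertex positions in a spare UV channel, so a custom vertex shader can lerp between the current position and the UV3 position using a per-draw blend factor.

```csharp
using System.Collections.Generic;
using UnityEngine;

public static class BakedFrameInterpolation
{
    // Copy the baked mesh for 'frame' and store the next frame's vertex
    // positions in UV channel 3. A custom vertex shader can then lerp
    // between v.vertex and the UV3 position, MD2-style.
    public static Mesh BuildFrame(Mesh[] bakedFrames, int frame)
    {
        Mesh current = Object.Instantiate(bakedFrames[frame]);
        Mesh next = bakedFrames[(frame + 1) % bakedFrames.Length];

        var nextPositions = new List<Vector3>(next.vertices);
        current.SetUVs(3, nextPositions);
        return current;
    }
}
```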



    Skinned Instanced Meshes
    Generally a more complex method, as it requires implementing your own skinning on the GPU as well as supporting instancing and dealing with a bunch of other stuff. This requires extracting all the bone animation data and passing it to the GPU, where it can be used with custom shaders along with the instance ID to render the mesh and animate it with bone matrix palette skinning. There is a good example/source for this on the NVIDIA website and in GPU Gems 3.

    Thankfully, with ComputeBuffers you no longer have to pass the data via textures like they did in 2007. Though everything I've read implies that using textures and bilinear interpolation can automatically provide inter-frame interpolation. I'm not sure about this, as my understanding is you cannot simply lerp two matrices. The positional part should be fine, but weird things are going to happen to the rotations. I'm going to have to give it a try sometime and see, as one of the biggest drawbacks of this method is that it will make your GPU cry, due to the amount of effort placed on the vertex shader.
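    For anyone following along, the skeleton of this approach looks roughly like the sketch below (field and shader property names are mine, purely illustrative): upload the baked bone matrices once into a ComputeBuffer, set up the indirect args, then issue a single indirect instanced draw each frame.

```csharp
using UnityEngine;

public class InstancedSkinningDraw : MonoBehaviour
{
    public Mesh mesh;
    public Material material;             // shader must support instancing
    public Matrix4x4[] bakedBoneMatrices; // boneCount * frameCount entries, baked offline
    public int instanceCount = 10000;

    ComputeBuffer boneBuffer, argsBuffer;

    void Start()
    {
        // One big buffer of bone matrices; the vertex shader indexes it with
        // the instance's frame offset to do matrix-palette skinning.
        boneBuffer = new ComputeBuffer(bakedBoneMatrices.Length, 16 * sizeof(float));
        boneBuffer.SetData(bakedBoneMatrices);
        material.SetBuffer("_BoneMatrices", boneBuffer);

        // Indirect args: index count, instance count, start index, base vertex, start instance.
        uint[] args = { mesh.GetIndexCount(0), (uint)instanceCount,
                        mesh.GetIndexStart(0), mesh.GetBaseVertex(0), 0 };
        argsBuffer = new ComputeBuffer(1, args.Length * sizeof(uint),
                                       ComputeBufferType.IndirectArguments);
        argsBuffer.SetData(args);
    }

    void Update()
    {
        // Deliberately huge bounds - culling is handled on the GPU instead.
        Graphics.DrawMeshInstancedIndirect(mesh, 0, material,
            new Bounds(Vector3.zero, Vector3.one * 1000f), argsBuffer);
    }

    void OnDestroy() { boneBuffer.Release(); argsBuffer.Release(); }
}
```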

    Drawbacks

    • It will use every ounce of your GPU power. The bone matrix palette skinning and all the lookups required are a constant overhead, and it's per-vertex! This is what makes LOD so important, as every vertex saved means saving considerable processing time in the vertex shader.
    • Doesn't always play nice with Unity rendering systems due to instancing. I think there might be a number of gotchas coming up with this, such as supporting light probes, or forward rendering not working with multiple lights. I know there is currently a bug in the demo where shadows no longer respect the positions of the drawn instances. I think this is shader related, as I'm sure it was working fine before adding the frustum culling method or LOD.
    • Shadows are an issue, as they require rendering the instances again, or with cascades maybe several times. Since, as stated, the bottleneck of this technique is the vertex skinning, that will become amplified. Essentially every time you render the instances again it will halve your framerate. This could be alleviated with 'stream out', where you store the shader's resultant geometry on the GPU, which AFAIK is what Unity uses for its own GPU skinning. However, due to the sheer number of instances being rendered, this would be prohibitive here and worse than baked meshes in terms of memory requirements.
    • Forward rendering has the same issue as shadows: every light requires an additional add pass, which just becomes prohibitively expensive as you are running all the vertex calculations again. However, so far my experience has been that forward rendering with multiple lights is just broken, and even if it worked, according to the docs the add-pass instances would be rendered normally instead of instanced. It might just be feasible, though, if we can build on the custom shader provided by Valve for its VR Lab Renderer, which I believe supports quite a few lights and shadows in forward rendering without using the add-pass technique.

    Positives
    • Greatly reduced amount of data to store on the GPU due to frame interpolation. In the above demo the animation is stored at just 10 fps, however that can easily be increased to say 30 fps and still only take a fraction of the storage that baked meshes would.
    • Can easily off-load frustum culling and LOD selection to the GPU, which can save a good chunk of CPU time. In addition I want to add per-instance depth sorting to minimize overdraw (not sure how much of a win in deferred that will be). Taking it further, you could even drive the entire crowd on the GPU using simulation.
    • To a degree it's easily scalable to your hardware: simple to adjust the number of instances, use lower vertex count models, dynamically change LOD settings etc.
    • It's even possible to drive the instances via a Mecanim Animator, though it's not possible to have an Animator per instance, not even close, and performance will suffer.
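    To illustrate the culling point above, here is a sketch of the C# dispatch side of GPU frustum culling (kernel and buffer names are made up for illustration; the compute shader itself is omitted). The kernel tests every instance against the frustum and appends survivors to an append buffer; CopyCount then writes the visible count straight into the indirect args, so the CPU never reads anything back.

```csharp
using UnityEngine;

public class GpuCulling : MonoBehaviour
{
    public ComputeShader cullShader;   // assumed to contain a "CullInstances" kernel
    ComputeBuffer allInstances;        // every instance's transform/state
    ComputeBuffer visibleInstances;    // ComputeBufferType.Append
    ComputeBuffer argsBuffer;          // indirect draw arguments

    void Cull(int instanceCount)
    {
        visibleInstances.SetCounterValue(0);
        int kernel = cullShader.FindKernel("CullInstances");
        cullShader.SetBuffer(kernel, "_AllInstances", allInstances);
        cullShader.SetBuffer(kernel, "_VisibleInstances", visibleInstances);
        cullShader.Dispatch(kernel, Mathf.CeilToInt(instanceCount / 64f), 1, 1);

        // Copy the append-buffer counter into args[1] (the instance count slot),
        // entirely on the GPU.
        ComputeBuffer.CopyCount(visibleInstances, argsBuffer, sizeof(uint));
    }
}
```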




    Driven by Mecanim
    It's possible to drive the skinned instance method via Mecanim, but it cannot have each instance using an individual animator/animation.

    Mecanim is pretty amazing, but it has a reasonably large overhead, one that is considerably worse when you cannot use the 'optimize gameObjects' option. That option cannot be used, as currently the only way to get the animating bone data is to fetch the transforms of each bone. If only the Animator component could supply an array of Matrix4x4 for each bone instead of driving transforms, you could probably double the number of Mecanim animations driving instances. However, this would still end up as a fraction of the potential instances that could be rendered.
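    The per-bone fetch being described might look something like this sketch (class and field names are mine): each frame, read the Animator-driven bone transforms back on the CPU and compose the skinning matrices, which is exactly the overhead that 'optimize gameObjects' would otherwise remove.

```csharp
using UnityEngine;

public class MecanimBoneSource : MonoBehaviour
{
    public Transform[] bones;       // from SkinnedMeshRenderer.bones
    public Matrix4x4[] bindposes;   // from sharedMesh.bindposes
    Matrix4x4[] skinning;

    void LateUpdate()
    {
        if (skinning == null) skinning = new Matrix4x4[bones.Length];
        Matrix4x4 rootInverse = transform.localToWorldMatrix.inverse;

        // This per-bone transform fetch is the CPU cost being discussed.
        for (int i = 0; i < bones.Length; i++)
            skinning[i] = rootInverse * bones[i].localToWorldMatrix * bindposes[i];

        // skinning[] can now be uploaded to a ComputeBuffer shared by all instances.
    }
}
```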


    Its all rather complex
    Once you have your chosen method up and running, things are still more complex to deal with than normal, as the main point of both systems is to completely remove/detach the rendering of instances from Unity's gameObject model.

    It's the gameObject model that can really hammer performance once you scale up to 10,000 or more objects. Modifying the transforms, updating bone transforms etc. all adds up as an overhead. Both the suggested systems avoid gameObjects per instance and instead should work with arrays of position/rotation data (Matrix4x4), but this means it's somewhat harder to create a generic system that could easily be plugged into any project, and it would require the developer to drive their game more via code.


    So Many Possibilities
    Currently I'm undecided as to which method is best, or indeed if there even is a best. I suspect each has its place depending upon project requirements. Though both have some serious drawbacks, I believe they can be addressed with some lateral thinking and effort.

    Beyond that there is the consideration of variation. It's all very well rendering 10,000 instances of the same model, but even if the animation of each is independent, they all look the same. Colour tinting on its own isn't enough, so considerable effort will have to be employed to find the most optimal methods of creating variation using the same input data. There are a number of avenues to pursue for this, from simple instancing of parts (e.g. different heads, helmets, weapons, clothing) to more advanced concepts such as Valve's Left 4 Dead gradient mapping.
     
  3. pld

    pld

    Joined:
    Dec 12, 2014
    Posts:
    7
    Great post. My .02 on this:

    But you can! Hand-wavy proof: you can take the derivative of a matrix-valued function; as long as your derivative is relatively well behaved, you can pretend that the derivative is constant (for small time values). This is the idea behind nlerp.

    Another hand-wavy proof: think of what's going to happen to your basis vectors. For small rotations, they'll remain mostly orthogonal, and mostly unit-length. For larger rotations you'll start seeing more issues.
     
    Noisecrime likes this.
  4. Noisecrime

    Noisecrime

    Joined:
    Apr 7, 2010
    Posts:
    2,051
    Yeah, for simplicity I just lerped two matrices during testing and didn't notice anything horrendous happening. I guess the fact that it's always lerping between two fixed matrices helps (i.e. it's not accumulative), and that generally the keyframes are close enough together that it remains well enough behaved.
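    For reference, the naive approach being described is just an element-wise lerp (a minimal sketch, not code from the demo). Strictly it denormalizes the rotation part - the lerped basis vectors shrink slightly toward the chord - but for closely spaced keyframes the error stays small, matching pld's hand-wavy proof.

```csharp
using UnityEngine;

public static class MatrixLerp
{
    // Element-wise lerp of two 4x4 matrices. Fine for closely spaced
    // keyframes; for large rotation deltas the basis vectors will
    // shrink and lose orthogonality.
    public static Matrix4x4 Lerp(Matrix4x4 a, Matrix4x4 b, float t)
    {
        var m = new Matrix4x4();
        for (int i = 0; i < 16; i++)
            m[i] = Mathf.Lerp(a[i], b[i], t);
        return m;
    }
}
```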
     
  5. Afif-Faris

    Afif-Faris

    Joined:
    Oct 11, 2013
    Posts:
    16
    Whoa, that's awesome!
    Any chance you will share the Unity source project from the video? I want to learn how you did the GPU instancing and the LOD system.

    I am not sure if you already know this, but it's related.
    This game was made in Unity: 25k animated characters in a real-time battle, and the performance is amazing. I don't know how they did it.
     
  6. bgolus

    bgolus

    Joined:
    Dec 7, 2012
    Posts:
    12,329
    How are you reading the animation data? Are you parsing files on the disk directly, or do you have some method of reading the curves from animation clips directly?
     
  7. Noisecrime

    Noisecrime

    Joined:
    Apr 7, 2010
    Posts:
    2,051
    At some point I may release the code or put it up on the asset store. The problem is that so far it is very specific and not general enough to simply be plugged into any project.

    The EpicBattleSimulator is, or at least was, based around using baked meshes and DrawMesh in order to achieve its impressive performance. The developer spoke about this in a thread on the Unity forums that was discussing, in general terms, producing an RTS in Unity with many thousands of units - a Google search should find it.

    The key point is that having tens of thousands of gameObjects in Unity will never perform adequately, as it's not designed for that. Therefore, instead of having each unit defined as a gameObject, you would create a class that deals with rendering the models via Graphics.DrawMesh() instead. The use of baked meshes means you can avoid the expensive CPU bone/vertex animation costs, as you simply switch between meshes like a sort of flipbook.
     
  8. Noisecrime

    Noisecrime

    Joined:
    Apr 7, 2010
    Posts:
    2,051
    I simply created my own animation format (a Unity ScriptableObject as a container for the data) that extracted the matrices of each bone for each frame of animation. This data is then supplied as a matrix array via a compute buffer so that GPU skinning can be used for each instance, and each instance can have its own frame index into the animation data.

    It's pretty simple to extract the animation data from legacy animations via AnimationClip.SampleAnimation(); Mecanim is a bit harder/more awkward.
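    A sketch of what that extraction might look like (class and parameter names are mine, not Noisecrime's actual format): SampleAnimation poses the hierarchy at a given time, after which the per-bone skinning matrices can be read off the transforms using the root-inverse * bone * bindpose composition given later in this thread.

```csharp
using UnityEngine;

public static class AnimationBaker
{
    // Bake one legacy clip into a flat array of skinning matrices,
    // frameCount * boneCount entries, ready for upload to a ComputeBuffer.
    public static Matrix4x4[] Bake(GameObject go, AnimationClip clip,
                                   SkinnedMeshRenderer smr, int fps)
    {
        Transform[] bones = smr.bones;
        Matrix4x4[] bindposes = smr.sharedMesh.bindposes; // matches the bones array
        int frameCount = Mathf.CeilToInt(clip.length * fps);
        var result = new Matrix4x4[frameCount * bones.Length];

        for (int f = 0; f < frameCount; f++)
        {
            clip.SampleAnimation(go, f / (float)fps);   // pose the rig at this time
            Matrix4x4 rootInverse = go.transform.localToWorldMatrix.inverse;
            for (int b = 0; b < bones.Length; b++)
                result[f * bones.Length + b] =
                    rootInverse * bones[b].localToWorldMatrix * bindposes[b];
        }
        return result;
    }
}
```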
     
    bgolus likes this.
  9. Deleted User

    Deleted User

    Guest

    I've also made a similar open-source project based on this post.

    project@github

    screenshot.jpg
     
    vanger1, tinyant, Rewaken and 10 others like this.
  10. AndreaBrag

    AndreaBrag

    Joined:
    Jan 30, 2015
    Posts:
    7
    Hey @Noisecrime,

    Great video! I just started looking into rendering techniques to achieve decent frame rates with a couple of thousand units, and GPU skinning surely looks worth a shot!
    I couldn't find many resources online regarding DrawMeshInstanced/DrawMeshInstancedIndirect and the overall topic - would you mind answering some questions?

    - What's the actual difference between DrawMeshInstanced and DrawMeshInstancedIndirect?
    - Is there some documentation you can share on how to achieve this result?
    - What is, in your opinion, the best way to extract bone animation data (API wise)? What about storing it in a texture using RGB as coordinates, as they showed in a Unite16 video?

    Thank you for your time.
     
    Last edited: Mar 28, 2017
  11. dreamerflyer

    dreamerflyer

    Joined:
    Jun 11, 2011
    Posts:
    927
    Hi, I tested your demo; it has some performance issues and fails to run on an iPad mini 2.
    gpuskinning bug.jpg bug3.jpg bug2.jpg
     
  12. alfiare

    alfiare

    Joined:
    Feb 10, 2017
    Posts:
    30

    Hi there, so I'm new to all this but I'm loving the instancing.
    I'm attempting to get the animation data out of animation clips and feed it to a compute buffer that can then be read in a vertex shader to transform the object based on the bones. I'm having some trouble getting the transforms coming from the bones right. I'm doing the SampleAnimation call on the clip and then reading the localToWorldMatrix off the transform of each bone in the SkinnedMeshRenderer. But when I apply these in the vertex shader, the pieces of the mesh all come apart. Maybe I'm just grabbing the wrong thing off the bone transform? How did you get the matrices for the bones from the animation clip sampling?
     
  13. Noisecrime

    Noisecrime

    Joined:
    Apr 7, 2010
    Posts:
    2,051
    If I remember correctly it's

    root * bone.localToWorldMatrix * bindposeMatrix

    where
    root = go.transform.localToWorldMatrix.inverse
    and bindposeMatrix is accessed via skinnedMeshRenderer.sharedMesh.bindposes, which is an array that matches the bones array.
     
  14. marwi

    marwi

    Joined:
    Aug 13, 2014
    Posts:
    138
    Hello, just found your thread, as I've been diving into instanced animation as well for roughly a week now. Here's a short video of some horses running around :)
    Have you used the systems in one of your projects, or would you share some more information about your learnings so far?
     
  15. marcatore

    marcatore

    Joined:
    May 22, 2015
    Posts:
    160
    @marwi good examples in your tweets.
    Have you planned to release or share your system?
    I'd like to make something similar to manage a crowd in an efficient way, but I'm really a noob at how to create it.
    Do you know any kind of tutorial, article or something where I could start studying how to achieve something similar to what you did? I know that, probably, you assembled different knowledge from your own experience, but if you can point me to anything that could be helpful, I'll really appreciate it.

    Thanks in advance.
     
  16. richardkettlewell

    richardkettlewell

    Unity Technologies

    Joined:
    Sep 9, 2015
    Posts:
    2,281
  17. marcatore

    marcatore

    Joined:
    May 22, 2015
    Posts:
    160
    @richardkettlewell thank you.
    I've tested it and it seems to be working. Now I need to understand how to position the instances where I want.
     
    richardkettlewell likes this.
  18. marwi

    marwi

    Joined:
    Aug 13, 2014
    Posts:
    138
    Hello @marcatore, thanks. I haven't really thought about it yet. The system should be quite useable for managing a crowd actually because of the recent refactoring (decoupled skinning (with or without pushing animation data to gpu), logic (e.g. with compute shaders) and rendering (aka call to graphics.instancing methods)). Do you have a concrete project you would need the system for or is it rather for research/learning purposes?

    I once collected some links related to gpu/shading in a gitlab snippet here: https://gitlab.com/snippets/1671386 maybe this might be useful for you too :)
     
    marcatore and thelebaron like this.
  19. marcatore

    marcatore

    Joined:
    May 22, 2015
    Posts:
    160
    @marwi thank you very much. Really.

    About your question I have a concrete project.
    In a few words: I'm in a very small team where we're developing a rally sim, and we'll have mainly two kinds of stages - classic rally stages with an open path, and circuit stages with a closed path.
    So the crowd will be less dense in the open-path stages and more dense in the closed ones, positioned on the grandstands.
    The target, given that we're not making a crowd simulator :) , is to have reasonably good animated people near the roads that are not too heavy to render.
    So I think the tools I need could be found in a thread like this... Am I on the right track, or should I look elsewhere? :)

    Thanks in advance for every tips. :)
     
  20. Noisecrime

    Noisecrime

    Joined:
    Apr 7, 2010
    Posts:
    2,051
    Sorry I'm late. I haven't used the system in a project; I got sidetracked with client work. Alas, I'm now recovering from a heart attack, so it's unlikely I'll be able to answer any questions or do more work on this for several months. Hopefully other posters here have/can help.
     
  21. hopeful

    hopeful

    Joined:
    Nov 20, 2013
    Posts:
    5,676
    Best wishes for a speedy and comfortable recovery! I know it can't be a great experience - right? - but I hope for you it can be mostly on the better side.
     
    Noisecrime likes this.
  22. marwi

    marwi

    Joined:
    Aug 13, 2014
    Posts:
    138
    No need to apologize, very sorry to hear that! Wish you a fast and good recovery as well!
     
    Noisecrime likes this.
  23. Danistmein

    Danistmein

    Joined:
    Nov 15, 2018
    Posts:
    82
    Amazing thread, Best wishes for comfortable recovery!
     
  24. Carterryan1990

    Carterryan1990

    Joined:
    Dec 29, 2016
    Posts:
    79
    upload_2019-8-4_18-21-2.png

    Here's mine. They are animated as well. The main character has 6k polygons and was downloaded from Mixamo. As you can see there are 37,000 animated characters and fps is at 700. I've actually gotten over a million with super low-poly characters under 500 polys, all animated. Working out some more stuff, but I plan to share soon!
     
  25. riba78

    riba78

    Joined:
    Feb 16, 2018
    Posts:
    33
    great!!!

    But what happens if you mix different animations?

    Curious to see the final results... and if you want to share the project or the system it will be super useful.
     
  26. Carterryan1990

    Carterryan1990

    Joined:
    Dec 29, 2016
    Posts:
    79
    Oh, they absolutely work. The only downside is that you get an additional draw call per animation, but in the grand scheme it really isn't an issue. So instead of 6 calls for 37k, if you have 10 animations there would be 16 draw calls for 37k. Not a big deal, though I do believe I can fix that, just need time... and a bigger brain XD
     
    FrenzooInfo and hopeful like this.
  27. Carterryan1990

    Carterryan1990

    Joined:
    Dec 29, 2016
    Posts:
    79
    upload_2019-8-5_8-47-14.png
    Upped it to 74k. Noticeable drop in fps, but I think 420 fps XD is still manageable.
     
  28. FrenzooInfo

    FrenzooInfo

    Joined:
    May 2, 2014
    Posts:
    44
    Is the source downloadable somewhere? I would like to learn and see how it would integrate with ECS and also the job system.
    Thanks
     
  29. marwi

    marwi

    Joined:
    Aug 13, 2014
    Posts:
    138
  30. FrenzooInfo

    FrenzooInfo

    Joined:
    May 2, 2014
    Posts:
    44
  31. MintTree117

    MintTree117

    Joined:
    Dec 2, 2018
    Posts:
    340
    Thank you for the informative post! Correct me if I am wrong, but would using baked meshes result in a separate draw call for every mesh? Because if you are using a different mesh for each animation frame, then each frame is a separate draw call?
     
  32. Noisecrime

    Noisecrime

    Joined:
    Apr 7, 2010
    Posts:
    2,051
    Depends on how you call it. Using DrawMesh is likely to result in a draw call per mesh if I remember correctly, so instead you might want to try DrawMeshInstanced. That way, as long as you are using shaders and materials that support instancing, Unity will render multiple instances of the same baked mesh at once. This is also why, when rendering so many entities, you generally try to have a system whereby many of them are sharing the same frame of animation.

    It should be noted that draw calls in themselves are not always bad, and modern GPUs can easily cope with tens of thousands of them a frame; even mobile GPUs can support a good number, likely in the several hundreds or more these days (not really tested that though).

    What you do have to watch out for are setpass calls (in Unity); that's when you change the state of the GPU, such as using a new material or shader. So if you have a single setpass call and then 10,000 draw calls you should be fine, but if you have a setpass call every other draw call then performance is likely to suffer greatly.

    Furthermore, as mentioned in my initial posts, replacing gameObjects with draw calls will always be a win, as Unity no longer has to create and manage all those gameObjects, more so if each gameObject had a MonoBehaviour on it etc.
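    The grouping idea can be sketched like this (names are illustrative, not anyone's actual code): bucket units by which baked animation frame they are currently showing, then issue one DrawMeshInstanced call per bucket, chunked to Unity's 1023-instances-per-call limit.

```csharp
using System.Collections.Generic;
using UnityEngine;

public static class BakedMeshRenderer
{
    // unitsByFrame maps a baked-frame index to the transforms of every
    // unit currently displaying that frame, so each group shares one mesh.
    public static void Draw(Mesh[] bakedFrames, Material mat,
                            Dictionary<int, List<Matrix4x4>> unitsByFrame)
    {
        foreach (var group in unitsByFrame)
        {
            List<Matrix4x4> transforms = group.Value;
            // DrawMeshInstanced accepts at most 1023 instances per call.
            for (int i = 0; i < transforms.Count; i += 1023)
            {
                int count = Mathf.Min(1023, transforms.Count - i);
                Graphics.DrawMeshInstanced(bakedFrames[group.Key], 0, mat,
                    transforms.GetRange(i, count).ToArray(), count);
            }
        }
    }
}
```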
     
    MintTree117 likes this.
  33. MintTree117

    MintTree117

    Joined:
    Dec 2, 2018
    Posts:
    340
    Very interesting, thank you. In your opinion, which option would allow me to animate more characters, baking animations into meshes or into textures? I am just starting learning about animation, so I do not have any reference point. For context, I suspect if I used the baked mesh approach I would have up to around 4000 drawcalls and maybe a hundred set/pass calls.
     
  34. Noisecrime

    Noisecrime

    Joined:
    Apr 7, 2010
    Posts:
    2,051
    Not enough information to make an informed decision. However, neither option is going to work better with high setpass calls; that is a limit that can only be negated through other means, and it's likely what I'd focus on more.

    Depending upon your experience/knowledge, baking animation frames into meshes is probably easier from a getting started/coding point of view. So I'd start with that, then focus on reducing setpass calls, as that knowledge will be transferable between systems.

    To reduce setpass calls you need to reduce material counts and minimize per-material changes. You should find plenty of ideas on the forums for this. However, the approach you go with will be quite project and asset specific, so I'm not sure I can give any more specifics. If you do have to change material settings, look into MaterialPropertyBlocks, which can be used with DrawMeshInstanced. On the whole, reducing setpass calls is more of an asset issue than a code issue, at least at first. Then you can look at creative ways to supply different material properties or state changes via shaders and code.
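    A minimal sketch of the MaterialPropertyBlock route (property and class names are my own; "_Color" assumes the shader declares a matching per-instance property): one material, one block carrying a per-instance colour array, zero extra setpass calls.

```csharp
using UnityEngine;

public static class TintedDraw
{
    static readonly MaterialPropertyBlock props = new MaterialPropertyBlock();

    // Draws up to 1023 instances with one material but a unique tint each.
    // The shader must be instancing-enabled and declare "_Color" as a
    // per-instance property for the array to take effect.
    public static void Draw(Mesh mesh, Material mat,
                            Matrix4x4[] transforms, Vector4[] tints)
    {
        props.SetVectorArray("_Color", tints);
        Graphics.DrawMeshInstanced(mesh, 0, mat, transforms,
                                   transforms.Length, props);
    }
}
```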
     
    Death_Nova and MintTree117 like this.