GPU Instancing

Discussion in '5.4 Beta' started by djarcas, Dec 29, 2015.

  1. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Have you considered moving the tile sheet/map system to a shader?



    And 5.4 should have texture array support!
     
    Last edited: Mar 12, 2016
  2. zeroyao

    zeroyao

    Unity Technologies

    Joined:
    Mar 28, 2013
    Posts:
    169
    I have a custom build based on 5.4 b10 at http://beta.unity3d.com/download/64a40738a323/public_download.html

    In this build there is a new API:

    Code (CSharp):
    public static void DrawMeshInstanced(Mesh mesh, int submeshIndex, Material material, Matrix4x4[] matrices, MaterialPropertyBlock properties, ShadowCastingMode castShadows, bool receiveShadows, int layer, Camera camera)
    on the UnityEngine.Graphics class.

    It draws instances of the specified mesh using an array of transform matrices. Here is some information that may be useful when you use this API:
    * Instances are not culled individually. They are always rendered together, or not at all if the combined bounding box is out of view.
    * You can draw at most 1023 instances. The limit comes from the fact that an array in a MaterialPropertyBlock can hold at most 1023 elements.
    * Accordingly, additional per-instance properties should be provided via arrays in the MaterialPropertyBlock.
    * Other non-instanced properties can also be provided through the MaterialPropertyBlock, but my current implementation has a bug: these properties are submitted only if the number of instances matches the batch size specified in your shader (the batch size defaults to 500 under D3D and 125 under GL; you can change it with #pragma instancing_options maxcount:{number}).
    * Internally the data is copied, so you can modify the matrix array and the MaterialPropertyBlock immediately after the API is called. There are additional memory copies later in the rendering pipeline, but you can avoid them by always drawing a number of instances that matches the batch size specified in the shader.
    * There are a bunch of overloads of this API that supply default values for shadow casting, shadow receiving, camera and rendering layer. Each argument means the same as in the DrawMesh API family.
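    As a rough illustration of the points above, a call against the signature quoted in this post might look like the sketch below. The component name InstancedRockField, its fields and the scatter radius are assumptions made for the example. The material is assumed to use a shader that supports instancing, and passing null for the camera is assumed to behave as it does for DrawMesh. Later builds may change the signature, so treat this as a sketch rather than a reference.

```csharp
using UnityEngine;
using UnityEngine.Rendering;

// Hypothetical component name; attach it to any GameObject in a test scene.
public class InstancedRockField : MonoBehaviour
{
    public Mesh mesh;          // assumed: assigned in the Inspector
    public Material material;  // assumed: uses a shader that supports instancing

    // Matches the default D3D batch size mentioned above; must not exceed 1023.
    const int InstanceCount = 500;

    Matrix4x4[] matrices;
    MaterialPropertyBlock properties;

    void Start()
    {
        matrices = new Matrix4x4[InstanceCount];
        properties = new MaterialPropertyBlock();

        // Scatter the instances with random positions and rotations.
        for (int i = 0; i < matrices.Length; i++)
        {
            Vector3 position = Random.insideUnitSphere * 50f;
            matrices[i] = Matrix4x4.TRS(position, Random.rotation, Vector3.one);
        }
    }

    void Update()
    {
        // Argument order follows the signature quoted in the post.
        // The whole set is culled as one unit, not per instance.
        Graphics.DrawMeshInstanced(mesh, 0, material, matrices, properties,
            ShadowCastingMode.On, true, gameObject.layer, null);
    }
}
```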

    The API design is still under discussion and your feedback will be valuable to us. Also, this is my untested first implementation, so bugs are expected. If you find any that block your testing, please let me know; a repro scene/script would be appreciated.
     
    shaderop, Prodigga, landon912 and 2 others like this.
  3. pointcache

    pointcache

    Joined:
    Sep 22, 2012
    Posts:
    579

    Thank you, this is exactly what I ended up doing, and I get outstanding performance.
     
  4. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Cool could you share with the community?
     
  5. pointcache

    pointcache

    Joined:
    Sep 22, 2012
    Posts:
    579
    Eventually, when it's more polished. I still haven't tested it enough.
     
    MrEsquire likes this.
  6. pvloon

    pvloon

    Joined:
    Oct 5, 2011
    Posts:
    591
    The API sounds great! A few requests though:

    - A "count" parameter, so we could set it up to use only part of the matrix buffer. That would save reallocating the matrix buffer all the time just to fire off the draw.

    - A similar API in CommandBuffer fashion. Instanced deferred decals sound pretty sweet :)
     
    shaderop and mgooding like this.
  7. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Just to confirm 5.4.0b10 does not include a standard instancing shader, yet?
     
  8. Cripple

    Cripple

    Joined:
    Aug 16, 2012
    Posts:
    92
    When do you plan to make it work on Android ?
     
    MrEsquire likes this.
  9. laurentlavigne

    laurentlavigne

    Joined:
    Aug 16, 2012
    Posts:
    6,331
    The API looks great.
    At some point, it would be nice to have automatic splitting into batches of 1023 adjacent instances, as well as an overload that accepts separate TRS values, to offload the Mono backend.

     
  10. djarcas

    djarcas

    Joined:
    Nov 15, 2012
    Posts:
    246
    That's a good start, though it doesn't allow us to have a per-mesh MPB. This means that in my game, I could use it for perhaps 30% of the machines (which is a good start, mind you!)

    So these will be great: https://pbs.twimg.com/media/CJzcuixWEAAQpvQ.png:large as they don't have any per-machine MPB (which is used to modulate the glow). However, these aren't instanceable with that system:

    http://hydra-media.cursecdn.com/for....jpg?version=d8cbca2a1ff2b8feb69b280869568ece
     
  11. djarcas

    djarcas

    Joined:
    Nov 15, 2012
    Posts:
    246
    Despite having used MPBs for the last 3 years, I've actually got no clue what you mean here. Help? :)
     
  12. mgooding

    mgooding

    Joined:
    Mar 6, 2014
    Posts:
    10
    This is great! My only request is to please change Matrix4x4[] to IList<Matrix4x4> so that I can re-use my matrix buffers when changing the number of meshes I'm drawing. Garbage collection will kill off my performance gains otherwise.

    Thank you for your work on this feature - very excited to incorporate it into my application.
     
  13. j_zeitler

    j_zeitler

    Joined:
    Dec 22, 2015
    Posts:
    2
    Nice! Made a small test with a rock model I had lying around. I don't know if I get any ground breaking performance increases but it seems to run at ok frame rates at least.

    (btw, sorry for the potato quality but I feel like that's not our focus right now)


    Scene is:
    1x Dir Light
    1x "Instanceable Object" that holds the rock mesh (250 verts/496 tris)
    Everything else on default (lighting setup etc.)

    then I just scattered a lot of empty objects (18432 to be exact) with ~random transforms around the scene which are not rendered directly but fed into the behaviour of the instanceable object.

    EDIT:
    @zeroyao: Two questions by the way
    1. Is it possible to get hold of the built-in assets for the beta? I'm kinda new to Unity but I can't find them here: https://unity3d.com/get-unity/download/archive
    2. Perhaps this has already been answered somewhere, but do you know when the next beta will be released? I guess this API (or similar) will be included in that. I would really like to try out instancing for a proper build.
     
    Last edited: Mar 16, 2016
  14. ArthurT

    ArthurT

    Joined:
    Oct 26, 2014
    Posts:
    75
  15. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    I've been running Cube Mark for a couple of years and just got scores from 5.4.0b10.

    5.4b10 Running 3 x @ 2560 x 1600 full screen on my PC. (DX 12)



    Using combine static mesh utility to reduce draw calls.



    Using the instancing shader.

    And for comparison with the latest build of 5.3.4

     
  16. zeroyao

    zeroyao

    Unity Technologies

    Joined:
    Mar 28, 2013
    Posts:
    169
    For the count parameter, what exactly are you going to do with it? Are you going to cull your instances manually and copy them to an array?

    CommandBuffer API will definitely be looked at.
     
    mgooding likes this.
  17. zeroyao

    zeroyao

    Unity Technologies

    Joined:
    Mar 28, 2013
    Posts:
    169
    No. Could you tell me what you are going to use this instanced Standard shader for? You don't need any instanced properties? I'll need to talk to my coworkers to figure out what I should do.
     
  18. zeroyao

    zeroyao

    Unity Technologies

    Joined:
    Mar 28, 2013
    Posts:
    169
    Not for 5.4.
     
  19. zeroyao

    zeroyao

    Unity Technologies

    Joined:
    Mar 28, 2013
    Posts:
    169
    The problem is with the MaterialPropertyBlock: if you provide per-instance data through it as arrays, you are limited to 1023 elements. Of course, if you are not going to use the MaterialPropertyBlock (i.e. you don't have any instanced properties) we could do auto-splitting, but that would make the API harder to understand. I think in a later release (5.5 or 5.6) we will introduce some new APIs to let you fill out the constant buffer directly. If such an API is implemented, we might not even need the matrices and MaterialPropertyBlock parameters in this DrawMeshInstanced call. You could then fill the constant buffer, call DrawMeshInstanced with the batch size you want, fill out the constant buffer for the next call, call DrawMeshInstanced again, etc. You would then have full control and could eliminate any unnecessary internal data copies.
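    Until auto-splitting or a constant buffer API exists, the splitting can be done on the calling side. Below is a minimal sketch in plain C#, assuming the 1023 limit described above. BatchRange and InstanceBatcher are hypothetical helper names, not Unity types. Each range would still have to be copied into its own Matrix4x4[] before calling DrawMeshInstanced, because the signature discussed in this thread takes a whole array.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper type: one contiguous slice of the instance list.
public struct BatchRange
{
    public int Start;
    public int Count;

    public BatchRange(int start, int count)
    {
        Start = start;
        Count = count;
    }
}

// Hypothetical helper: splits N instances into ranges of at most 1023.
public static class InstanceBatcher
{
    public const int MaxInstancesPerCall = 1023;

    public static List<BatchRange> Split(int totalInstances, int maxPerBatch = MaxInstancesPerCall)
    {
        if (totalInstances < 0) throw new ArgumentOutOfRangeException("totalInstances");
        if (maxPerBatch <= 0) throw new ArgumentOutOfRangeException("maxPerBatch");

        var ranges = new List<BatchRange>();
        for (int start = 0; start < totalInstances; start += maxPerBatch)
        {
            // The last range holds whatever is left over.
            int count = Math.Min(maxPerBatch, totalInstances - start);
            ranges.Add(new BatchRange(start, count));
        }
        return ranges;
    }
}
```

    For example, the 18432 transforms mentioned earlier in the thread would split into 18 full batches of 1023 plus one final batch of 18.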
     
  20. zeroyao

    zeroyao

    Unity Technologies

    Joined:
    Mar 28, 2013
    Posts:
    169
    I started a private conversation with you several days ago. I'd like to chat with you about your need for a per-mesh MPB. Please check it out.
     
  21. zeroyao

    zeroyao

    Unity Technologies

    Joined:
    Mar 28, 2013
    Posts:
    169
    I still need to do more work to get this into shippable shape before it appears in a future beta release. 2 weeks? 3 weeks? I don't know yet.
     
  22. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    I found the zipped instanced shader, as mentioned in previous post #114 to give it a test run in my Cube Mark app.

    I was using the combine meshes utility in the previous version and tried it in DX12 in the 5.4 beta, found that it is a lot faster than 5.3.4.

    Then I tried it again with the instanced shader, and because I was already using combine meshes the performance degraded. You might want to add a check in combine meshes to see whether the material uses an instanced shader.

    So removed combine meshes and with the instance shaders got a slightly faster result than with combine meshes but only just. (see # 114 above).

    The benchmark simply instances cubes with normal map shaders, once with colliders, without and again but with quads.

    Would it be faster for the without colliders test to use the Draw Mesh Instanced function?

    Also is there a fast way to 'instance' lots of colliders for instanced meshes?
     
  23. Velcrohead

    Velcrohead

    Joined:
    Apr 26, 2014
    Posts:
    78
    Is the Instance shader provided with 5.4? I haven't updated yet but I'm interested to see it.
     
  24. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
  25. mgooding

    mgooding

    Joined:
    Mar 6, 2014
    Posts:
    10
    This is exactly what I'd like to do. Either an IList<Matrix4x4> or a fixed array with a count is acceptable, similar to the SetVertices etc. methods that were added to Mesh in 5.3.

    I would be very happy to never see a Unity API that takes a raw array without a count parameter again. These are terrible for those of us trying to fight the garbage collector.
     
  26. MrEsquire

    MrEsquire

    Joined:
    Nov 5, 2013
    Posts:
    2,712
    On the road-map it says:

    -> OS X and mobile platforms support might come in 5.4.

    Is there any news on this, or has work not even started because you are still trying to get stability on the existing platforms?
     
  27. superpig

    superpig

    Drink more water! Unity Technologies

    Joined:
    Jan 16, 2011
    Posts:
    4,657
    If your scene is static and doesn't need per-object culling (and is thus capable of being combined into a single mesh like that), and you don't mind burning the memory on combining the meshes, then it's usually going to be faster to just do that.
     
  28. pvloon

    pvloon

    Joined:
    Oct 5, 2011
    Posts:
    591
    Imagine trying to build a decal system or similar. With the current API I'd need to reallocate the matrix array whenever the number of decals changes (say, a new bullet hole, which could be really frequent). I would rather allocate one huge array, fill it with the matrices of the decals currently in use, and pass in the count of decals actually used. And yeah, people who need to cull manually for some reason will have the same problem.

    Command buffer API Sounds great :)
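    The "one big array plus a count" idea can be sketched in plain C#. InstancePool<T> is a hypothetical helper, not a Unity type, and it assumes a draw call that accepts a count, which the build in this thread does not offer yet. Removal swaps the last element into the freed slot, so the live entries always occupy indices 0 to Count - 1 and nothing is reallocated.

```csharp
using System;

// Hypothetical helper: a fixed-capacity buffer whose first Count entries are live.
public sealed class InstancePool<T>
{
    private readonly T[] buffer;

    public int Count { get; private set; }

    // The backing array; only indices [0, Count) hold valid entries.
    public T[] Buffer { get { return buffer; } }

    public InstancePool(int capacity)
    {
        if (capacity <= 0) throw new ArgumentOutOfRangeException("capacity");
        buffer = new T[capacity];
    }

    // Returns false instead of growing when the pool is full.
    public bool TryAdd(T item)
    {
        if (Count == buffer.Length) return false;
        buffer[Count++] = item;
        return true;
    }

    // Swap-remove: order is not preserved, but no elements are shifted.
    public void RemoveAt(int index)
    {
        if (index < 0 || index >= Count) throw new ArgumentOutOfRangeException("index");
        Count--;
        buffer[index] = buffer[Count];
        buffer[Count] = default(T);
    }
}
```

    With T set to Matrix4x4, a new bullet hole would be a TryAdd and an expired one a RemoveAt, and the same array would be handed to the draw call every frame together with Count.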
     
    mgooding likes this.
  29. Sebioff

    Sebioff

    Joined:
    Dec 22, 2013
    Posts:
    218
    OS X support was added in beta 8.
     
    MrEsquire likes this.
  30. zeroyao

    zeroyao

    Unity Technologies

    Joined:
    Mar 28, 2013
    Posts:
    169
    Ok I see. Will do.
     
    mgooding likes this.
  31. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    It just adds a set of spinning cubes to a scene (to a spinning root transform) and monitors the fps -> https://arowx.itch.io/unity-cube-mark (WebGL)

    What about adding colliders for instanced meshes? Is there a quick way to do that? I'm noticing a massive difference between the scene with colliders and the one without.
     
  32. Cripple

    Cripple

    Joined:
    Aug 16, 2012
    Posts:
    92
    Do you know if it will be available on all platforms, or are there limitations (GPU, DX or OpenGL version ...)? If there are limitations, how will you handle the fallback?
     
  33. Peter77

    Peter77

    QA Jesus

    Joined:
    Jun 12, 2013
    Posts:
    6,609
    I'd prefer an API where I can specify a startIndex and count. This allows me to use the same pre-allocated array for different things. Only having a count parameter implies that it has to start at index 0, which isn't really that powerful.
     
  34. Michal_

    Michal_

    Joined:
    Jan 14, 2015
    Posts:
    365
    If I understand this right, you use SV_InstanceID and all the per-instance data is stored in a constant buffer. Wouldn't it be better to let us modify a secondary (per-instance) vertex buffer instead of a constant buffer? I mean, a constant buffer is limited to 64KB. That's about 1000 instances if they differ only in world matrix. It will be a lot less in real-world scenarios. A vertex buffer is limited to 128MB if I remember correctly. Not a big deal for sure, just a thought.
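    The estimate above can be checked with a quick calculation, assuming a 64 KB constant buffer, one 4x4 float matrix (64 bytes) per instance, and no padding. The extra float4 in the second line is a hypothetical per-instance property such as a color.

```latex
\frac{64 \times 1024\ \text{bytes}}{4 \times 4 \times 4\ \text{bytes}} = \frac{65536}{64} = 1024\ \text{instances}

\left\lfloor \frac{65536}{64 + 16} \right\rfloor = \left\lfloor 819.2 \right\rfloor = 819\ \text{instances}
```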

    Also, am I doing something wrong, or does instancing not work for unlit shaders + deferred? The Standard instanced shader works OK, and both work in forward rendering.
     
  35. kite3h

    kite3h

    Joined:
    Aug 27, 2012
    Posts:
    197
    My game has wide areas, and the bottleneck is always the cascaded shadow map. I don't care about the G-buffer pass. I want to reduce shadow caster draw calls. Is GPU instancing applied to shadow-only renderers?
     
  36. yuanxing_cai

    yuanxing_cai

    Unity Technologies

    Joined:
    Sep 26, 2014
    Posts:
    335
    Yes. Instancing applies to shadow caster pass too.
     
  37. zeroyao

    zeroyao

    Unity Technologies

    Joined:
    Mar 28, 2013
    Posts:
    169
    In practice I believe reducing the number of draw calls from 5000 to 10 makes a big difference, but reducing it further to 1 by adopting the secondary vertex stream approach wouldn't be worth the effort. That's part of the reason why we are sticking with our current approach, even though we know that a secondary instancing vertex stream would let us achieve even fewer draw calls.

    If you are using instancing with Unity's default render loop, rendering batches are broken for many reasons: z depth bucketing, shader state changes, lighting changes or probe changes. It is really hard to get perfect batching anyway. If you really want them to be batched together, I think the new DrawMeshInstanced API would help you do that, at the expense of the instances not being culled or lit in a sophisticated way.

    As for instancing not working under unlit+deferred, do you have a repro case for us to look at? Thanks in advance.
     
    mgooding likes this.
  38. zeroyao

    zeroyao

    Unity Technologies

    Joined:
    Mar 28, 2013
    Posts:
    169
    We are just short of time to implement instancing on those platforms within the 5.4 time frame. I think in 5.4 we will eventually have instancing on D3D11/D3D12, GL4.1+ and PS4.
    If the platform doesn't support instancing, objects will be batched by dynamic batching or static batching if these two options are enabled.
     
    sqallpl, mgooding and hippocoder like this.
  39. sqallpl

    sqallpl

    Joined:
    Oct 22, 2013
    Posts:
    384
    @zeroyao

    I have a question about instancing, specifically about SpeedTree billboards and objects that use light probes. Do you think instancing/batching for billboards that use light probes will arrive at the same time as SpeedTree billboard instancing? At the moment in 5.3, SpeedTree billboards that do not use light probes (light probe usage turned off) are lit by one static ambient lighting 'snapshot' that is not affected by changes in the ambient light source, so it's impossible to use dynamic ambient lighting with them.

    Do you think we will see instancing for billboards/objects that use light probes in 5.4, or maybe in one of the 5.4.x releases?

    Thanks

    EDIT: I've just noticed this in the instancing docs:
    "Objects that use lightmaps or affected by different light probes / reflection probes can’t be instanced."

    Does this mean that objects/billboards that use light probes can already be instanced, with copies of object X affected by light probe A instanced together and copies of the same object X affected by light probe B instanced together as well, but as two separate 'sets'?
     
    Last edited: Mar 30, 2016
  40. Michal_

    Michal_

    Joined:
    Jan 14, 2015
    Posts:
    365
    Sorry it took me so long but we have abandoned 5.4 for performance reasons. Instancing makes our game even more cpu bound for some reason... Anyway, here's the "unlit+deferred" repro project. There are two scenes, InstancingLit and InstancingUnlit. Instancing works for both scenes when forward rendering is used. And only InstancingLit works with deferred renderer. 5.4b14.
     

    Attached Files:

  41. elbows

    elbows

    Joined:
    Nov 28, 2009
    Posts:
    2,502
    I've become a tad out of date. Where are we with regard to instancing being added to the standard shaders and SpeedTree shaders that are included with 5.4? Thanks.
     
    Arowx likes this.
  42. Noisecrime

    Noisecrime

    Joined:
    Apr 7, 2010
    Posts:
    2,054
    Are you sure about instancing working for both scenes? Testing in b14 I find unlit doesn't instance in either forward or deferred.

    As to performance I concur, it's pretty poor, though hugely better than simply rendering all the objects with a draw call each. GPU usage is pretty low; it's the CPU that gets hammered with the 8000 gameObjects you have in the test. However, I guess that's just how it is: since Unity has to track the gameObjects to instance, it is iterating through all 8000 every frame to build the instance matrix array and any other MaterialPropertyBlock values.

    It still feels rather high though. I wonder if anything can be done to improve it? I wonder if it's having the gameObjects in the scene that is causing a significant overhead?

    I suspect you could get much better performance by writing a custom CommandBuffer or DrawMeshInstanced version that avoids using gameObjects entirely. However, in general terms it wouldn't be quite as straightforward to work with as having gameObject transforms.

    Edit:
    As a quick test I added two methods to auto-rotate each capsule. The first method was to add a MonoBehaviour to each capsule gameObject that rotated it on update; the second was to store an array of transform references to the gameObjects in your init function and rotate the references in its update. Performance dropped from 55 fps to 18 fps and 25 fps respectively.

    This is not really unexpected, though still more of a drop than I would have expected. It suggests to me that in many cases, with so many objects, one might want to perform the rotation in the shader instead of via a transform. For example, if you were making an asteroid field or something.

    I can't help but wonder if what is really needed is an API to control the internal Unity instancing. Some means of, say, overriding Unity so it doesn't iterate through all the gameObjects and rebuild the matrix array each time, but instead allows developers to feed it the matrix array? Or of providing a list of gameObjects you want instanced, if perhaps that could reduce some of the overhead?

    I'm not sure. While it sounds attractive, it might just make more sense all round for developers to simply write their own custom instancing system and shaders if they are running into performance problems with Unity's generic version.
     
    Last edited: Apr 13, 2016
    guycalledfrank likes this.
  43. Michal_

    Michal_

    Joined:
    Jan 14, 2015
    Posts:
    365
    Did you change the rendering mode directly on the camera? It probably isn't set to "Use Player Settings", so changing it in player settings does nothing. It works in forward for me (according to Frame Debugger).

    The performance increase in the test scene is nice. Nice for a more or less single-threaded renderer anyway. I mean, I would be happy with this kind of improvement, but we can't get anywhere near that in our game (a game with hundreds of trees, rocks, bricks etc). I didn't perform the benchmark myself, but I'm told it runs worse than the same version without instancing. I'm going to have to investigate it when I have time.
     
  44. Noisecrime

    Noisecrime

    Joined:
    Apr 7, 2010
    Posts:
    2,054
    Ah, that's it, didn't realize you'd changed it on the camera. Yeah, unlit works in forward.

    How are you getting good performance normally? As I said, I suspect for specialized cases you'll have to get more creative and not rely on Unity's generalized internal method. Though we'll have to wait for DrawMeshInstanced() to be implemented before we have a 'relatively' straightforward way to test. You could do it now, but only via DrawProcedural(), which means converting the mesh into ComputeBuffers and writing dedicated shaders.
     
  45. Michal_

    Michal_

    Joined:
    Jan 14, 2015
    Posts:
    365
    We aren't :) It is an open world sandbox game where user can build pretty much anything. Draw call count can get astronomical at times. I was hoping instancing will help us but I'm told that's not the case. I'm assigned to a different project atm but I plan to get back soon and do proper profiling to see why instancing is so slow. Hopefully it is just a bug and it can get better.

    Yeah, the more low level access we get the better. Custom instancing would obviously work the best. We'll see how it evolves.
     
  46. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    I suspect it's the actual physics colliders, causing a massive reduction in instancing benefits.

    Try Cube Mark (link in sig) and note that I run 3 tests: one without colliders that gets a massive cube count, another with colliders that gets far lower performance, and a third that draws quads.

    It's unfortunate that the PhysX system in Unity does not have the hardware acceleration mode available, although my understanding is that it works better or only on Nvidia hardware.

    Potentially if you combine colliders or reduce their number this should speed things up a bit.
     
  47. hippocoder

    hippocoder

    Digital Ape

    Joined:
    Apr 11, 2010
    Posts:
    29,723
    Pretty sure @Michal_ is comparing the same scene with and without instancing.
     
  48. elbows

    elbows

    Joined:
    Nov 28, 2009
    Posts:
    2,502
    I hope nobody minds me asking this again.
     
    wightwhale and hippocoder like this.
  49. hippocoder

    hippocoder

    Digital Ape

    Joined:
    Apr 11, 2010
    Posts:
    29,723
  50. elbows

    elbows

    Joined:
    Nov 28, 2009
    Posts:
    2,502
    I note that there is still no real answer to my question, or even a clue, and it's been about a month since someone from Unity posted in this thread. This makes me very sad. What happened?