Native ComputeBuffer to Mesh (or better?)

Discussion in 'Shaders' started by EmmaEwert, Apr 28, 2015.

  1. EmmaEwert

    EmmaEwert

    Joined:
    Mar 13, 2014
    Posts:
    30
    I have been toying around with pushing as many points or triangles as possible in DX11 mode, in an attempt to simulate a tiny galaxy.

    I am quite pleased with the ~16-20 million billboarded sprites I am currently able to render on my measly GTX 770. It takes a few seconds to generate that on the CPU initially, though, so I now populate a ComputeBuffer of Vector3s through a ComputeShader.

    However, while I can render 20 million billboards in about 16 milliseconds using Meshes, I would very much like to animate the vertex positions without having to do any significant amount of work on the CPU, for obvious reasons.

    I looked into Graphics.DrawProcedural, and while it does exactly what I want by rendering the vertices of the ComputeBuffer directly - presumably with less overhead than a full-blown Mesh, as well - it seems that in my very ill-controlled testing, I am now only able to render the before-mentioned 16 million billboards in ~25 milliseconds. Rendering points is even slower, and I was never able to control the size of the points - this is a requirement.
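
    To compare the two timings quoted in this post on equal footing, it can help to normalise them to milliseconds per million billboards. A minimal sketch using only the figures given here (the function name is made up for illustration, and the measurements themselves were described as ill-controlled):

```python
def ms_per_million(frame_ms, billboards):
    """Frame time normalised to one million billboards."""
    return frame_ms / (billboards / 1_000_000)

# Figures quoted in the post: 20 million billboards in ~16 ms with Meshes,
# 16 million billboards in ~25 ms with DrawProcedural.
mesh_cost = ms_per_million(16, 20_000_000)        # 0.8 ms per million
procedural_cost = ms_per_million(25, 16_000_000)  # 1.5625 ms per million

slowdown = procedural_cost / mesh_cost
print(f"{slowdown:.2f}x")  # prints "1.95x"
```

    By this rough measure, the DrawProcedural path costs about twice as much per billboard as the Mesh path.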

    The Meshes were built through a simple process of assigning 65535 vertices in each of 256+ Meshes, and exploding these vertices into billboards through a Geometry shader at render time.
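
    The split into Meshes described above can be sketched as a simple range calculation. This is plain Python rather than Unity C#, and the helper name is hypothetical; it only shows how the point count maps onto Meshes of at most 65535 vertices each:

```python
MAX_VERTICES_PER_MESH = 65535  # per-Mesh vertex count used in this thread

def split_into_meshes(total_points, limit=MAX_VERTICES_PER_MESH):
    """Return one (start, count) range per Mesh, covering all points."""
    ranges = []
    start = 0
    while start < total_points:
        count = min(limit, total_points - start)
        ranges.append((start, count))
        start += count
    return ranges

chunks = split_into_meshes(16_776_960)
print(len(chunks))  # prints 256
```

    Each range would become one Mesh of single-vertex points, which the Geometry shader then expands into billboards.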

    I'm not quite sure why rendering 256+ Meshes is faster than rendering a single 16-million vertex buffer - something with buffer indexing might have some constant overhead, perhaps?

    My questions then become:
    • Am I doing something wrong with DrawProcedural or ComputeBuffers in general that causes worse performance than rendering hundreds of Meshes?
    • If DrawProcedural is inherently less performant than rendering Meshes (and their assumed associated VBOs), is there a way to update the Mesh vertices from a ComputeBuffer without a trip through the CPU courtesy of ComputeBuffer#GetData?
    • Is it expected behaviour that rendering large numbers of points is as slow as - or even slower than - rendering the same number of billboards (and thus four times the number of vertices)?
    • Are there more performant ways than DrawProcedural and ComputeShader+ComputeBuffer for updating and rendering millions of vertex positions in realtime?
    Finally, any and all suggestions to increase the galaxy performance, star count - or both - while allowing animation will be greatly appreciated.

    I am more than willing to share any and all contextual code, but it's all pretty straightforward. That being said, let me know which parts might be helpful to look at, and I will happily paste them.


    For the curious, an image of most of the 16,776,960 stars can be found here.
     
  2. Zicandar

    Zicandar

    Joined:
    Feb 10, 2014
    Posts:
    388
  3. Zicandar

    Zicandar

    Joined:
    Feb 10, 2014
    Posts:
    388
  4. EmmaEwert

    EmmaEwert

    Joined:
    Mar 13, 2014
    Posts:
    30
    Hi Zicandar, thanks for referencing TC particles, that seems like an interesting project. Not exactly what I am looking for, but very close.

    I will definitely have a look at it, but $85 seems a bit much to pay if I am nearly where I want already - especially considering TC particles touts "hundreds of thousands, if not millions of particles" while I am currently doing tens of millions of particles myself (albeit without motion, yet). Furthermore, it seems opaque particles are required to reach the millions, and translucent/additive particles can't reach that in TC particles - my 20 million are already additive.

    That being said, while Geometry shaders might not be as optimised as the plain vertex/fragment duo, I did figure them to be at least faster than pushing four times as many vertices through the vertex shader (one for each vertex of a billboard, as opposed to just a single one for each billboard).

    Not to mention animating 20 million vertices has to be faster than animating the 80 million vertices required for 20 million billboards without using geometry shaders.

    Unity only supports 65535 vertices per Mesh, hence why I split the galaxy into 256+ meshes initially. DrawProcedural and the sister method DrawProceduralIndirect are - as far as I can tell - not limited in this way, as they reference a ComputeBuffer that doesn't have an index buffer.

    Regarding Geometry shaders being slow - I may not have been clear about this in the original post, but both my approaches towards 16 million particles use a Geometry shader. I am just confused as to why DrawProcedural is noticeably slower, and why either approach with 16 million points is slower than with 16 million billboards (each being 4 vertices).

    Essentially, the Geometry shader has to be four times as slow as not using it for there to be any gains in avoiding it. That, and each particle will take up four times as much VRAM when each billboard corner is stored: about 0.96 GB of VRAM for 20 million billboards, severely limiting the theoretical maximum of my video card to ~40 million stars, with no VRAM to spare for the rest of the scene.
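
    The memory estimate above can be reproduced with a little arithmetic. This sketch assumes each vertex stores only a position as a Vector3 of three 32-bit floats (12 bytes), and the 2 GiB card size in the last line is an assumption, not a figure from the thread:

```python
BYTES_PER_VERTEX = 3 * 4  # assumed: one Vector3 of 32-bit floats per vertex

def position_buffer_bytes(stars, vertices_per_star):
    """Bytes needed to store positions only."""
    return stars * vertices_per_star * BYTES_PER_VERTEX

def max_stars(vram_bytes, vertices_per_star):
    """Upper bound on stars if positions filled all of VRAM."""
    return vram_bytes // (vertices_per_star * BYTES_PER_VERTEX)

quads = position_buffer_bytes(20_000_000, 4)   # 960,000,000 bytes
points = position_buffer_bytes(20_000_000, 1)  # 240,000,000 bytes

print(quads / 1e9)                # 0.96 (decimal gigabytes)
print(round(quads / 2**30, 2))    # 0.89 (GiB)
print(max_stars(2 * 2**30, 4))    # stars that fit on a hypothetical 2 GiB card
```

    Storing one point per star and expanding it on the GPU keeps the buffer at a quarter of that size.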

    I will definitely give your suggestions a go, though. As I understand it, I have these variations to try out, based on your suggestions:
    • Building a ComputeBuffer with 80 million vertices (20 million stars), thus skipping the Geometry shader and rotating them to face the camera in a vertex shader instead.
    • Using DrawProceduralIndirect; I have not yet given that a go as I can't immediately see a potential performance gain from using it in this project.
    • Having 1024+ Meshes, again skipping the Geometry shader and rotating billboards in the vertex shader.
    • Trying to achieve adequate visual fidelity with opaque (as opposed to additive) particles.
    Thanks again for your suggestions, they give me some approaches to try!

    I will report back with any findings.
     
    Last edited: Apr 28, 2015
  5. Zicandar

    Zicandar

    Joined:
    Feb 10, 2014
    Posts:
    388
    Sadly, it wouldn't surprise me if Geometry shaders were 4x slower, because far too little effort has been put into optimizing them.
    And the sources I have heard say "Avoid using geometry shaders unless you really really need them" are people from places like Microsoft and similar. (Last year I was at one of the few non-NDA conferences about the upcoming shader stuff (DX12, Mantle, PS4, Xbox One), and everyone who talked about optimization said so, sadly.)
    In your case it might still be worth using them, but switching away could be worth a try?

    But in general, I'm really confused as to what is happening to you!
    Have you submitted the bug report to unity?

    As for TC Particles being slower, have you considered that they might be sorting and doing some other work, not to mention targeting less capable computers/GPUs that have a LOT of stress from other parts? (Also, your particles seem not to overlap much, and the cost of additive blending in general depends on how much of the screen needs it.)
     
  6. EmmaEwert

    EmmaEwert

    Joined:
    Mar 13, 2014
    Posts:
    30
    Skipping the Geometry shader would mean using four times as much video memory - about 1 GiB, in fact. I gave the approach of not using the Geometry shader a go anyway. It was noticeably less performant, and of course took roughly four times as long to build Meshes with, as well.

    I am not considering submitting a bug report, because frankly I don't think the behaviour I am experiencing is a bug. I am just curious as to the reasons why DrawProcedural is slower than drawing hundreds of Meshes.

    As for TC Particles - I was in no way implying that their work is less valuable, just that whatever they're doing that drops the number of particles to single-digit millions means the system is of less use to me, presently.

    None of the four approaches I mentioned in my previous post proved fruitful by the way, sadly.
     
  7. Noisecrime

    Noisecrime

    Joined:
    Apr 7, 2010
    Posts:
    2,054
    While I can't offer any advice on improving your situation, it would seem worth trying to figure out why you get worse performance using DrawProcedural. So perhaps it might be useful to check the Unity Frame Debugger and, better yet, the Graphics Debugger in VS2013 (the Community edition is free), which is a replacement for PIX. The Graphics Debugger will provide a wealth of information about what is happening on the GPU and may provide a hint as to the performance difference, and maybe then you can devise a better solution.

    One thought: how are you defining the ComputeBuffer? Since there are different types, perhaps the choice is adversely affecting performance?
     
  8. andSol

    andSol

    Joined:
    May 8, 2016
    Posts:
    22
    I am facing the exact same problem as you were, @Emma Ewert. I even posted a question before seeing yours, but didn't get any replies (http://forum.unity3d.com/threads/in...-vertices-of-a-point-mesh-geom-shader.411388/). Did you manage to find more information on this?

    Perhaps @Aras could drop us some knowledge on the subject? I don't understand why using DrawProcedural to draw a mesh directly from within the GPU can be so much slower than going through all the CPU-GPU communication. It's not a problem of using a geometry shader, since the OP's example (and also mine) compares using DrawProcedural against using pre-made meshes' vertices, with both using Geometry shaders equally.

    This is really troublesome and is blocking progress of my project at this moment.
     
    coidevoid and jason-fisher like this.
  9. BradZoob

    BradZoob

    Joined:
    Feb 12, 2014
    Posts:
    66
    Necro, but I stumbled upon this because Google doesn't like to throw anything away, and the scant docs online make this incredibly painful to draw out. So, for the next weary Compute traveller: use DrawProceduralIndirect, but make sure you use the full parameter list, or you'll get hit with a false positive from the Unity debugger saying that the method is deprecated. Only the overload with the shorter parameter list is deprecated, and the full version can be run in Update(), whereas DrawProceduralIndirectNow must be run in the post-render block (OnRenderObject), which brings all sorts of fancy problems with it. To see the parameters, just follow the call through to the extracted class file and go through the various overloads. I think Unity might be the only engine that actually intentionally exposes Compute, the death star of gamedev tools, so I can hardly be mad about shockingly scant documentation :p
     
    R0man likes this.
  10. BoltScripts

    BoltScripts

    Joined:
    Feb 12, 2015
    Posts:
    20
    Double necro, but I dunno why BradZoob suggested using DrawProceduralIndirect; it doesn't run any faster. It's exactly the same performance as DrawProcedural, only you'll waste your time dealing with annoying argument buffers and stuff trying to test it. Fact of the matter is, literally all forms of draw procedural are just a good 35% slower for no reason, and there's nothing you can do about it. So if you care about performance, you're stuck feeling stupid and using memory-inefficient hacks to use the non-procedural versions as a workaround.
    I'll eat my shoe if you can prove otherwise.
     
    flyer19 likes this.
  11. flyer19

    flyer19

    Joined:
    Aug 26, 2016
    Posts:
    126
    It is still slow now. So how can we do GPU-driven rendering with Unity? The performance is so bad.
     
  12. ArminRigo

    ArminRigo

    Joined:
    May 19, 2017
    Posts:
    20
    FWIW, the following approach gave me great performance on NVidia GPUs as well as very-high-end mobile GPUs (Adreno 650 from Quest 2; it didn't work so well on Adreno 540 from Quest 1).

    The exact shader I'm using is doing something more complex, but I'm fairly sure it would work just as well in this simpler case: write the shader as you would for DrawProcedural(), with a vertex shader that only takes a single argument, `uint vertex_id : SV_VertexID`. Write no geometry shader. The vertex shader produces all four corners of each quad one at a time, by reading from a StructuredBuffer that contains the star's position at index `vertex_id / 4`, and adding a small offset depending on the remaining two bits of vertex_id. Maybe you know all about this as you already tried it (or maybe you only tried with a separate Geometry Shader pass---in that case, that's the trick and you can stop reading now!)
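
    The index arithmetic described above can be sketched outside of a shader. This is plain Python rather than HLSL, and the corner layout (which bit maps to which axis) and the function name are assumptions for illustration, not the actual shader:

```python
def decode_vertex(vertex_id):
    """Split a vertex id into (star index, corner offset)."""
    star_index = vertex_id // 4  # four consecutive ids share one star
    corner = vertex_id & 3       # the remaining two bits select the corner
    # Assumed layout: bit 0 picks left/right, bit 1 picks bottom/top.
    offset_x = (corner & 1) - 0.5
    offset_y = (corner >> 1) - 0.5
    return star_index, (offset_x, offset_y)

# Vertex ids 4..7 all belong to star 1, one id per corner.
for vertex_id in range(4, 8):
    print(vertex_id, decode_vertex(vertex_id))
```

    In a real vertex shader the offset would be scaled by the sprite size and applied along the camera's right and up axes, and depending on the topology the corner order may need to change so that each quad winds correctly.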

    However, to call this shader, instead of using DrawProcedural(), you can make a Mesh with no positions or any other vertex information. At least in modern Unity, that's ensured with `mesh.SetVertexBufferParams(count /*and no additional arguments*/)`. Maybe it's actually enough to have a plain mesh with no special tricks---it will be using a shader with no declared input arguments in the vertex shader anyway, so the Unity logic shouldn't upload anything. And then you render this mesh normally, or with `Graphics.DrawMesh()` or something else. This approach makes DrawProcedural() basically unnecessary, and it gave me great performance, whereas people generally say that DrawProcedural() is slow...

    I can only speculate as to why, or if my experience can be reproduced in different settings. I would be interested to know if you try it!
     
    Last edited: Feb 14, 2024
  13. flyer19

    flyer19

    Joined:
    Aug 26, 2016
    Posts:
    126
    Last edited: Feb 27, 2024