Speed up the ComputeShader.GetData() function?

Discussion in 'Shaders' started by LordTyrion, Jun 8, 2017.

  1. LordTyrion

    LordTyrion

    Joined:
    Jun 5, 2017
    Posts:
    9
    I've been working on Unity and Compute Shaders for the past month, currently trying to implement GPU Flocking using Compute Shaders. I've written the C# script, but when I executed it I noticed that the GetData function takes a huge amount of time to run. This time increases as the number of fish increases. I've read through the other threads on this. The problem, as I understand it, is that the CPU is forced to wait for the GPU to return the data. Unfortunately, I wasn't able to find a solution to this problem, and the last thread was created over a year ago.
    Has there been any solution since then?
     
    radiantboy likes this.
  2. bgolus

    bgolus

    Joined:
    Dec 7, 2012
    Posts:
    12,348
    The main solution is: don't use GetData(). If you're using compute shaders to do something then you should never need that data back on the CPU again. For something like flocking you would want to use the positions that are calculated to drive a DrawMeshInstancedIndirect.

    The other option: I believe someone posted an asynchronous GetData on the forums someplace, which will let you dispatch to the GPU and get the data back a few frames later.
     
    OhneHerz, ghostatspirit and dr0r like this.
  3. LordTyrion

    LordTyrion

    Joined:
    Jun 5, 2017
    Posts:
    9
    Thanks @bgolus for the help. I managed to implement the plugin and increase the FPS, but it's still not faster than the CPU. I wanted to try DrawMeshInstancedIndirect but have no idea how. I have the values and positions in the compute buffer. How do I use that for DrawMeshInstancedIndirect?
     
  4. bgolus

    bgolus

    Joined:
    Dec 7, 2012
    Posts:
    12,348
  5. LordTyrion

    LordTyrion

    Joined:
    Jun 5, 2017
    Posts:
    9
  6. bgolus

    bgolus

    Joined:
    Dec 7, 2012
    Posts:
    12,348
    It is explained there in the documentation, but it's easy to miss.

    The buffer you have of positions isn't the buffer that function wants. The buffer DrawMeshInstancedIndirect wants is the args buffer, which is explained as:
    In lay terms, the first number is effectively the number of triangles in the mesh multiplied by 3 (just use GetIndexCount(0) like in the example), the second number is how many instances you want to render, and the last 3 numbers are offsets for where you start rendering from (i.e. if you want to skip the first few triangles of the mesh, or want to skip the first few instances). For the most part you'll likely just want to keep those set to zero.

    The positions buffer that you have can be set up any way you want, and it's up to the shader to parse that data into something useful. If you look at the example code, it sets a separate positions buffer on the material, then the shader uses the data in the positions buffer to construct a world transform matrix.
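
    A minimal sketch of filling the args buffer described above from C#. The class and method names (IndirectArgs, Create) are hypothetical, and it assumes the whole of submesh 0 is drawn with all three offsets left at zero.

    ```csharp
    using UnityEngine;

    // Hypothetical helper, not part of Unity's API.
    public static class IndirectArgs
    {
        // Builds the five-uint argument buffer that DrawMeshInstancedIndirect reads.
        public static ComputeBuffer Create(Mesh mesh, int instanceCount)
        {
            uint[] args = new uint[5];
            args[0] = mesh.GetIndexCount(0);  // index count per instance (triangles * 3)
            args[1] = (uint)instanceCount;    // how many instances to render
            args[2] = 0;                      // start index offset (assumed zero)
            args[3] = 0;                      // base vertex offset (assumed zero)
            args[4] = 0;                      // start instance offset (assumed zero)

            var buffer = new ComputeBuffer(
                1, args.Length * sizeof(uint), ComputeBufferType.IndirectArguments);
            buffer.SetData(args);
            return buffer; // caller should Release() it when finished
        }
    }
    ```

    The returned buffer is what gets passed as the bufferWithArgs parameter of Graphics.DrawMeshInstancedIndirect; the positions buffer is set on the material separately.
    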
     
  7. LordTyrion

    LordTyrion

    Joined:
    Jun 5, 2017
    Posts:
    9
    I think I get it now. We have one compute buffer for the arguments. In the vertex and surface shaders, we pass the values into a world transform matrix. This shader is attached to the material, right? And the mesh is transformed based on those values. Is that right?

    What do we do for Compute Shaders then? Is there an equivalent World Transform matrix for compute shaders?
     
  8. bgolus

    bgolus

    Joined:
    Dec 7, 2012
    Posts:
    12,348
    Run the compute shader as you currently do and have it fill a buffer with a bunch of Vector3 / float3 positions. Then set that same buffer on the material, call DrawMeshInstancedIndirect and read from it in the shader using unity_InstanceID as the array index.
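
    The flow described above might look roughly like this on the C# side. This is a sketch under assumptions, not code from the thread: the kernel name "CSMain", the buffer name "_Positions", the "_NumAgents" property, the group size of 64 and the bounds are all hypothetical, and the material's shader is assumed to declare a matching StructuredBuffer<float3> and index it with unity_InstanceID.

    ```csharp
    using UnityEngine;

    // Hypothetical driver component; every shader-side name here is an assumption.
    public class FlockDriver : MonoBehaviour
    {
        public ComputeShader flockShader;
        public Material fishMaterial;
        public Mesh fishMesh;
        public int fishCount = 1024;

        const int ThreadsPerGroup = 64; // must match [numthreads(64,1,1)] in the kernel

        ComputeBuffer positions;
        ComputeBuffer args;
        int kernel;

        void Start()
        {
            // One float3 (12 bytes) per fish, seeded with random start positions.
            var initial = new Vector3[fishCount];
            for (int i = 0; i < fishCount; i++) initial[i] = Random.insideUnitSphere * 10f;
            positions = new ComputeBuffer(fishCount, sizeof(float) * 3);
            positions.SetData(initial);

            args = new ComputeBuffer(1, 5 * sizeof(uint), ComputeBufferType.IndirectArguments);
            args.SetData(new uint[] { fishMesh.GetIndexCount(0), (uint)fishCount, 0, 0, 0 });

            kernel = flockShader.FindKernel("CSMain");
            // The same buffer is bound to both the compute kernel and the render material.
            flockShader.SetBuffer(kernel, "_Positions", positions);
            fishMaterial.SetBuffer("_Positions", positions);
        }

        void Update()
        {
            flockShader.SetInt("_NumAgents", fishCount);
            int groups = (fishCount + ThreadsPerGroup - 1) / ThreadsPerGroup;
            flockShader.Dispatch(kernel, groups, 1, 1); // a single dispatch per frame

            var bounds = new Bounds(Vector3.zero, Vector3.one * 100f); // assumed culling volume
            Graphics.DrawMeshInstancedIndirect(fishMesh, 0, fishMaterial, bounds, args);
        }

        void OnDestroy()
        {
            if (positions != null) positions.Release();
            if (args != null) args.Release();
        }
    }
    ```

    Note that nothing here calls GetData(): the positions never leave the GPU, so the CPU has no readback to wait on.
    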
     
  9. LordTyrion

    LordTyrion

    Joined:
    Jun 5, 2017
    Posts:
    9
    That makes sense. Thanks! But I'm still experiencing a huge lag due to gfx.WaitForPresent. I think I figured out why, though. I know it's not ideal, but I call the compute shader 'n' times, n being the number of fish. In the compute shader I have another for loop running n times. This essentially gives n^2 complexity, so the cost grows quadratically as the number of fish increases.
    Is there any other method for me to implement GPU Flocking? I tried the plugin and the GPU instancing but both of them are still slower than the CPU Flocking.
     
  10. bgolus

    bgolus

    Joined:
    Dec 7, 2012
    Posts:
    12,348
    You mean from C# you call Dispatch() for each fish? No wonder that's so slow! You want to call Dispatch() as few times as possible per frame (ideally once!) and let the compute shader crunch through as much data in parallel as possible. I can't offer a ton of help with compute shaders though, as I'm kind of terrible with them still.
     
  11. LordTyrion

    LordTyrion

    Joined:
    Jun 5, 2017
    Posts:
    9
    I need to calculate the distance between one fish and every other one each frame to find its new position, and then repeat the process for the other fish. Is it possible to do that in one dispatch? If I put it in one dispatch then I will have to put 2 for loops in the compute shader. Will that still cause a time lag? Any advice is welcome.
     
    Last edited: Jun 16, 2017
  12. bgolus

    bgolus

    Joined:
    Dec 7, 2012
    Posts:
    12,348
    It'll be faster than calling Dispatch several times, unless you're really doing something else wrong.
     
  13. LordTyrion

    LordTyrion

    Joined:
    Jun 5, 2017
    Posts:
    9
    I will look into that, thanks @bgolus! On a different note, since I'm using the GPU for both rendering and computing, when I look at the GPU Profiler I notice that Camera.Render is called every other frame, i.e. the GPU renders on one frame and computes on the next. Because of this, the application looks choppy rather than smooth. Additionally, whenever the GPU starts rendering, the CPU has a huge spike due to something called gfx.WaitForPresent. I know this is because the CPU waits for the GPU to finish rendering to get the computation value. My question is, is there any way to ensure both rendering and computing happen every frame? This would fix both the choppiness and the time spike.
     
  14. bgolus

    bgolus

    Joined:
    Dec 7, 2012
    Posts:
    12,348
    Compute and render should be happening in the same frame, if they're not it might be because you're stalling the GPU with your compute shader. Honestly I highly suspect your compute shader is just "doing it wrong" since there are examples of GPU flocking using WebGL that can do thousands of agents at 60fps ... on a 4 year old iPad. Writing a compute shader the same way you would write code for a CPU is likely going to be highly inefficient and you tend to have to think about memory usage patterns and multi-threading to do it right (things I am bad at), otherwise you'll just spin the GPU.

    For example, if your compute shader just has a single loop of all the agents with a loop of all the agents inside, you've messed up. At the very least it should be a loop of some fraction of the agents that you're running multiple versions of in parallel.

    Something like:

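    [numthreads(16,1,1)]
    ...
    // id is the kernel's uint3 parameter tagged with the SV_DispatchThreadID semantic;
    // this assumes a single group of 16 threads is dispatched
    int iterCount = (_NumAgents + 15) / 16; // integer ceiling division
    int iterStart = id.x * iterCount;
    int iterEnd = min(_NumAgents, iterStart + iterCount);
    for (int i = iterStart; i < iterEnd; i++)
    // do code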
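
    To see how that split behaves, the same range arithmetic can be checked on the CPU. This is only an illustrative sketch: RangePartition and GetRange are hypothetical names, and on the GPU the thread index would come from SV_DispatchThreadID rather than a parameter.

    ```csharp
    using System;

    // Hypothetical helper mirroring the per-thread range calculation above.
    public static class RangePartition
    {
        // Computes the half-open range [start, end) of agents that one thread handles.
        public static void GetRange(int numAgents, int threadCount, int threadId,
                                    out int start, out int end)
        {
            // Integer ceiling division, so every agent is covered even when
            // numAgents is not a multiple of threadCount.
            int perThread = (numAgents + threadCount - 1) / threadCount;
            start = threadId * perThread;
            end = Math.Min(numAgents, start + perThread);
        }
    }
    ```

    With 100 agents and 16 threads, each thread takes up to 7 agents: thread 0 covers agents 0 to 6, thread 14 covers agents 98 and 99, and thread 15 gets a start of 105 with an end of 100, so its loop body never runs.
    
    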
     
    Last edited: Jun 17, 2017
  15. LordTyrion

    LordTyrion

    Joined:
    Jun 5, 2017
    Posts:
    9
    This is exactly what I needed!!!! Thank you @bgolus !!!
    One more quick thing... What do you know about how the Compute Shader works?
    I've looked everywhere and still can't understand what to pass to the Dispatch function, what numthreads values to use, and what SV_DispatchThreadID is... Any help?

    Additionally, I read somewhere that I can sort the fish positions before calculating their next positions which will reduce the n^2 complexity. Thoughts?
     
    Last edited: Jun 17, 2017
  16. bgolus

    bgolus

    Joined:
    Dec 7, 2012
    Posts:
    12,348
    A little secret ... as far as I can tell nobody knows what to put in for the values of Dispatch or [numthreads(#,#,#)]. It's not that people don't know what they do, but rather every time I read about those numbers I usually see a comment about "we tried these numbers initially because we thought they would work best, but then we tried these other numbers and they worked better, and we don't really know why." And these are really smart people who know this stuff inside and out. Basically know it's okay to try a bunch of different ways of doing it to try to figure out what's faster, and just be okay with knowing it might not be faster for everyone.

    As for what they are, they're ways of setting up your multi-threading. [numthreads(#,#,#)] sets how many threads run in each thread group, and Dispatch sets how many of those groups to launch. So 1,1,1 means run one thread per group. 16,1,1 means run 16 threads, 4,4,1 means run 16 threads as well (4x4x1), and 4,2,2 also means run 16 threads (4x2x2). How you determine what numbers to use usually depends on what kind of data you're trying to work on and how you want to break up that data. For instance, instead of splitting the agents up like in my example above, you could just set the total number of threads to the number of agents you have and have your compute shader have only one loop, as the dispatch will try to compute all of the agents at the same time.
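
    One way to relate the two sets of numbers from C# is to ask the kernel for its [numthreads] values and derive the group count from the item count. This is a sketch under assumptions: DispatchHelper and DispatchFor are hypothetical names, and it only spreads work along the X dimension.

    ```csharp
    using UnityEngine;

    // Hypothetical helper, not part of Unity's API.
    public static class DispatchHelper
    {
        // Launches enough thread groups along X to cover itemCount items.
        public static void DispatchFor(ComputeShader shader, int kernel, int itemCount)
        {
            uint x, y, z;
            // Reads back the [numthreads(x,y,z)] declared on the kernel.
            shader.GetKernelThreadGroupSizes(kernel, out x, out y, out z);

            int threadsPerGroup = (int)x;
            // Total threads = groups * threadsPerGroup, rounded up to cover every item.
            int groups = (itemCount + threadsPerGroup - 1) / threadsPerGroup;
            shader.Dispatch(kernel, groups, 1, 1);
        }
    }
    ```

    Because the group count is rounded up, the last group can contain threads with an index past the final item, so the kernel would need a guard that returns early when its thread index is not less than the item count.
    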

    As for what SV_DispatchThreadID is, and how to actually use it (since my example pseudo code for sure won't work) I would suggest you look it up on Microsoft's site, or search for some tutorials like this one:

    http://kylehalladay.com/blog/tutorial/2014/06/27/Compute-Shaders-Are-Nifty.html

    Sorting the data first might be good too, but sorting is hard in compute shaders ... try looking at some other GPGPU or compute flocking examples that are out there (there are plenty) and try to see what tricks they're doing.
     
  17. LordTyrion

    LordTyrion

    Joined:
    Jun 5, 2017
    Posts:
    9
    Thank you @bgolus for all your help! GPU Flocking is finally faster. I implemented the pseudo code you gave me with some slight tweaking and managed to get a huge increase in the FPS. Once again, thank you!
     
    bgolus likes this.
  18. kumayu

    kumayu

    Joined:
    Oct 22, 2015
    Posts:
    4
    I need some help.
    Can you send me a simple sample project that runs fast?
     
  19. ghostatspirit

    ghostatspirit

    Joined:
    Oct 22, 2017
    Posts:
    1
    Just for anyone who also wants to speed up ComputeShader.GetData() / CommandBuffer.GetData(): if you are on Unity 2018.2 or above, you can try the new AsyncGPUReadback.Request() to send an asynchronous request to the GPU.

    This new method is pretty handy in my use case. I use it for reading back the position and rotation of rigidbodies from my GPU-based physics system. I transferred about 8 KB of data every frame and measured the latency of each request. Most requests only take 1 frame to finish, which is pretty acceptable for my project.
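
    A minimal sketch of such a readback, assuming the buffer holds one Vector3 per element and that AsyncGPUReadback lives in the UnityEngine.Rendering namespace in the Unity version being used. PositionReadback, SetSource and Latest are hypothetical names.

    ```csharp
    using Unity.Collections;
    using UnityEngine;
    using UnityEngine.Rendering;

    // Hypothetical component; the source buffer is assumed to be created and filled elsewhere.
    public class PositionReadback : MonoBehaviour
    {
        ComputeBuffer source;
        bool pending;

        // Most recent copy of the GPU data; stays null until the first request completes.
        public Vector3[] Latest { get; private set; }

        public void SetSource(ComputeBuffer buffer) { source = buffer; }

        void Update()
        {
            // Keep at most one request in flight instead of blocking on GetData().
            if (source == null || pending) return;
            pending = true;
            AsyncGPUReadback.Request(source, OnReadback);
        }

        void OnReadback(AsyncGPUReadbackRequest request)
        {
            pending = false;
            if (request.hasError)
            {
                Debug.LogWarning("GPU readback failed");
                return;
            }

            // Copy the result into a managed array so it can be used after the callback returns.
            NativeArray<Vector3> data = request.GetData<Vector3>();
            if (Latest == null || Latest.Length != data.Length) Latest = new Vector3[data.Length];
            data.CopyTo(Latest);
        }
    }
    ```

    The data in Latest is always at least a frame old, which is the trade-off for not stalling the CPU.
    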
     
  20. akduy

    akduy

    Joined:
    Dec 31, 2016
    Posts:
    3
    Thanks! This helped me a lot!
     
  21. KoolGamez

    KoolGamez

    Joined:
    Apr 11, 2020
    Posts:
    29
    I am really new to shaders and GPU programming. For my first project, I am trying to implement flocking using a compute shader. I read the answers in this thread, but a few things are still confusing. For flocking, I set the position of each boid in the compute shader on Start(). Then each Update, I dispatch the shader. And, instead of using GetData() and then rendering the boids on the CPU, I should also set the compute buffer on the material's shader. (Just to clarify, the compute buffer is just a structured buffer of float3 positions.)

    Now from here I am confused. How do I render a boid from the material's shader (both for a 2D boid and a 3D boid)?

    Please let me know how I can do this, and also whether I should change anything in the method described above to increase performance.
     
  22. bgolus

    bgolus

    Joined:
    Dec 7, 2012
    Posts:
    12,348