Performance and Mesh optimization

Discussion in 'Editor & General Support' started by hww, Dec 16, 2007.

  1. hww (Joined: Nov 23, 2007; Posts: 58)

    The documentation "Optimizing Graphics Performance" states:

    "The best way to improve rendering performance is to combine objects together so each mesh has around 1500 or more triangles and uses only one Material for the entire mesh."

    I understand this as: a mesh with fewer than 1500 triangles costs almost as much to render as one with 1500, because of the per-batch cost.

    I know that the more triangles I render per mesh, the cheaper each triangle becomes. But is this relationship linear?

    Is there a sweet-spot range of triangles? If yes, why?

    But what if my object is rendered several times in one scene: is there any automatic caching? Or is there any other reason for a sweet-spot range of triangles?

    Imagine a level with 100K (or 250K) triangles. I could have a single 100K-triangle object, or two of 50K each, or four of 25K each, and so on. Should these different combinations give the same FPS? If not, why?

    The documentation "Modeling Optimized Characters":

    "How many polygons you should use depends on the quality you are going after. Anything between 500-6000 triangles is reasonable. If you want lots of characters on screen, you will have to sacrifice in poly count per character, if you want it to run on old machines, you will have to use less polygons per character. As an example: Half Life 2 characters used 2500-5000 triangles per character. AAA nextgen games running on PS3 or XBox 360 usually have characters with 5000-7000 triangles."

    I understand it like this: fewer than 500 is not efficient because of the per-batch cost, but more than 6000 is not efficient because skinning will eat CPU. Or is there another reason for the upper limit?
     
  2. hww (Joined: Nov 23, 2007; Posts: 58)

    I have made a test app. It generates procedural scenes and measures the FPS for each. Each scene has 65K triangles but a different number of objects, from 100 down to 1. The result of my experiment: there is a sweet spot between 6K and 8K triangles. Can anybody explain that?
     

  3. Aras (Unity Technologies; Joined: Nov 7, 2005; Posts: 4,770)

    All I can say in a short sentence is that anything above a couple of thousand triangles is good (i.e. below that, you're most likely limited by batch cost on the CPU... on cards that do vertex processing in hardware).

    Your graph does not really test the low-poly-count object cases (try testing with 50, 100, 300 triangles and so on). Also check which shader you are using (use VertexLit, as that is always one pass), and double-check the number of draw calls in the game stats view.

    Now, back to the question: it might be that your tests don't end up with the same number of triangles for the graphics card to process. The graphics card processes vertices, and how many it has to process depends on the vertex count in the mesh, the amount of vertex sharing between triangles, and how well the mesh is optimized for the vertex processing cache.

    To test it "properly" you have to construct your meshes so that they use no vertex sharing at all, i.e. each triangle uses three vertices (and no other triangle uses those vertices). Then you'll know the exact number of vertices processed. If you already do this, then just ignore me.
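    A minimal sketch of that construction, in Python for illustration rather than Unity script: it builds an n x n grid of quads with shared vertices, then expands it so that every triangle owns its three vertices. The function names `grid_shared` and `unshare` are hypothetical, and the sketch only counts vertices; it does not touch any Unity API.

```python
def grid_shared(n):
    """Build an n x n grid of quads with shared vertices.

    Returns (vertices, triangles): (n + 1) ** 2 vertices and 2 * n * n
    triangles, each triangle a tuple of three vertex indices.
    """
    stride = n + 1
    vertices = [(x, y) for y in range(stride) for x in range(stride)]
    triangles = []
    for y in range(n):
        for x in range(n):
            a = y * stride + x      # lower-left corner of the quad
            b = a + 1               # lower-right
            c = a + stride          # upper-left
            d = c + 1               # upper-right
            triangles.append((a, b, c))
            triangles.append((b, d, c))
    return vertices, triangles


def unshare(vertices, triangles):
    """Duplicate vertices so that no two triangles share one."""
    out_vertices = []
    out_triangles = []
    for tri in triangles:
        base = len(out_vertices)
        out_vertices.extend(vertices[i] for i in tri)
        out_triangles.append((base, base + 1, base + 2))
    return out_vertices, out_triangles


shared_v, shared_t = grid_shared(4)
flat_v, flat_t = unshare(shared_v, shared_t)
print(len(shared_v), len(shared_t))  # shared grid: vertex and triangle counts
print(len(flat_v), len(flat_t))      # unshared: three vertices per triangle
```

    With the unshared version the vertex count is exactly three times the triangle count, so the number of vertices submitted is known regardless of how well the index order suits the vertex cache.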

    As to why the sweet spot seems to be exactly at 6-8 thousand: that depends on the video card, the CPU, the driver and so on. I'm actually surprised that the FPS starts falling with increasing triangle counts; I'd think that should not happen.
     
  4. hww (Joined: Nov 23, 2007; Posts: 58)

    It does, but that part of the graph is not interesting to me. Fewer triangles spread over more objects will be slower than fewer objects with more triangles.

    I did that test also. But this test is specifically for the shared-mesh option.

    Exactly. That is what interested me: "will performance go down after some number of triangles or not, when I render one mesh several times". Now I see it does, and I'd just like to know why.
     
  5. Aras (Unity Technologies; Joined: Nov 7, 2005; Posts: 4,770)

    So you do share the vertices between triangles. Do you call Optimize() on the mesh after constructing it? (or alternatively, do you emit triangles in some sort of cache-friendly order?)

    I don't know. Unity does not do anything on a per-triangle basis, it just submits the mesh for the graphics card to render (as a big vertex/index buffer). So it's definitely not some slowdown in Unity because of a larger mesh. Unless you also have mesh colliders attached to those objects, in which case Unity will also build a collision mesh (which can take time... but I guess you build your meshes before starting benchmarking, and maybe even render a couple of frames before starting the benchmark counter, so all meshes are properly uploaded to VRAM).

    By the way, what hardware are you testing this thing on?
     
  6. hww (Joined: Nov 23, 2007; Posts: 58)

    Yes

    I have no collision. I'm using a new iMac 20", 2.4 GHz, 4 GB memory.

    I am just thinking: could it be caching on the graphics board? I mean, while the mesh fits in the card's cache, each instance renders faster than one large mesh would.
     
  7. Aras (Unity Technologies; Joined: Nov 7, 2005; Posts: 4,770)

    No, I don't think that can be. Vertex data is usually put into VRAM (which should be enough in your case) or AGP/PCIe memory (which also should be enough). Other than that, video cards usually have a very small pre-transform cache (just a regular cache, probably a kilobyte or two in size), and a post-transform cache (usually 12-20 vertices in size). Neither of those explains why you see the sweet spot in the 6-8 thousand range.
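    To get a feel for how small a post-transform cache is, here is a minimal Python sketch that replays an index buffer through a FIFO cache and reports the average number of vertex transforms per triangle. The FIFO replacement policy, the default size of 16 entries, and the function name `transforms_per_triangle` are assumptions for illustration; real hardware may behave differently.

```python
from collections import deque


def transforms_per_triangle(triangles, cache_size=16):
    """Average vertex transforms per triangle for a FIFO post-transform cache."""
    cache = deque(maxlen=cache_size)  # oldest entry is dropped when full
    misses = 0
    for tri in triangles:
        for index in tri:
            if index not in cache:    # miss: the vertex must be transformed
                misses += 1
                cache.append(index)
    return misses / len(triangles)


# Ten triangles that each reuse two vertices of the previous one.
strip_like = [(i, i + 1, i + 2) for i in range(10)]
# Ten triangles with no shared vertices at all.
unshared = [(3 * i, 3 * i + 1, 3 * i + 2) for i in range(10)]

print(transforms_per_triangle(strip_like))
print(transforms_per_triangle(unshared))
```

    A cache of this size only rewards reuse between neighbouring triangles; it holds a handful of vertices, nowhere near a mesh of several thousand.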

    Now it's getting interesting. Want to share the project that has your experiments?
     
  8. Jonathan Czeck (Joined: Mar 17, 2005; Posts: 1,713)

    Maybe the ordering of the triangles is different enough to be significantly affecting the amount of overdraw, thus affecting FPS.

    -Jon
     
  9. hww (Joined: Nov 23, 2007; Posts: 58)

    Hard to believe that the chip has no cache of around 8K for vertices. But who knows.

    Sure
     

  10. Aras (Unity Technologies; Joined: Nov 7, 2005; Posts: 4,770)

    GPUs usually have very small caches. They are optimized for throughput, not for latency, so they have high memory frequencies, wide buses but can use small caches. CPUs on the other hand, are optimized for "anything can happen now" case, so small latency is more important than high throughput, hence huge caches.

    Anyway, back to the question (warning - long post ahead). I did some tests on my own. The attached project (for Unity 2.0) has two tests:

    1. Draw Call test. This draws exactly same number of exactly same size polygons in exactly the same places, just using different batch count. In essence, it has lots of plane grids, and subdivides them into variable number of tiles. So the GPU vertex processing always stays more or less the same (sans the number of duplicate vertices across touching tiles), pixel processing is also the same, just the batch count differs.

    2. Vertex processing throughput test. This always draws same number of batches (500), with meshes that occupy the same screen area. Each test uses meshes subdivided into variable number of polygons. So vertex processing requirement on the GPU grows with each test - it has more vertices to process. The CPU should be loaded the same in each test though (number of batches is the same).

    I tested on first-gen MacBook Pro, Core Duo 1.83GHz, Radeon X1600, OS X 10.5.1. The GPU is not very fast (it's underclocked a lot from the "normal" X1600s). Important to test in standalone player, as in the editor we do a lot more error checking, plus other editor overhead (e.g. drawing the scene view).

    1. Draw call testing. I think the proper question to ask here is "how many draw calls can I afford?". Each draw call takes some CPU time; decide what frame rate you want (say, 60 FPS) and how much CPU you want to leave for physics, game logic etc., and you'll end up with how much CPU time you can afford for submitting objects for rendering.

    On my machine, to draw all ~200 thousand vertices in this test, with spending all CPU on drawing: best to draw it in 160 batches (2401 vertices/batch, 145 FPS). If however I want to spend 10 milliseconds/frame of CPU on other tasks, then it's best to draw the whole thing in 40-90 batches (9409-4225 vertices/batch, 74.4 FPS - note that each frame I simulate "10 ms spent somewhere", hence absolute FPS is lower). If I want to spend 20 milliseconds/frame of CPU on other tasks, then it's best to draw in 40 batches (9409 vertices/batch, 42.7 FPS).

    My machine seems to be able to do about 90000 batches/second, if I max out the batch count. Note that batches here are very simple: just changing a mesh; in a real game you'll quite often change textures, shaders, colors and whatnot, so "real batches" might be more expensive.
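    The budgeting described above can be sketched as a few lines of arithmetic. This is a rough estimate only, assuming every batch costs the same CPU time (the roughly 90000 batches/second quoted above is for very simple batches); the function name `affordable_batches` and its parameters are hypothetical.

```python
def affordable_batches(batches_per_second, target_fps, other_cpu_ms=0.0):
    """Estimate how many draw calls fit into one frame's CPU budget.

    batches_per_second: measured submission rate with nothing else running
    target_fps:         desired frame rate
    other_cpu_ms:       CPU time per frame reserved for physics, logic, etc.
    """
    frame_ms = 1000.0 / target_fps
    render_ms = max(0.0, frame_ms - other_cpu_ms)
    return render_ms * batches_per_second / 1000.0


for other_ms in (0.0, 10.0):
    n = affordable_batches(90000, 60, other_ms)
    print(f"{other_ms:4.1f} ms for other work -> about {n:.0f} batches/frame")
```

    At 60 FPS a frame lasts about 16.7 ms, so reserving 10 ms for other work leaves well under half of the frame for submitting batches; "real" batches that change textures or shaders would lower the count further.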

    2. Vertex processing throughput
    My results are below: each row gives vertices/batch and the resulting processing rate in millions of vertices/second. As noted above, this is 500 draw calls per frame. Data:
    Code (csharp):

    Verts/Batch  MVerts/s
         25         0.5
        121         2.3
        289         5.0
        529         8.3
        841        12.3
       1225        16.6
       1681        20.9
       2209        24.9
       2809        28.7
       3481        32.3
    // drop!
       4225        25.1
       5041        26.2
       5929        27.0
       6889        27.4
    // further it slowly goes up to 30.0 Mverts/s

    You can see that the limit of this machine is about 30 million vertices/second (quite low... I blame Apple for underclocking it!). And except for a curious drop at about 4000 vertices/batch, the processing throughput increases with increasing batch sizes.

    Why at about 4000 vertices/batch there's a sudden drop in vertex processing performance - I don't know. My mesh vertices are position+color, that makes them 16 bytes (12 for position, 4 for color), so at 4000 vertices the mesh reaches 64 kilobytes of vertex data. Maybe Apple's OpenGL driver or this particular graphics card switches to some "slower mode" when mesh vertex data exceeds 64 kilobytes? I could imagine that the internal format of graphics card's push-buffers changes when some limit is exceeded, but I don't know for sure. But overall using larger batches makes the graphics card more happy.
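    The 64-kilobyte arithmetic can be checked against the table rows on either side of the drop. A small sketch, assuming the 12-byte position is three 32-bit floats and the 4-byte color is one packed 32-bit value (the post gives only the byte counts); the constant and function names are hypothetical.

```python
POSITION_BYTES = 12  # assumed: three 32-bit floats
COLOR_BYTES = 4      # assumed: one packed 32-bit color
BYTES_PER_VERTEX = POSITION_BYTES + COLOR_BYTES
LIMIT_BYTES = 64 * 1024


def vertex_data_bytes(vertex_count):
    """Size of the vertex data for a mesh with this many vertices."""
    return vertex_count * BYTES_PER_VERTEX


# Last row before the drop and first row after it.
for count in (3481, 4225):
    size = vertex_data_bytes(count)
    side = "below" if size < LIMIT_BYTES else "above"
    print(f"{count} vertices -> {size} bytes ({side} 64 KiB)")

# Largest vertex count whose data does not exceed 64 KiB.
print(LIMIT_BYTES // BYTES_PER_VERTEX)
```

    The two rows straddle the 64 KiB mark, which is consistent with the guess above but does not confirm it.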

    Ok, time to sleep now.
     

  11. shaun (Joined: Mar 23, 2007; Posts: 728)

    Best thread I've read in ages! Thanks for going into the details Aras.
     
  12. jashan (Joined: Mar 9, 2007; Posts: 3,307)

    Very interesting thread, indeed :)

    I've just played around with this on my 4-core Mac Pro (2.66 GHz) with an ATI Radeon X1900. With the Draw Call test, I get the best FPS with 40 batches (around 367); from there, FPS decreases. The maximum number of batches/s is around 150000, if that matters; it is reached at 1440 batches, stays roughly constant until 2560, and then drops.

    With 10ms/frame, I get about 88 FPS with 40 batches and lower FPS with more batches. Batches/s behaves similarly, with a maximum of around 135000 in the same area. With 20ms/frame, max batches/s is around 132000 (2560 batches); max FPS is about 45 at 40 batches and remains almost constant until 160, dropping to 40 FPS at 360. At 1440 batches, I still get about 30 FPS.

    With only 10 batches, I always get below 10 FPS, and always below 100 batches/s.

    Code (csharp):

    Sleep: None
    # Batches    FPS   Batches/s
         10      9.1         91
         40    367.0      14682
         90    362.5      32625
        160    351.2      56188
        360    327.7     117958
        640    218.5     139815
       1440    103.8     149405
       2560     56.2     143999
       5760     23.8     136898
      10240     13.0     133220
      23040      5.4     124758

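    The third column of this table can be cross-checked from the first two, since batches/s should be the batch count times the FPS. A short sketch with the rows copied from the table; the small differences are what one would expect from the FPS being rounded to one decimal, and the per-batch time is only an average derived from the peak row.

```python
# (batches per frame, FPS, reported batches/s), copied from the table above
rows = [
    (10, 9.1, 91),
    (40, 367.0, 14682),
    (90, 362.5, 32625),
    (160, 351.2, 56188),
    (360, 327.7, 117958),
    (640, 218.5, 139815),
    (1440, 103.8, 149405),
    (2560, 56.2, 143999),
    (5760, 23.8, 136898),
    (10240, 13.0, 133220),
    (23040, 5.4, 124758),
]

for batches, fps, reported in rows:
    recomputed = batches * fps
    print(f"{batches:6d} batches: recomputed {recomputed:9.0f}, reported {reported:7d}")

# Row with the highest submission rate, and the average CPU time per batch there.
peak_batches, _, peak_rate = max(rows, key=lambda row: row[2])
print(peak_batches, round(1e6 / peak_rate, 2), "microseconds per batch")
```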
    The full results of the vertex processing throughput test on my machine follow. I get a very significant drop right after 15625 vertices/batch (at 16384). And I find it somewhat interesting that Mv/s is more or less constant between 5476 and 15625 vertices/batch.

    Code (csharp):

    card: ATI Radeon X1900 OpenGL Engine
    vendor: ATI Technologies Inc.
    api: OpenGL 2.0 [2.0 ATI-1.5.18]
    vram: 256
    os: Mac OS X 10.5.1
    cpu: Intel(R) Xeon(R) CPU 5150 @ 2.66GHz
    cpucount: 4
    ram: 5120
    unity: 2.0.1r2

    v/batch   Mv/s      FPS
        25     2.6   216.73
        64     6.4   211.00
       121    11.8   204.13
       196    18.7   200.07
       289    27.1   196.32
       400    35.6   186.88
       529    44.8   177.44
       676    53.2   165.01
       841    61.3   152.81
      1024    68.8   140.99
      1225    75.2   128.69
      1444    81.5   118.41
      1681    79.9    99.72
      1936    90.6    98.16
      2209    94.4    89.61
      2500    97.0    81.39
      2809    98.1    73.26
      3136    99.5    66.56
      3481   100.9    60.78
      3844   102.4    55.86
      4225   103.1    51.17
      4624   105.2    47.70
      5041   105.1    43.72
      5476   106.7    40.85
      5929   105.7    37.39
      6400   105.8    34.68
      6889   104.9    31.93
      7396   105.7    29.96
      7921   106.0    28.06
      8464   106.4    26.36
      9025   107.4    24.96
      9604   107.6    23.49
     10201   106.6    21.92
     10816   107.2    20.79
     11449   106.2    19.45
     12100   107.2    18.59
     12769   107.5    17.65
     13456   107.6    16.77
     14161   106.9    15.83
     14884   107.8    15.18
     15625   107.9    14.48
     16384    67.7     8.67
     17161    67.7     8.27
     17956    67.7     7.91
     18769    67.7     7.57
     19600    67.6     7.23
     20449    72.8     7.46
     21316    68.8     6.77
     22201    68.2     6.44

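    The Mv/s column can be related to the other two, given that the test draws 500 batches per frame. In the sketch below, the divisor of 2**20 vertices per "Mv" is an inference from the numbers themselves (dividing by 10**6 does not reproduce the column); it is not stated anywhere in the thread, and the function name is hypothetical.

```python
BATCHES_PER_FRAME = 500      # from the description of the throughput test
VERTS_PER_MV = 2 ** 20       # assumption inferred from the reported figures


def mverts_per_second(verts_per_batch, fps):
    """Vertex throughput implied by a batch size and a frame rate."""
    return verts_per_batch * BATCHES_PER_FRAME * fps / VERTS_PER_MV


# (v/batch, reported Mv/s, FPS) for a few rows of the table above
samples = [
    (25, 2.6, 216.73),
    (1024, 68.8, 140.99),
    (15625, 107.9, 14.48),
    (16384, 67.7, 8.67),
]

for verts, reported, fps in samples:
    print(f"{verts:6d} v/batch: computed {mverts_per_second(verts, fps):6.1f}, reported {reported:6.1f}")
```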
    Warm regards,
    Jashan
     
  13. Aras (Unity Technologies; Joined: Nov 7, 2005; Posts: 4,770)

    I think the most useful number from that test is just how many batches/second a machine can do (no sleep, max out batches). Then I know: OK, my CPU with the current driver can do about 100000 batches/second.

    So something like 107 million vertices/second is the practical maximum that your graphics card can do. The drop with very large batch sizes is somewhat expected; the driver probably switches to 32-bit index buffers or whatnot. What was unexpected for me is the drop at 4000 vertices on my machine; 4000 is just a low number.
     
  14. brad_ict (Joined: Sep 14, 2010; Posts: 69)

    The documentation indicates that combining meshes provides performance benefits. However, after doing some tests, it seems like meshes with the same material are auto-batched and result in 1 draw call.

    Questions:
    1. So why combine meshes that share a single material?
    2. Is there a frame rate performance benefit to combining the meshes even though it's the same number of draw calls for combined or uncombined meshes with the same material? (is that what this post is essentially indicating?)
    3. Are any of my conclusions incorrect?

    Test Results:

    1. 1 Mesh with 1 Material = 1 Draw Call, Batched 1
    2. 2 Meshes, Uncombined, with the same material = 1 Draw Call, Batched 2
    3. 2 Meshes, Combined, with the same material = 1 Draw Call, Batched 0
    4. 2 Meshes, Uncombined, with separate materials each = 2 Draw Calls, Batched 0
    5. 2 Meshes, Combined, with separate materials for each mesh = 2 Draw Calls, Batched 0

    Conclusions:

    1. Mesh draw calls are directly related to the number of materials applied to the meshes being drawn.
    2. If two meshes share the same material, then whether they are combined or separate, they will be batched and result in a single draw call.
    3. Combining two objects with separate materials doesn't give you any performance benefit, so don't do it: kept separate, you can apply the two materials to either mesh inside Unity, instead of having to apply the materials in Maya, combine, and then export (i.e. you can't separate a mesh in Unity to apply materials to each part and then combine the mesh again).
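    The five test results above fit a very simple model: with batching active, the draw call count equals the number of distinct materials in view. The Python sketch below only encodes that observation; it assumes every same-material mesh qualifies for batching, which the thread later notes depends on vertex count, and the function name and data layout are hypothetical.

```python
def draw_calls(objects):
    """Count draw calls under the 'one call per distinct material' model.

    objects: list of objects, each given as a list of material names,
             one name per submesh.
    """
    return len({material for obj in objects for material in obj})


cases = {
    "1 mesh, 1 material": [["A"]],
    "2 meshes, uncombined, same material": [["A"], ["A"]],
    "2 meshes, combined, same material": [["A"]],
    "2 meshes, uncombined, separate materials": [["A"], ["B"]],
    "2 meshes, combined, separate materials": [["A", "B"]],
}

for label, objects in cases.items():
    print(f"{label}: {draw_calls(objects)} draw call(s)")
```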
     
  15. Dreamora (Joined: Apr 5, 2008; Posts: 26,601)

    Another thing to keep in mind when you test on OS X with ATI is the nasty driver bug with higher-poly meshes, which causes a disproportionate loss of performance beyond 15k-20k polys per mesh.
    It can lead to pretty misleading benchmark data compared to what you get on NVIDIA on OS X, or on Windows in general.
     
  16. brad_ict (Joined: Sep 14, 2010; Posts: 69)

    Does the new Auto-Batching feature in Unity 3 (was previously only on iPhone) basically mean you don't need to combine meshes with the same material manually and the docs just haven't been updated (where they say to "combine,combine,combine")? The meshes I'm talking about would be dynamic meshes in Unity Pro targeting PC Standalone.
     
  17. Dreamora (Joined: Apr 5, 2008; Posts: 26,601)

    If they fulfill the requirements, that's what it would mean, yes (300 vertices per mesh, where this 300 assumes base information, i.e. position, normal and UV; if more is stored, the number is correspondingly lower).
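    One way to read that requirement is as a fixed per-mesh budget of 300 x 3 = 900 vertex attributes, so that meshes storing more attributes per vertex get a proportionally lower vertex limit. The sketch below encodes that reading only; the fixed budget, the inclusive boundary, and the function names are assumptions, and the exact rule for a given Unity version should be taken from the manual.

```python
BASE_ATTRIBUTES = 3        # position, normal, UV
BASE_VERTEX_LIMIT = 300    # limit quoted for the base attribute set
ATTRIBUTE_BUDGET = BASE_ATTRIBUTES * BASE_VERTEX_LIMIT  # assumed fixed budget


def max_batchable_vertices(attributes_per_vertex):
    """Vertex limit for a mesh that stores this many attributes per vertex."""
    return ATTRIBUTE_BUDGET // attributes_per_vertex


def can_batch(vertex_count, attributes_per_vertex=BASE_ATTRIBUTES):
    """True if the mesh fits the assumed dynamic batching budget."""
    return vertex_count <= max_batchable_vertices(attributes_per_vertex)


print(max_batchable_vertices(3))  # base case: position, normal, UV
print(max_batchable_vertices(5))  # e.g. with a second UV set and tangents
print(can_batch(250), can_batch(250, attributes_per_vertex=5))
```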
     
  18. PolyMad (Joined: Mar 19, 2009; Posts: 2,350)

    Sorry, is there any documentation about this feature?
    I'd love to know how to squeeze the most out of these features and have the freedom to edit maps with small objects that are then optimized by Unity itself.
     
  19. Dreamora (Joined: Apr 5, 2008; Posts: 26,601)

    Yes, there is documentation on dynamic batching in the manual.