Garbage Collection, Allocations, and Third Party Assets in the Asset Store

Discussion in 'General Discussion' started by Games-Foundry, Jun 21, 2012.

  1. Games-Foundry

    Games-Foundry

    Joined:
    May 19, 2011
    Posts:
    632
    AKA: PLEASE STOP FEEDING THE GC ANIMAL

    Given the most probable outcome that we are stuck with the slow garbage collection in Mono 2.6 until Unity 5.0 ( great if we do get an upgrade in 4.x but I don't hold out much hope ), I'm ruthlessly trying to minimize allocate/destroy so as to feed the gc as little as possible. For a while I've been doing this in the Folk Tale code - although evidently I still have some work to do looking at ObjectCullingManager - and now I've turned my attention to third party components.

    The profiler grab below illustrates what we're up against with components available in the asset store. The figures in red show how much memory is being allocated on the managed heap, which the garbage collector may later have to reclaim. Some components do this every frame, others only do it when the functionality is triggered. While allocation is not necessarily an evil thing, allocating and discarding data for garbage collection every frame is.

    $gc allocators.jpg

    VLight is first up. The VLight.OnWillRenderObject is allocating 8-10KB every frame in our particular scene. Looks like it could benefit from some caching.

    UnitySteer's AutonomousVehicle.FixedUpdate is next up, allocating 7.1KB every frame, but that'll be path data so is expected. I think I'll actually replace UnitySteer with A*Pathfinding as we no longer have use for it.

    A*Pathfinding is allocating sizeable chunks when a path is calculated, but we have a very large grid graph at fine detail, so that's to be expected. Although some effort to reduce this would be welcomed as we have a lot of characters moving around so it's fairly regular, and the KB varies depending on the path complexity. Perhaps have a pool of cached path points that is drawn from, and provide a repool function so we can return points that are no longer required?

    EDIT ( 30 JULY 2012 ): Aron Granberg has further improved A*Pathfinding, and allocation has dropped to a consistent 21 bytes per frame. It will probably see general release in 3.2.

    $cpu3-astarpathfinding and unitysteer.jpg

    Extending the investigation another day, it looks like some of Unity's own code might be contributing to the problem. The CharacterController code for example is allocating 10KB per update cycle:

    $cpu1-charactercontroller.jpg

    And here's what GetComponentsInChildren() allocates. In this instance I'll have to make sure I cache and reuse the array if possible, preventing it being marked for garbage collection.

    $cpu4-getcomponentsinobject.jpg

    I'd really like this thread to serve as encouragement for asset store authors to optimize their code to cache and pool as many objects and variables as possible to minimise any food for the evil gc monster. If other pro user community members are witnessing similar behaviour with other components not listed thus far, please post your profiler data and the function name, and notify the component author. Hopefully we can get more authors to think more about code performance.


    WHERE TO OPTIMIZE
    The greatest emphasis should be on allocating any reference based heap objects once at the start, and recycling them to avoid feeding the garbage collector. Beyond that, here are a few points to help developers with further optimizations backed by profiler data, but they will have a much lesser impact.

    - Recycling heap objects will prevent the garbage collector kicking in
    - Recycling is faster than using new because it avoids calling the constructor ( ctor() )
    - Value types are subject to garbage collection when used in classes
    - Local variables are significantly faster than member variables
    - There is no performance difference between for and while loops
    - SqrMagnitude is ever so slightly faster than Vector3.Distance
    - For loops are much faster than foreach loops
    - Caching Component.transform, .rigidbody, .audio etc is considerably quicker
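
    As a rough illustration of the "allocate once at the start, then recycle" point above, here is a minimal sketch in plain C# ( no Unity types, so it can be run anywhere ). PathPoint and PathPointPool are made-up names for this example and are not taken from any of the assets mentioned in this thread.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical pooled type; stands in for any reference type
// that would otherwise be created with "new" every frame.
public sealed class PathPoint
{
    public float X, Y, Z;

    public void Reset () { X = 0f; Y = 0f; Z = 0f; }
}

// Minimal pool: all allocation happens up front,
// Get/Release only move references around.
public sealed class PathPointPool
{
    private readonly Stack<PathPoint> free;

    public PathPointPool ( int capacity )
    {
        free = new Stack<PathPoint> ( capacity );
        for ( int i = 0; i < capacity; i++ )
        {
            free.Push ( new PathPoint () );
        }
    }

    public int FreeCount { get { return free.Count; } }

    // Falls back to allocating if the pool runs dry, so size it generously.
    public PathPoint Get ()
    {
        return free.Count > 0 ? free.Pop () : new PathPoint ();
    }

    // Clears the object and makes it available for reuse.
    public void Release ( PathPoint p )
    {
        p.Reset ();
        free.Push ( p );
    }
}

public static class PoolDemo
{
    public static void Main ()
    {
        PathPointPool pool = new PathPointPool ( 4 );
        PathPoint a = pool.Get ();
        a.X = 3f;
        pool.Release ( a );
        PathPoint b = pool.Get ();
        // The released instance is handed out again rather than a new one.
        Console.WriteLine ( ReferenceEquals ( a, b ) );
    }
}
```

    Note that Release has to reset the object, so part of the saving from skipping the constructor is spent there. The main win is that nothing is handed to the garbage collector during play.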


    Disclaimer
    These tests have been run on a PC desktop. The results may differ on other platforms and I encourage you to run your own tests.

    Update
    This thread is now being used to document allocations within the Unity API to assist both the community and UT following on from this topic. If you are thinking of posting a bug here, please read all posts from here onwards to check it hasn't already been added, and be sure to file a bug report entitled "[Function Name] API call causes c# allocations".
     
    Last edited: Nov 27, 2012
  2. kellygravelyn

    kellygravelyn

    Joined:
    Jan 22, 2009
    Posts:
    143
    Keep in mind that value types (such as structs) do not allocate on the heap and therefore do not live in the land of the garbage collector. You could call "new Vector3()" hundreds of thousands of times per frame and not cause a single garbage collection. I'm not sure what real tangible performance impact constructing new structs will cause, but it absolutely will not manifest in a garbage collection.

    Otherwise in general I agree with your observations. Many people do not take care to reduce or eliminate heap allocations which will cause GCs, which can be especially painful on mobile platforms. All assets that do work at runtime should strive to avoid all allocations beyond initialization and startup, otherwise they impact the performance of their customers' games. It does take some extra work, but it results in a much better product.
     
  3. Games-Foundry

    Games-Foundry

    Joined:
    May 19, 2011
    Posts:
    632
    Yup, perhaps I should make a clearer distinction between allocation and garbage collection for the benefit of others. I've amended the OP accordingly.

    Allocation is not normally an issue if done once at the start rather than in per frame functions, and is mostly concerned with performance. Recycling is much faster than allocating new, because the constructor doesn't get called ( inc. structs ) and there's no allocation ( heap objects ).

    Feeding the gc monster with heap objects is best avoided because of the cpu spike.

    Here's the test code for performance of constructors v. recycling:

    Code (csharp):
    public Vector3 v3;

    public void Update ()
    {
        // ctor() called each time, self ms 1.38
        for ( int i = 0; i<100000; i++ )
        {
            v3 = new Vector3 ( 1f, 1f, 0.5f );
        }
    }

    // Alternative version: swap this in for the Update above
    // ( a class can only define Update once ).
    public void Update ()
    {
        // no ctor, self ms = 0.88
        for ( int i = 0; i<100000; i++ )
        {
            v3.x = 1f;
            v3.y = 1f;
            v3.z = 0.5f;
        }
    }

    // CONCLUSION: re-cycle Vector3 and Quaternions
     
    Last edited: Jun 21, 2012
  4. Jaimi

    Jaimi

    Joined:
    Jan 10, 2009
    Posts:
    6,204
    Value types do not get garbage collected when they are on the stack. However, if they are allocated on the heap (by being part of a class, for example) they are of course garbage collected.
     
  5. Eric5h5

    Eric5h5

    Volunteer Moderator Moderator

    Joined:
    Jul 19, 2006
    Posts:
    32,401
    Just mentioning here that none of my stuff misbehaves like this. :) It's pretty much "0 B" in the memory column for everything, except where actual new objects are created if necessary. Nothing on a continuous basis though.

    --Eric
     
  6. Games-Foundry

    Games-Foundry

    Joined:
    May 19, 2011
    Posts:
    632
    I'm going to run some tests so we have hard facts to support this discussion. I'll be editing this post as I complete them.

    Test Objective
    Test @Jaimi's statement about value types being garbage collected when part of a class

    Case 1: Empty Class

    Code (csharp):
    public class MyClass
    {
    }

    public void Update ()
    {
        int i;

        // garbage collector kicks in
        for ( i=0; i<100000; i++ )
        {
            MyClass myObj = new MyClass ();
            myObj = null;
        }
    }
    Outcome:
    - GC called less often.
    - 0.8MB collected each gc call

    $cpu.JPG



    Case 2: Class With Struct

    Code (csharp):
    public class MyClass
    {
        public Vector3 v3;
    }

    public void Update ()
    {
        int i;

        // garbage collector kicks in
        for ( i=0; i<100000; i++ )
        {
            MyClass myObj = new MyClass ();
            myObj = null;
        }
    }
    Outcome:
    - GC called more regularly
    - 1.9MB collected each gc call

    $gpu.JPG



    Case 3: Class With Struct, Constructor Called

    Code (csharp):
    public class MyClass
    {
        public Vector3 v3;
    }

    public void Update ()
    {
        int i;

        // garbage collector kicks in
        for ( i=0; i<100000; i++ )
        {
            MyClass myObj = new MyClass ();
            myObj.v3 = new Vector3 ( 1f, 1f, 0.5f );
            myObj = null;
        }
    }
    Outcome:
    - GC called regularly
    - myObj.v3 = new Vector3 has performance overhead
    - 1.9MB collected each gc call

    $cpu2.JPG


    Conclusion
    Valid observation.
    Structs do get allocated to the heap when part of a class, and thus subject to garbage collection.
     
    Last edited: Jun 21, 2012
  7. Eric5h5

    Eric5h5

    Volunteer Moderator Moderator

    Joined:
    Jul 19, 2006
    Posts:
    32,401
    Speaking of performance, it's best to use local variables where possible (also makes for easier-to-maintain code). Using the above example, change it to this:

    Code (csharp):
    public void Update ()
    {
        Vector3 v3;

        for ( int i = 0; i<100000; i++ )
        {
            v3.x = 1f;
            v3.y = 1f;
            v3.z = 0.5f;
        }
    }
    and you should see a few ms shaved off.

    --Eric
     
  8. Jaimi

    Jaimi

    Joined:
    Jan 10, 2009
    Posts:
    6,204
    In this case, the Vector3 is directly on the stack, and is not allocated or deallocated - only the stack pointer changes.
     
  9. Games-Foundry

    Games-Foundry

    Joined:
    May 19, 2011
    Posts:
    632
    Test Objective
    Are local variables quicker than member variables?

    Case 1: Member Variable

    Code (csharp):
    public Vector3 memberV3;

    public void Update ()
    {
        for ( int i = 0; i<1000000; i++ )
        {
            memberV3.x = 1f;
            memberV3.y = 1f;
            memberV3.z = 0.5f;
        }
    }
    Outcome:
    8.3ms per frame

    $cpu9.jpg


    Case 2: Local Variable

    Code (csharp):
    public void Update ()
    {
        Vector3 localV3;

        for ( int i = 0; i<1000000; i++ )
        {
            localV3.x = 1f;
            localV3.y = 1f;
            localV3.z = 0.5f;
        }
    }
    Outcome:
    5.1ms per frame

    $cpu10.jpg

    Conclusion
    Valid Observation.
    Local variables are considerably faster than member variables.
     
    Last edited: Jun 21, 2012
  10. Games-Foundry

    Games-Foundry

    Joined:
    May 19, 2011
    Posts:
    632
    Test Objective
    Where to define boundaries, inside or outside the loop.

    Case 1: Boundary Definitions Inside Loop

    Code (csharp):
    // needs: using System.Collections.Generic;

    public class MyClass
    {
        public Vector3 v3;
    }

    public List<MyClass> myList;

    public void Awake ()
    {
        myList = new List<MyClass>();
        for ( int i=0; i<1000000; i++)
        {
            myList.Add ( new MyClass() );
        }
    }

    public void Update ()
    {
        // myList.Count is evaluated on every iteration
        for ( int i=0; i<myList.Count; i++ )
        {
            myList[i].v3 = new Vector3 ( 1f, 1f, 0.5f );
        }
    }
    Outcome:
    - 23.8ms per frame

    $cpu5.jpg


    Case 2: Boundary Definitions Outside Loop ( aka hoisting )

    Code (csharp):
    // needs: using System.Collections.Generic;

    public class MyClass
    {
        public Vector3 v3;
    }

    public List<MyClass> myList;

    public void Awake ()
    {
        myList = new List<MyClass>();
        for ( int i=0; i<1000000; i++)
        {
            myList.Add ( new MyClass() );
        }
    }

    public void Update ()
    {
        int i;
        // loop bound read once, before the loop
        int count = myList.Count;

        for ( i=0; i<count; i++ )
        {
            myList[i].v3 = new Vector3 ( 1f, 1f, 0.5f );
        }
    }
    Outcome:
    - 19.5ms per frame

    $cpu8.jpg


    Case 3: Mix Of Inside and Outside

    Code (csharp):
    // needs: using System.Collections.Generic;

    public class MyClass
    {
        public Vector3 v3;
    }

    public List<MyClass> myList;

    public void Awake ()
    {
        myList = new List<MyClass>();
        for ( int i=0; i<1000000; i++)
        {
            myList.Add ( new MyClass() );
        }
    }

    public void Update ()
    {
        // counter declared outside, bound still read inside the loop
        int i;

        for ( i=0; i<myList.Count; i++ )
        {
            myList[i].v3 = new Vector3 ( 1f, 1f, 0.5f );
        }
    }
    Outcome:
    - 25.1ms
    - Slightly surprising outcome, but doesn't change our conclusion.

    $cpu7.jpg


    Conclusion
    'Hoist' boundary definitions to be outside the loop.
     
    Last edited: Jun 22, 2012
  11. Eric5h5

    Eric5h5

    Volunteer Moderator Moderator

    Joined:
    Jul 19, 2006
    Posts:
    32,401
    One unfortunate thing about Mono is that it doesn't seem to have the optimization that .NET has, where if you use "for (int i = 0; i < array.Length; i++)", it removes the need for bounds checking in the array, since there's no possibility of i being outside the bounds, so it's actually faster than putting the initializer outside the loop. (Not sure if that applies to List.Count too, but I would expect so.)

    --Eric
     
  12. Games-Foundry

    Games-Foundry

    Joined:
    May 19, 2011
    Posts:
    632
    It looks like at least for complex containers such as Lists that it's quicker to move the boundary definitions outside the loop. The JIT optimizer for arrays should produce the outcome you describe but we'd have to test that.

    Perhaps other members could contribute their tests? These aren't exactly scientific tests I'm executing.
     
    Last edited: Jun 21, 2012
  13. Lypheus

    Lypheus

    Joined:
    Apr 16, 2010
    Posts:
    664
    I'll have to watch this, I know the JIT optimizes this all away in many of the cases listed above ... c# ... bleh ... cripes next i'll be back to unrolling loops too :(.
     
  14. Games-Foundry

    Games-Foundry

    Joined:
    May 19, 2011
    Posts:
    632
    Test Objective
    Is there a performance difference between while and for loops?

    Case 1: For Loops

    Code (csharp):
    // needs: using System.Collections.Generic;

    public class MyClass
    {
        public Vector3 v3;
    }

    public List<MyClass> myList;

    public void Awake ()
    {
        myList = new List<MyClass>();
        for ( int i=0; i<1000000; i++)
        {
            myList.Add ( new MyClass() );
        }
    }

    public void Update ()
    {
        int i;
        int count = myList.Count;

        for ( i=0; i<count; i++ )
        {
            myList[i].v3 = new Vector3 ( 1f, 1f, 0.5f );
        }
    }
    Outcome:
    - 20.99ms

    $cpu11-for.jpg


    Case 2: While Loop

    Code (csharp):
    // needs: using System.Collections.Generic;

    public class MyClass
    {
        public Vector3 v3;
    }

    public List<MyClass> myList;

    public void Awake ()
    {
        myList = new List<MyClass>();
        for ( int i=0; i<1000000; i++)
        {
            myList.Add ( new MyClass() );
        }
    }

    public void Update ()
    {
        int i = 0;
        int count = myList.Count;

        while ( i<count )
        {
            myList[i].v3 = new Vector3 ( 1f, 1f, 0.5f );
            i++;
        }
    }
    Outcome:
    - 20.98ms

    $cpu12-while.jpg


    Conclusion
    No difference. Probably because for and while loops are both evaluated the same in .NET.
     
    Last edited: Jun 21, 2012
  15. Noisecrime

    Noisecrime

    Joined:
    Apr 7, 2010
    Posts:
    2,051
    I wish I had the time to invest in this as I love finding the most optimal use cases, but sadly I'm too busy.

    However I'm a little dubious over this test case. Repeatedly using the same values is not very 'real-world', and I wonder whether the compiler might be able to do some 'unfair' optimisations itself. In my opinion it would be more valid to assign x,y,z using some basic equation so the values are different in each loop and different for every update.

    It's also strange as I'm sure I profiled this myself in terms of pure performance and found the reverse was true. It was slightly counter-intuitive, as you'd expect the overhead of new vector to add up. It may have been due to other factors as it was a 'real-world' test. Like I said, I wish I had time to repeat my tests in order to provide evidence. Maybe I'll find the time later.

    Anyway I think this is a great idea, and it will be interesting to see what you discover.

    It's also a shame you can't disable bounds checking in Unity/Mono. I'm still not clear exactly when it comes into effect, but if it does, then removing it could certainly speed up some of my heavy loop/array code.


    Of course for real optimisations you'll want to be examining the output IL (MSIL) code, though that's not always fun ;)
     
    Last edited: Jun 21, 2012
  16. Games-Foundry

    Games-Foundry

    Joined:
    May 19, 2011
    Posts:
    632
    Test Objective
    Is there a speed difference between Vector3.Distance and SqrMagnitude?

    Case 1: Vector3.Distance

    Code (csharp):
    public Vector3 pointA = Vector3.zero;
    public Vector3 pointB = Vector3.one;

    public void Update ()
    {
        int i;
        float distance;

        for ( i=0; i<1000000; i++ )
        {
            distance = Vector3.Distance ( pointA, pointB );
        }
    }
    Outcome:
    - 33.5ms per frame

    $cpu13 - vector3distance.jpg



    Case 2: SqrMagnitude

    Code (csharp):
    public Vector3 pointA = Vector3.zero;
    public Vector3 pointB = Vector3.one;

    public void Update ()
    {
        int i;
        float distance;

        for ( i=0; i<1000000; i++ )
        {
            distance = ( pointA - pointB ).sqrMagnitude;
        }
    }
    Outcome:
    - 30.5ms

    $cpu13 - sqrmagnitude.jpg


    Conclusion
    SqrMagnitude is a little faster than Vector3.Distance
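
    Worth noting that sqrMagnitude gives the squared distance, so it is only a drop-in replacement when comparing against a threshold that is squared as well. A minimal sketch of that pattern follows. It uses System.Numerics.Vector3 and its LengthSquared() as a stand-in for Unity's Vector3 and sqrMagnitude so that it runs outside the editor; WithinRange is a made-up helper name.

```csharp
using System;
using System.Numerics;

public static class RangeCheck
{
    // True when b is within 'range' of a, without taking a square root.
    // Assumes range is not negative.
    public static bool WithinRange ( Vector3 a, Vector3 b, float range )
    {
        return ( a - b ).LengthSquared () <= range * range;
    }

    public static void Main ()
    {
        Vector3 a = Vector3.Zero;
        Vector3 b = Vector3.One; // distance from a is sqrt(3), roughly 1.73

        Console.WriteLine ( WithinRange ( a, b, 2f ) );
        Console.WriteLine ( WithinRange ( a, b, 1f ) );
    }
}
```

    If the actual distance is needed ( e.g. to display it ), the square root cannot be avoided and Vector3.Distance is the right call.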
     
  17. JohnnyA

    JohnnyA

    Joined:
    Apr 9, 2010
    Posts:
    5,041
    Your main point about garbage collection is a good one. I think getting in to the little test cases is getting off topic and confuses the post.

    "Structs do get allocated to the heap when part of a class, and thus subject to garbage collection."

    Structs may be allocated to the heap or the stack. The decision to do so is based on the expected lifetime of the variable (is it short term or long term?). It's up to the compiler, but generally all local variables (value or refs) used in an iterator block are stored in the heap.

    More detailed discussion at: http://blogs.msdn.com/b/ericlippert/archive/2010/09/30/the-truth-about-value-types.aspx

    EDIT: Oops wrong link
     
  18. Games-Foundry

    Games-Foundry

    Joined:
    May 19, 2011
    Posts:
    632
    Test Objective
    Noisecrime's "real-world" test simulation - recycling v new Vector3()

    Case 1: new Vector3()

    Code (csharp):
    public Vector3 v3;

    public void Update ()
    {
        for ( int i = 0; i<1000000; i++ )
        {
            float x = Random.Range ( 0f, 100f );
            float y = Random.Range ( 0f, 100f );
            float z = Random.Range ( 0f, 100f );

            // ctor() called lots, 127.8ms
            v3 = new Vector3 ( x, y, z );
        }
    }
    Outcome:
    - 127.8ms

    $cpu14 - ctor.jpg


    Case 2: Recycling

    Code (csharp):
    public Vector3 v3;

    public void Update ()
    {
        for ( int i = 0; i<1000000; i++ )
        {
            float x = Random.Range ( 0f, 100f );
            float y = Random.Range ( 0f, 100f );
            float z = Random.Range ( 0f, 100f );

            v3.x = x;
            v3.y = y;
            v3.z = z;
        }
    }
    Outcome:
    - 108ms

    $cpu14 - recycle.jpg


    Conclusion
    Recycling is faster than new Vector3()
     
    Last edited: Jun 21, 2012
  19. jasonkaler

    jasonkaler

    Joined:
    Feb 14, 2011
    Posts:
    242
    Something to note is that when calling a method in c#, all parameters passed are copied each time. Passing an object therefore only requires 4 bytes to be copied, but a struct has to copy all its values, so for example passing a Vector3 struct will take 3 times longer than passing a Vector3 object would. You would also get better cache coherence as the original values will probably sit in the cpu cache.

    This may be trivial, but since we're talking performance, it's worth mentioning, especially if calls like that could happen thousands of times a second.
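
    To make the copy semantics concrete, here is a small plain C# sketch. Vec3 is a made-up struct standing in for Vector3, and ScaleCopy / ScaleInPlace are hypothetical helper names. Passing with ref hands the method a reference to the caller's struct instead of a copy. Whether that is measurably faster in a given case is something to check in the profiler rather than assume.

```csharp
using System;

// Hypothetical stand-in for a three-float struct such as Vector3.
public struct Vec3
{
    public float x, y, z;
}

public static class PassingDemo
{
    // Receives a copy: changes made here are not seen by the caller.
    public static void ScaleCopy ( Vec3 v, float s )
    {
        v.x *= s; v.y *= s; v.z *= s;
    }

    // Receives a reference to the caller's struct: no copy, changes are visible.
    public static void ScaleInPlace ( ref Vec3 v, float s )
    {
        v.x *= s; v.y *= s; v.z *= s;
    }

    public static void Main ()
    {
        Vec3 v = new Vec3 ();
        v.x = 1f; v.y = 2f; v.z = 3f;

        ScaleCopy ( v, 2f );
        Console.WriteLine ( v.x ); // caller's value untouched

        ScaleInPlace ( ref v, 2f );
        Console.WriteLine ( v.x ); // caller's value scaled
    }
}
```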
     
  20. Games-Foundry

    Games-Foundry

    Joined:
    May 19, 2011
    Posts:
    632
    Do you have time to send me a PM with a little code test so I can run the comparisons and upload the graphs?
     
  21. Eric5h5

    Eric5h5

    Volunteer Moderator Moderator

    Joined:
    Jul 19, 2006
    Posts:
    32,401
    It's definitely faster in all cases that I know of not to do "new Vector3" (or whatever struct) in a loop. I've benchmarked this a number of times in different real-world situations.

    It's apparently possible to run unsafe code in Unity (although not all platforms, such as the web player), but I haven't tried that yet.

    --Eric
     
  22. angrypenguin

    angrypenguin

    Joined:
    Dec 29, 2011
    Posts:
    15,616
    So, do these tests have the same results on all platforms?

    If I were worried about mobile performance I sure as heck wouldn't rely on performance trials someone ran on a desktop PC...
     
  23. Games-Foundry

    Games-Foundry

    Joined:
    May 19, 2011
    Posts:
    632
    A valid point. Although someone else will have to contribute those test results as I need to get back to working on Folk Tale. I will add a disclaimer to the OP until metric results become available.
     
    Last edited: Jun 21, 2012
  24. dterbeest

    dterbeest

    Joined:
    Mar 23, 2012
    Posts:
    389
    Hi,

    I find this thread really interesting. Sadly though, I will have to go through my scripts checking these optimizations.

    I was wondering how much performance is won using GetComponent<T>() functions in the Start() instead of in Update(). I have recently bought a package from the Asset Store and noticed a whole lot of GetComponent() calls in the Update functions, so I would like to know if I need to go through these scripts as well and try to optimize them on this part.
     
  25. superpig

    superpig

    Drink more water! Unity Technologies

    Joined:
    Jan 16, 2011
    Posts:
    4,649
    Careful. That's the Microsoft CLR, not Mono. As Eric says, "Versions of C# provided by other vendors may choose other allocation strategies for their temporary variables."

    Still, thanks for that link - fascinating read. I think I, like many others, assumed value types are on the stack because that's kinda like how things worked in C++.
     
  26. Games-Foundry

    Games-Foundry

    Joined:
    May 19, 2011
    Posts:
    632
    @dterbeest Sometimes it's unavoidable having GetComponent<T> in update functions, for example where the object you're calling it on is not previously known. However, where you do know what the object is, it's going to make sense to cache the component at the start rather than getting it in Update(). I'll do the test to illustrate this.

    Edit: actually I'm not sure I can demonstrate this easily in a simple test. CPU caching may produce misleading results.
     
    Last edited: Jun 21, 2012
  27. superpig

    superpig

    Drink more water! Unity Technologies

    Joined:
    Jan 16, 2011
    Posts:
    4,649
    YES! Just because Unity gives you a garbage collector does not mean you should be using it.

    Note that the cost saving of not calling the constructor is often negated by the fact that you need to call some other kind of 'Re-init' method on the recycled object.

    Yes, when value types are used as part of a class or structure, they're collected as part of that class or structure - if the whole class is getting GCed, the fact that some particular member is a value type makes no difference.

    Do we have any explanation for why that is, yet?

    Yes; Vector3.Distance includes a square-root that SqrMagnitude does not.


    Disclaimer
    These tests have been run on a PC desktop. The results may differ on other platforms and I encourage you to run your own tests.
     
  28. JohnnyA

    JohnnyA

    Joined:
    Apr 9, 2010
    Posts:
    5,041
    Yeah, that's why I posted the full discussion. I expect that many of the decisions are the same as they seem to be clearly the best, but for others that may not be the case.

    EDIT:

    Likely to be Stack vs Heap. Regardless of speed, creating a member variable for something that only has a local scope seems pretty strange anyway.

    There are also some cases where the Mathf library is pretty slow.

    More common is using complex functions that do more than you need them to do (e.g. 3d maths when you only need 2d).

    However rarely are these issues actually bad enough to noticeably affect performance. I prefer to stick to a few obvious optimisations (cache components for example) but in general write easy to read code, and optimise if you need to.

    Another EDIT:

    Be careful of skewing your results because of other choices.

    You compare sqrMagnitude to magnitude by doing (a-b) every loop. If you instead (pre)calculate a-b and store it you should be able to get a clearer picture of the difference in performance.

    If calculating a-b each loop my system gives sqrMagnitude as ~ 1.3 times faster.
    If I pre-calculate a-b and use the stored result my system gives sqrMagnitude as ~ 2.0 times faster.
     
    Last edited: Jun 21, 2012
  29. angrypenguin

    angrypenguin

    Joined:
    Dec 29, 2011
    Posts:
    15,616
    I'd guess CPU cache. When it's a local, the variable in question is created when it's used, so it'll most likely be in the CPU cache because it's in a part of memory being actively operated on right now. When you access a member variable you're essentially calling up a random other piece of memory that's probably not already being operated on right now, so it has to be fetched. RAM is fast, but it's nowhere near as fast as cache.

    For this reason, it's sometimes faster to re-calculate values than it is to fetch pre-calculated or code-cached ones. An optimisation specialist once told me that on a modern CPU you can do 5 square roots in the time required to do one memory fetch - so over-reliance on pre-calculated values can be false economy. (Note: that's for math or other stuff which doesn't require fetches to work out. Don't try to optimise something without understanding how it actually works and testing before and after!)
     
  30. kellygravelyn

    kellygravelyn

    Joined:
    Jan 22, 2009
    Posts:
    143
    My main point (perhaps poorly worded) is basically what the MSDN docs say about value types (emphasis added):

    I'd also warn that the metrics and benchmarks around the vectors are leading away from the real issue of assets that allocate reference type garbage. For example in the case of this benchmark the net result is a difference of 19.8 ms over 1,000,000 constructions of Vector3. If you average that out, you're only spending 0.0000198ms more per iteration (so one construction over one recycle is only roughly 0.0000198ms longer). The conclusion is technically correct (it is faster to recycle than construct), but the actual real world scenario requires you to iterate a huge number of vectors before it even starts to approach an amount of time that matters.

    So really my advice for developers is to focus on reference type allocations and get those to zero at runtime. If you get all of those, then use a profiler and start optimizing the cases that are taking up the most time. Don't just go through code adding all sorts of struct recycling until you know it's actually causing a problem.
     
  31. Games-Foundry

    Games-Foundry

    Joined:
    May 19, 2011
    Posts:
    632
    Yeah, in retrospect I should have put the other stuff in a separate thread. So instead I've broken one of my cardinal rules and put red bold text in the OP to emphasize the key message.
     
  32. Games-Foundry

    Games-Foundry

    Joined:
    May 19, 2011
    Posts:
    632
    As my investigation continues, it appears that Unity's own code may be contributing to the garbage collector monster.

    Here we can see CharacterController.Move feeding around 8-10KB per update cycle:

    $cpu1-charactercontroller.jpg

    I've also noticed that calls to GetComponentsInChildren() cause allocation, so it'll be important to cache and reuse the result:

    $cpu4-getcomponentsinobject.jpg

    I'm being hindered because even with 16GB RAM the Deep Profile keeps crashing after a few seconds of clicking in the overview. This error is after I've turned off record and stopped the editor player, and I start looking through the data.

    $crash.JPG

    Can someone from Unity please review and comment on these findings in light of the decision to keep us on Mono 2.6? Thank you.
     
    Last edited: Jun 22, 2012
  33. Thomas-Pasieka

    Thomas-Pasieka

    Joined:
    Sep 19, 2005
    Posts:
    2,174
    @gamesfoundry - Your best chance of getting feedback is to report a bug using the bug reporter from within Unity. Attach files/project and even video/images if necessary.
     
  34. superpig

    superpig

    Drink more water! Unity Technologies

    Joined:
    Jan 16, 2011
    Posts:
    4,649
    The GetComponents/GetComponentsInChildren methods *must* allocate memory, because they return arrays. No way around that.
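
    Since the returned array has to be allocated on each call, the usual workaround is to make the call once and keep the result. A minimal sketch of that idea, assuming the set of children rarely changes; the class name CachedChildRenderers, the RefreshCache method and the use of Renderer are illustrative choices, not anything taken from the projects discussed in this thread:

```csharp
using UnityEngine;

// Hypothetical component: fetches the child renderers once and reuses the
// array, instead of calling GetComponentsInChildren every frame.
public class CachedChildRenderers : MonoBehaviour
{
    private Renderer[] childRenderers;

    void Awake()
    {
        // The array is allocated here, once, rather than per frame.
        childRenderers = GetComponentsInChildren<Renderer>();
    }

    void Update()
    {
        // Iterating the cached array does not allocate.
        for (int i = 0; i < childRenderers.Length; i++)
        {
            childRenderers[i].enabled = true; // placeholder per-frame work
        }
    }

    // Call this after children are added or removed so the cache stays valid.
    public void RefreshCache()
    {
        childRenderers = GetComponentsInChildren<Renderer>();
    }
}
```

    The trade-off is that the cache goes stale if the hierarchy changes, which is why the sketch exposes an explicit refresh.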

    That occurred to me, but I don't think it makes sense given the example code used in the test - the first access to the member variable might be a cache miss, but on every subsequent access it should still be cached. That initial cache miss certainly doesn't account for a 3.2ms difference...

    That said, it's a write, not a read, and I can't quite remember how the cache behaves in that situation. There are some invalidation flags that need to be set, I think, but I think the same is true for local variables. (Though maybe that depends on whether they've been put into registers or not.)
     
  35. Games-Foundry

    Games-Foundry

    Joined:
    May 19, 2011
    Posts:
    632
    Two bug reports filed. One for the profiler crash, one for the gc allocations. Images and link to this thread included, but not the 5GB project file.

    Edit: Sounds like allocation in GetComponentsInChildren is unavoidable (ref: Superpig), but why does CharacterController.Move need to allocate every frame?
     
    Last edited: Jun 22, 2012
  36. Games-Foundry

    Games-Foundry

    Joined:
    May 19, 2011
    Posts:
    632
    So is "GC Alloc" in the profiler showing memory allocations, and not data being marked for garbage collection?
     
  37. echtolion

    echtolion

    Joined:
    Jun 16, 2011
    Posts:
    140
    Wow, er, I didn't expect that to happen at all, I'm guessing this is a Mono issue likely related to 2.6 being rather outdated?

    edit:
    Did some generic benchmarking myself. This doesn't happen at all in .NET (2.0 and 4.0); however, it happened in Mono 2.6 and, surprisingly, in Mono 2.10.8 as well.
     
    Last edited: Jun 22, 2012
  38. superpig

    superpig

    Drink more water! Unity Technologies

    Joined:
    Jan 16, 2011
    Posts:
    4,649
    I believe it's showing memory allocations that will be GCed when they're no longer used - i.e. all managed allocations. This is as opposed to native allocations (e.g. textures) that aren't released by the GC. Pretty sure it's not a report of what the Boehm scanner identified as garbage in that frame, or anything like that.
     
  39. Games-Foundry

    Games-Foundry

    Joined:
    May 19, 2011
    Posts:
    632
    Test Objective
    To compare foreach loops with for loops using indexers, specifically looking at performance and allocation

    Case 1: foreach

    Code (csharp):
        public Transform characterWithProps;

        public void Update ()
        {
            int i;
            string name;

            for ( i=0; i<10000; i++ )
            {
                foreach ( Transform t in characterWithProps )
                    name = t.name;
            }
        }
    Outcome:
    - regular garbage collection; t.name is causing allocation
    - 19.7ms

    $cpu6-foreach.jpg


    Case 2: for with indexer

    Code (csharp):
        // I've dragged an object in the scene that has several child transforms
        public Transform characterWithProps;

        public void Update ()
        {
            int i;
            int j;
            int count;
            string name;
            Transform child;

            for ( i=0; i<10000; i++ )
            {
                count = characterWithProps.childCount;
                for ( j=0; j<count; j++ ) {
                    child = characterWithProps.GetChild ( j );
                    name = child.name;
                }
            }
        }
    Outcome:
    - Garbage Collector is still kicking in; name = child.name;
    - 10.7ms

    $cpu6-for.jpg


    Conclusion
    In this test, for loops with an indexer are much faster than foreach loops. There is not much difference in terms of allocation, although the foreach loop allocates slightly more because of the enumerator.
     
    Last edited: Jun 22, 2012
  40. jasonkaler

    jasonkaler

    Joined:
    Feb 14, 2011
    Posts:
    242
    I'll see what I can do.
    I've never really benchmarked it myself and personally wouldn't go too far out my way to implement this, but I found that tip, along with many others here:
    http://www.dotnetperls.com/optimization
     
  41. tatoforever

    tatoforever

    Joined:
    Apr 16, 2009
    Posts:
    4,364
    Interesting,
    I always try to avoid runtime allocations at all costs, caching whatever I can (I know it's harder, but I always like to squeeze the last bit of performance out of my code). The result: it's blazing fast.
    Here is a profiler shot of our current game (Forgotten Memories):
    $20120622-x5eo-63kb.jpg
    Not fully visible, but what you see there is a combination of update calls, coroutines, delegates and events (the entire game framework is running there, btw). As you can see, I do almost no allocation at runtime (this can of course differ from game to game depending on its features), but in our case it was possible, so I took advantage of it. :)
    PS: Going to try out all your test cases when I have some free time.
     
  42. Dreamora

    Dreamora

    Joined:
    Apr 5, 2008
    Posts:
    26,601
    What about replacing List<> with LinkedList<>, or alternatively initializing the List at the proper size, so that the constant resizing doesn't cripple the outcome? A real situation will never contain 1M objects (otherwise you will have a whole different host of problems, even more so on mobile), so in this test the overhead from the internal resizing plays a major role in the outcome because of the copy operations.
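
    The pre-sizing suggestion can be sketched in plain C#. This assumes the element count is known (or can be bounded) before filling; the class and method names are made up for illustration:

```csharp
using System.Collections.Generic;

public static class ListPresizeSketch
{
    // Allocates the backing array once, up front, via the capacity constructor.
    public static List<int> FillPresized(int count)
    {
        List<int> items = new List<int>(count);
        for (int i = 0; i < count; i++)
        {
            items.Add(i); // no regrowth while Count stays within the capacity
        }
        return items;
    }

    // Starts empty, so the backing array is reallocated and copied
    // several times as the list grows.
    public static List<int> FillDefault(int count)
    {
        List<int> items = new List<int>();
        for (int i = 0; i < count; i++)
        {
            items.Add(i);
        }
        return items;
    }
}
```

    Both methods return the same contents; the difference is only in how many intermediate arrays are allocated and discarded along the way.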
     
  43. Games-Foundry

    Games-Foundry

    Joined:
    May 19, 2011
    Posts:
    632
    @tatoforever - have you used any third party solutions? What do you do about pathfinding?

    Third-party solutions, coupled with the Unity CharacterController seem to be what's dragging things down right now. The day may yet come when I have to purge nearly all third-party code. Thankfully NGUI seems very well behaved so that will be staying.
     
  44. tatoforever

    tatoforever

    Joined:
    Apr 16, 2009
    Posts:
    4,364
    I'm only using EZGUI for the entire interface (btw it's an iOS game).
    The rest is Unity Pro built-in stuff (Navmesh, Umbra, etc).
    CharacterControllers are also very heavy (I only use 1 in my game, for the player); the rest of the characters are Navmesh agents (Unity uses a simplified controller for the agents).
    Runtime allocations are heavy, even on desktop computers. Try to cache as much as possible.
    What third party plugins are you using?
     
  45. n0mad

    n0mad

    Joined:
    Jan 27, 2009
    Posts:
    3,732
    Interesting topic.
    While we're speaking about core data optimizations, I've shared a data struct format I came to create recently, which I'm using for very complicated AI calculations and predictions. In my project these operations are sometimes required more than once per frame (like in a row of deterministic calculation routines), so I had to find a way to cut the crap out of redundancy.
    The idea is to know if a calculation is needed more than once per a certain period of time (default : 0.05 sec).

    Here is the script : http://forum.unity3d.com/threads/19299-fighting-game-Kinetic-Damage?p=959296&viewfull=1#post959296

    This can also greatly reduce that heap.
    If it can help anyone, serve yourself.
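
    The idea of skipping a recalculation that was already done within a short time window might look roughly like the sketch below. It is not the linked script: TimedCache and its members are hypothetical names, it is written as a class rather than a struct for simplicity, and the current time is passed in by the caller (in Unity that could be Time.time).

```csharp
using System;

// Hypothetical helper: recomputes a value only if the cached one is older
// than a given lifetime (the post above mentions 0.05 sec as a default).
public sealed class TimedCache<T>
{
    private readonly Func<T> compute;
    private readonly float lifetimeSeconds;
    private T cachedValue;
    private float computedAt;
    private bool hasValue;

    // Number of times the underlying calculation actually ran.
    public int ComputeCount { get; private set; }

    public TimedCache(Func<T> compute, float lifetimeSeconds)
    {
        this.compute = compute;
        this.lifetimeSeconds = lifetimeSeconds;
    }

    // 'now' is the current time in seconds, supplied by the caller.
    public T Get(float now)
    {
        if (!hasValue || now - computedAt >= lifetimeSeconds)
        {
            cachedValue = compute();
            computedAt = now;
            hasValue = true;
            ComputeCount++;
        }
        return cachedValue;
    }
}
```

    Several callers asking for the same result within the window then share one calculation instead of each triggering their own.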

    Another little-known but significant optimization: when you're using a Dictionary, the default comparer usually performs poorly when the keys aren't primitive values such as ints (for example when you're using enums, classes, or even strings).
    The biggest perf hit is when you're using Enum as keys, the dictionary will use reflection to compare... which is horribly slow.
    So you have to create your own EnumComparer class, and insert it with your dictionary creation.

    For enums, the class would be :

    Code (csharp):
        using System.Collections.Generic;

        // MyEnum stands in for your own enum type.
        class MyEnumComparer : IEqualityComparer<MyEnum>
        {
            public static readonly MyEnumComparer Instance = new MyEnumComparer();

            #region IEqualityComparer<MyEnum> Members
            // Direct comparison of the enum values.
            public bool Equals(MyEnum x, MyEnum y) {
                return (x == y);
            }
            // The underlying integer value serves as the hash code.
            public int GetHashCode(MyEnum obj) {
                return (int) obj;
            }
            #endregion
        }
    Then you use the
    Code (csharp):
        new Dictionary<MyEnum, myData>(MyEnumComparer.Instance)
    to initialize it.
    This makes any Dictionary operation way, way faster (especially with Enum).
    props to Vojislav Stojkovic for this : http://beardseye.blogspot.fr/2007/08/nuts-enum-conundrum.html

    For string keys, it's better to use System.StringComparer.Ordinal as the comparer, as it doesn't do anything more than just checking each letter equality (as opposed to the default comparer, it seems).
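
    For the string-key case, the comparer is passed to the Dictionary constructor in the same way as the enum comparer above. A minimal sketch; the class name and the keys are made up, and whether this is measurably faster than the default comparer on a given runtime is something to profile rather than assume:

```csharp
using System;
using System.Collections.Generic;

public static class OrdinalKeySketch
{
    // Builds a string-keyed lookup that compares keys with StringComparer.Ordinal.
    public static Dictionary<string, int> BuildLookup()
    {
        Dictionary<string, int> lookup =
            new Dictionary<string, int>(StringComparer.Ordinal);
        lookup["idle"] = 0;
        lookup["walk"] = 1;
        return lookup;
    }
}
```

    Ordinal comparison is case-sensitive and culture-independent, so "walk" and "WALK" are different keys here.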
     
    Last edited: Jun 22, 2012
  46. Arges

    Arges

    Joined:
    Oct 5, 2008
    Posts:
    359
    Hah, cool! Thanks for that - I just noticed that Vehicle is not using the cached rigidbody value. I'll fix that on the development branch.

    Cheers!
     
  47. Games-Foundry

    Games-Foundry

    Joined:
    May 19, 2011
    Posts:
    632
    Great! Glad some of these tests are proving useful.

    @Nomad I wondered when you would make an appearance :) I've read your optimization posts in the past too, so your input to the discussion is most welcome. I'll give that struct code a read over when I'm less tired.

    @tatoforever Well we have a challenge. Detailed outdoor scenes are a bitch, it has to be said. The amount of content on screen at any one time is quite high as you can probably tell from Folk Tale screenshots. I'm using every trick I know to squeeze more performance out.

    I'm using character controllers for all characters at the moment, which are indeed expensive. I tried rigidbodies several months ago and just couldn't get them behaving exactly how I wanted. I may end up trying again now before beta starts. A*Pathfinding is in grid graph mode because we have building construction and need to recalculate areas of the graph at runtime ( not an option with navmesh solutions ). Megafiers is used for some real-time deformation and all facial animation. iTween is used for some stuff including LOD transitions ( scale / fade ) and cutscene cameras, but its use has been declining as I rewrite. UnitySteer is still in there for water transport, but I'm migrating that code to use A*Pathfinding List Graphs as the routes are always pre-defined. And finally VLights for volumetric lighting, but that'll get the boot if the author doesn't make any progress with optimizations.
     
  48. n0mad

    n0mad

    Joined:
    Jan 27, 2009
    Posts:
    3,732
    Thank you :)
    Actually I'm not seeing myself as a huge pro on C# so I'm not that confident in an active participation, but if anything can help, I'll be glad to :)

    edit :
    Thanks Tato (your profiler sample is impressive !) :) Hey that makes me think, there are several optimization tips threads all over the forum, but not a centralized one. Could be beneficial for everyone to gather all that stuff around and centralize somewhere (like, here, precisely) ? (à la iOS forum FAQ "General Performance" section).
     
    Last edited: Jun 22, 2012
  49. tatoforever

    tatoforever

    Joined:
    Apr 16, 2009
    Posts:
    4,364
    Of course yes. ;)
     
  50. superpig

    superpig

    Drink more water! Unity Technologies

    Joined:
    Jan 16, 2011
    Posts:
    4,649
    It just occurred to me that you're using Deep Profiling.

    Be very, very careful about drawing timing conclusions from such low-level structures with Deep Profiling. It adds substantial overhead to every new call stack frame, so when dealing with very simple structures (like basic loops), recording the sample can very quickly dwarf the time taken to actually perform the operation. In particular, simple structures that involve function calls (like Enumerator.MoveNext()) will seem a lot slower than those that don't because calling functions means recording more deep profiler samples.

    Deep profiling is useful for tracking down memory allocations, and it's useful if you know that something is taking 20ms and you want to see exactly what proportion of that is going where, but in terms of absolute timings I wouldn't trust it for a nanosecond.
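
    One common way to get absolute timings without per-call profiler instrumentation is to time the code directly with System.Diagnostics.Stopwatch. A rough sketch under that assumption; the helper name is made up, and this is not necessarily how the figures earlier in the thread should have been measured:

```csharp
using System;
using System.Diagnostics;

public static class LoopTimingSketch
{
    // Runs 'body' the given number of times and returns the elapsed
    // wall-clock time in milliseconds.
    public static double TimeMilliseconds(Action body, int iterations)
    {
        body(); // one warm-up call so JIT compilation is not part of the timing

        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            body();
        }
        watch.Stop();
        return watch.Elapsed.TotalMilliseconds;
    }
}
```

    The delegate call adds its own small cost per iteration, so for very cheap operations it is better to put the whole loop under test inside 'body' and compare totals between variants.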