
Vector3 and other structs optimization of operators

Discussion in 'Scripting' started by Aka_ToolBuddy, Jun 17, 2017.

  1. Aka_ToolBuddy

    Aka_ToolBuddy

    Joined:
    Feb 25, 2014
    Posts:
    543
    Hi,
    While working on real-time mesh generation code that performed hundreds of thousands of Vector3 operations per frame, I was surprised to find that the operators (*, +, ...) of Vector3 (and other Unity structs) can be easily and massively optimized.

    The current implementation of the * operator in Vector3 is:
    Code (CSharp):
    public static Vector3 operator *(Vector3 a, float d)
    {
        return new Vector3(a.x * d, a.y * d, a.z * d);
    }
    The optimized implementation I suggest is:
    Code (CSharp):
    public static Vector3 operator *(Vector3 a, float d)
    {
        Vector3 result;
        result.x = a.x * d;
        result.y = a.y * d;
        result.z = a.z * d;
        return result;
    }
    I ran a simple comparison test that you can find here https://dropb.in/ponda.nimrod, and the result was:
    When run 50000 times, Unity's current operator took 18.9 ms to execute, while the optimized one took 2.5 ms.
    The reason behind this difference is that the optimized version avoids an unnecessary call to the Vector3 constructor.
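
    For readers who want to reproduce this kind of comparison outside Unity, here is a minimal sketch. It is not the linked test script: Vec3 is a hypothetical stand-in for UnityEngine.Vector3, and the names MulCtor, MulFields and OperatorBenchmark are made up for illustration. Timings depend heavily on the runtime and its JIT, so no numbers are quoted.

```csharp
using System;
using System.Diagnostics;

// Hypothetical stand-in for UnityEngine.Vector3 so the sketch runs outside Unity.
public struct Vec3
{
    public float x, y, z;

    public Vec3(float x, float y, float z)
    {
        this.x = x;
        this.y = y;
        this.z = z;
    }

    // Constructor-based version, mirroring the current operator.
    public static Vec3 MulCtor(Vec3 a, float d)
    {
        return new Vec3(a.x * d, a.y * d, a.z * d);
    }

    // Field-assignment version, mirroring the suggested operator.
    public static Vec3 MulFields(Vec3 a, float d)
    {
        Vec3 result;
        result.x = a.x * d;
        result.y = a.y * d;
        result.z = a.z * d;
        return result;
    }
}

public static class OperatorBenchmark
{
    const int Iterations = 50000; // same iteration count as in the post

    public static void Run()
    {
        Vec3 v = new Vec3(1f, 2f, 3f);
        Vec3 sink = default(Vec3);

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < Iterations; i++) sink = Vec3.MulCtor(v, 1.5f);
        sw.Stop();
        Console.WriteLine("constructor version: " + sw.Elapsed.TotalMilliseconds + " ms");

        sw = Stopwatch.StartNew();
        for (int i = 0; i < Iterations; i++) sink = Vec3.MulFields(v, 1.5f);
        sw.Stop();
        Console.WriteLine("field version: " + sw.Elapsed.TotalMilliseconds + " ms");

        // Use the result so the loops cannot be discarded as dead code.
        Console.WriteLine(sink.x);
    }
}
```

    Calling OperatorBenchmark.Run() from a console program's Main prints both timings. Both methods return the same values; only the way the result struct is built differs.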

    I opened a suggestion on Unity's feedback site, so please support it by voting for it so that we can see this optimization integrated into Unity some day:
    https://feedback.unity3d.com/suggestions/vector3-and-other-structs-optimization-of-operators


    Edit: The Unity feedback website is no more. You can find these optimizations implemented in my asset called Frame Rate Booster.

    Thanks and have a nice day.
     
    Last edited: Jun 9, 2020
  2. ThomasTrenkwalder

    ThomasTrenkwalder

    Joined:
    Jun 18, 2017
    Posts:
    10
    Interesting, I never thought about checking the performance of Unity's math library.
    I just tried out the operator you mentioned, and I do indeed get better performance when I roll my own struct and implement the operator as you suggest.
    Unity's Vector3 seems to take 30% more time for me, both inside the editor and inside a build (on 5.6.1f1).
     
  3. Aka_ToolBuddy

    Thanks a lot ThomasTrenkwalder for your tests, and thanks for voting for my suggestion, I hope it will make the Unity team consider the suggestion.
     
  4. CrystalConflux

    CrystalConflux

    Joined:
    May 25, 2017
    Posts:
    107
    Interesting. Considering that the current implementation is less verbose, and all that the constructor does is assign the corresponding fields, I wonder why the C# compiler doesn't optimize this?

    Have you tested this in standalone release mode? Maybe it only affects debug mode?

    When you tested it in standalone did you disable development build/script debugging?
     
    Last edited: Jun 18, 2017
  5. Aka_ToolBuddy

    I wonder the same thing. It seems to me to be something the compiler can handle, but I suppose things are more complicated than what I imagine.

    I confirm the optimization works in those conditions as well. I used a heavier version of the script above, and used Fraps to get the FPS count (to exclude any Unity's profiler possible issue), and here are the results:
    - Optimized version: 53 FPS
    - Unoptimized version: 42 FPS
     
  6. Rick-Gamez

    Rick-Gamez

    Joined:
    Mar 23, 2015
    Posts:
    218
    Wow I didn't realize that running the constructor in this case would make that big of difference. (I'm self taught BTW) but thanks for this insight. I will keep this in mind when developing my stuff. Thanks for the info!
     
  7. lordofduct

    lordofduct

    Joined:
    Oct 3, 2011
    Posts:
    8,531
    yep, a constructor function is just that... a function.

    So it allocates a stack frame to call it.

    If you don't call the constructor though, it just allocates the memory needed for the struct with empty values.

    This is why structs don't allow field initializers: they MUST start with empty values. Classes, on the other hand, always have a constructor phase, so they don't have this restriction.

    ...

    I find this a minor optimization, probably a leftover from early Unity. I bet it came about because the Unity devs were all C++ programmers first and foremost, and so didn't really consider the inner workings of the Mono CLR. But it is an area of optimization that could potentially give a little oomph, since vector construction is very common.
     
  8. Rick-Gamez

    Yeah, I knew that the constructor is basically a glorified method, but that sheds some light on how C# allocates its stack frames, so thank you for that info!
     
  9. ThomasTrenkwalder

    Yup, no development build here. I measured the times using the .NET Stopwatch class.
    One would think that the compilers (either the C# one or the JIT) should be able to inline this constructor call, but apparently they just don't.

    Considering that working with Vector3s and other math structs is quite common in many games, optimizing these operators would provide a nice benefit, and it doesn't even look like a lot of work ^^
     
  10. Aka_ToolBuddy

    I completely agree. I think that implementing these optimizations could be done in less than a man-day.
    You are welcome :) And please consider voting for the suggestion to hopefully make Unity's team implement it.
    https://feedback.unity3d.com/suggestions/vector3-and-other-structs-optimization-of-operators
     
  11. Aka_ToolBuddy

    For those who are interested, here are the IL instructions for the optimized Vector3 multiplication

    Code (CSharp):
    .method public hidebysig static
        valuetype [UnityEngine]UnityEngine.Vector3 Optimized_Multiplication (
            valuetype [UnityEngine]UnityEngine.Vector3 a,
            float32 d
        ) cil managed
    {
        // Method begins at RVA 0x20e0
        // Code size 58 (0x3a)
        .maxstack 3
        .locals init (
            [0] valuetype [UnityEngine]UnityEngine.Vector3,
            [1] valuetype [UnityEngine]UnityEngine.Vector3
        )

        IL_0000: nop
        IL_0001: ldloca.s 0
        IL_0003: ldarga.s a
        IL_0005: ldfld float32 [UnityEngine]UnityEngine.Vector3::x
        IL_000a: ldarg.1
        IL_000b: mul
        IL_000c: stfld float32 [UnityEngine]UnityEngine.Vector3::x
        IL_0011: ldloca.s 0
        IL_0013: ldarga.s a
        IL_0015: ldfld float32 [UnityEngine]UnityEngine.Vector3::y
        IL_001a: ldarg.1
        IL_001b: mul
        IL_001c: stfld float32 [UnityEngine]UnityEngine.Vector3::y
        IL_0021: ldloca.s 0
        IL_0023: ldarga.s a
        IL_0025: ldfld float32 [UnityEngine]UnityEngine.Vector3::z
        IL_002a: ldarg.1
        IL_002b: mul
        IL_002c: stfld float32 [UnityEngine]UnityEngine.Vector3::z
        IL_0031: ldloc.0
        IL_0032: stloc.1
        IL_0033: br IL_0038

        IL_0038: ldloc.1
        IL_0039: ret
    } // end of method test::Optimized_Multiplication
    and those for the unoptimized one

    Code (CSharp):
    .method public hidebysig specialname static
        valuetype UnityEngine.Vector3 op_Multiply (
            valuetype UnityEngine.Vector3 a,
            float32 d
        ) cil managed
    {
        // Method begins at RVA 0xb5b8
        // Code size 41 (0x29)
        .maxstack 4
        .locals init (
            [0] valuetype UnityEngine.Vector3
        )

        IL_0000: nop
        IL_0001: ldarga.s a
        IL_0003: ldfld float32 UnityEngine.Vector3::x
        IL_0008: ldarg.1
        IL_0009: mul
        IL_000a: ldarga.s a
        IL_000c: ldfld float32 UnityEngine.Vector3::y
        IL_0011: ldarg.1
        IL_0012: mul
        IL_0013: ldarga.s a
        IL_0015: ldfld float32 UnityEngine.Vector3::z
        IL_001a: ldarg.1
        IL_001b: mul
        IL_001c: newobj instance void UnityEngine.Vector3::.ctor(float32, float32, float32)
        IL_0021: stloc.0
        IL_0022: br IL_0027

        IL_0027: ldloc.0
        IL_0028: ret
    } // end of method Vector3::op_Multiply
     
  12. Invertex

    Invertex

    Joined:
    Nov 7, 2013
    Posts:
    1,550
    Did a test because I was curious if the same issue would happen with the object initializer {} feature.

    Code (CSharp):
    public static Vector3 GetSomeVector3()
    {
        Vector3 vec;
        vec.x = 3.4f; vec.y = 2.3f; vec.z = 55.5f;
        return vec;
    }

    IL_0000 nop
    IL_0001 ldloca.s  vec
    IL_0003 ldc.r4    3.4
    IL_0008 stfld     System.Single UnityEngine.Vector3::x
    IL_000D ldloca.s  vec
    IL_000F ldc.r4    2.3
    IL_0014 stfld     System.Single UnityEngine.Vector3::y
    IL_0019 ldloca.s  vec
    IL_001B ldc.r4    55.5
    IL_0020 stfld     System.Single UnityEngine.Vector3::z
    IL_0025 ldloc.0
    IL_0026 stloc.1
    IL_0027 br.s      IL_0029
    IL_0029 ldloc.1
    IL_002A ret

    public static Vector3 SomeNewVector3()
    {
        return new Vector3 {x = 3.4f, y = 2.3f, z = 55.5f };
    }

    IL_0000 nop
    IL_0001 ldloca.s  V_0 //Extra Instruction
    IL_0003 initobj   UnityEngine.Vector3 //Extra Instruction
    IL_0009 ldloca.s  V_0
    IL_000B ldc.r4    3.4
    IL_0010 stfld     System.Single UnityEngine.Vector3::x
    IL_0015 ldloca.s  V_0
    IL_0017 ldc.r4    2.3
    IL_001C stfld     System.Single UnityEngine.Vector3::y
    IL_0021 ldloca.s  V_0
    IL_0023 ldc.r4    55.5
    IL_0028 stfld     System.Single UnityEngine.Vector3::z
    IL_002D ldloc.0
    IL_002E stloc.1
    IL_002F br.s      IL_0031
    IL_0031 ldloc.1
    IL_0032 ret

    The object initializer method does also avoid the call to the constructor, but it still has two extra instructions, the important one being an initobj call, which is going to cause a bit of extra work to be done in the form of it initializing all the values of the struct to zero or null. So while that should still be a lot better than the call to the constructor, the local declaration and assignment still wins out.

    I'm really surprised the CLR doesn't optimize this initobj call out if it detects you're assigning to every value in the struct.
     
  13. Aka_ToolBuddy

    Thanks for that extra information. I didn't think to test that as well.
     
  14. TJHeuvel-net

    TJHeuvel-net

    Joined:
    Jul 31, 2012
    Posts:
    838
  15. Doug_B

    Doug_B

    Joined:
    Jun 4, 2017
    Posts:
    1,596
    I linked to it over on this other thread earlier on. Vote count has gone from 152 to 165 in three hours.

    I wonder why they cannot just fix the aforementioned request rather than create a whole new library that you have to know to get and integrate? I appreciate that a release of Unity (which is presumably what would be required) is no small matter. However, this does seem to be such a fundamental part of a 3D platform to reasonably have expectations of an efficient implementation.

    But then maybe I am simply missing something here. :)
     
  16. Invertex

    You are missing something :p
    That mathematics library isn't the "solution" to this tiny little problem here, it's completely unrelated to it. That mathematics library is designed to help ensure highly efficient compilation of your complex vector/matrix/etc.. math in general, helping it be tightly packed and memory efficient in the burst compiler.
    That mathematics library will be integrated in Unity... It's just that it's quite beta right now so people who want to mess with it right now can do so through the repository and also help find bugs or contribute improvements (at some point potentially).
     
  17. Doug_B

    Ah, ok. I've got my wires crossed. That means my vote for improved struct performance may not have been wasted then - assuming that ever gets looked at. :)
     
  18. Peter77

    Peter77

    QA Jesus

    Joined:
    Jun 12, 2013
    Posts:
    6,609
    I rewrote the IL of some Unity's DLLs and measured performance of a few applications. My conclusion was that Unity Technologies can achieve quite some performance improvements, with very little work, with trivial changes only, without actually changing something in user-code.

    Yes, they do provide a new math lib, but to make use of it, you need to change your project. This probably gives better performance, but it might not be a trivial change. If Unity just changed some simple code in their Vector structs, every existing Unity project would benefit from those changes automagically.

    Here are my findings:
    https://forum.unity.com/threads/wip...faster-without-any-changes-in-seconds.531169/
     
  19. Doug_B

    Interesting video. Thumbs up from me. :)
     
  20. Aka_ToolBuddy

    The new Unity mathematics library definitely has its benefits, and they are greater than those of the optimizations this thread is about. But using that library means you have to modify or rewrite parts of your code. The Vector3 (and similar) optimization works with zero modifications to your code.

    What kills me the most is knowing that this optimization should hardly take Unity's developers more than a man-day to implement, which is peanuts given the performance increase it brings (Peter77 spoke here about a 4% increase in his game). Knowing that people at Unity are aware of this optimization (suggestion ticket + me writing to them), the most probable explanation I see is that the internal organization of the company has become so complicated that making such simple, useful modifications has become a daunting task.
     
  21. Aka_ToolBuddy

    Wow, that's some great tooling there. Thanks a lot Peter for making this, and pushing the idea beyond where I stopped.
     
    Last edited: Jun 5, 2018
  22. Aka_ToolBuddy

    Thanks a lot for spreading the word.
     
  23. Aka_ToolBuddy

    Last edited: May 4, 2020
  24. scsc

    scsc

    Joined:
    Oct 22, 2016
    Posts:
    3
    I'm resurrecting the thread after almost 2 years, because it's still a top google result of phrases like "unity vector3 operator performance", and there have been no official updates yet. The suggestion at Unity's feedback site was also removed without any redirection, as they moved the feedback solely to the forums.

    Is there a reason why this solution cannot be integrated officially by Unity into types like Vector3? It doesn't require any effort, and could easily be backported even to LTS Unity versions like 2018. Even static methods like Vector3.Distance(a, b) internally use a constructor for no good reason, and Vector3.SqrMagnitude(a - b) requires users to pass a single Vector3 parameter, which invokes a constructor as part of the minus operator. This produces a huge visible performance difference in the Unity Profiler, affecting all of your libraries, for example if you use Mirror as a multiplayer solution. The optimization asset from the last post isn't on the Asset Store anymore, and I've noticed there's an ILOptimizer solution at https://forum.unity.com/threads/wip...faster-without-any-changes-in-seconds.531169/ , but I don't see any official recommendations. Of course, Unity is currently transitioning towards the DOTS approach with the Unity.Mathematics library, but that doesn't help existing projects at all. So what should we do to increase the performance of our libraries? Also, because it's not documented anywhere: are there compilers that already take care of this issue, so that people who trust their profiler are wasting their time chasing a problem that doesn't exist, or was it simply ignored by Unity for over 2 years?

    Furthermore, calling a function seems to have its own overhead (because it's not inlined), but I've observed that using the "in" (readonly ref) modifier on the parameter, such as "in Vector3 a" instead of "Vector3 a", seems to reduce this overhead by roughly 10% - 50%, as the struct isn't unnecessarily copied. I think this should be simple for compilers to optimize automatically in static methods such as those in the Vector3 class, but that doesn't seem to be the case either. Can anyone comment on that, so that people who also google for a solution can find some answers?
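
    To make the "in" idea concrete, here is a minimal sketch that assumes C# 7.2 or later. Vec3, Vec3Math and both method names are hypothetical stand-ins, not Unity APIs; the point is only to show a squared distance computed with by-value parameters and a temporary struct versus "in" parameters and plain floats. Whether it is actually faster depends on the runtime, so treat it as something to measure rather than a guarantee.

```csharp
using System;

// Hypothetical stand-in for UnityEngine.Vector3.
public struct Vec3
{
    public float x, y, z;

    public Vec3(float x, float y, float z)
    {
        this.x = x;
        this.y = y;
        this.z = z;
    }
}

public static class Vec3Math
{
    // By-value parameters: both structs are copied at the call site,
    // and the difference is built as a temporary through the constructor.
    public static float SqrDistanceByValue(Vec3 a, Vec3 b)
    {
        Vec3 d = new Vec3(a.x - b.x, a.y - b.y, a.z - b.z);
        return d.x * d.x + d.y * d.y + d.z * d.z;
    }

    // "in" parameters are passed by readonly reference, and the
    // difference is kept in three local floats instead of a struct.
    public static float SqrDistanceIn(in Vec3 a, in Vec3 b)
    {
        float dx = a.x - b.x;
        float dy = a.y - b.y;
        float dz = a.z - b.z;
        return dx * dx + dy * dy + dz * dz;
    }
}
```

    Both methods are called the same way, for example Vec3Math.SqrDistanceIn(p, q); the "in" keyword is optional at the call site.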

    Here are some further links, some of which even predate 2018, as I wasn't able to find any new official information:
    2011 https://forum.unity.com/threads/vector3-operations-performance.103575/
    2015 https://answers.unity.com/questions/1033383/code-performance-when-to-use-new-on-vector3.html (a comment mentions that the new operator isn't a real "new", which doesn't seem to be the case based on our benchmarks)
    2018 https://www.reddit.com/r/Unity3D/comments/7w0dvm/dear_unity_why_vector3_isnt_optimized/
    2018 https://answers.unity.com/questions/1524021/performance-of-vector-addition-vs-component-additi.html
    2020 https://answers.unity.com/questions/1698286/does-making-a-new-vector3-have-an-impact-on-perfor.html
     
  25. Aka_ToolBuddy

    Hi,

    Thanks for your interest in my asset, and for sharing the additional information about this subject.

    Unity recently discontinued its asset store's old domain name, that's why the link from my post to Frame Rate Booster wasn't working anymore. I updated that link in my post above. The links on my website toolbuddy.net should always be up to date.

    When it comes to why this isn't part of Unity yet, I am as annoyed as you, because like you said, it doesn't require any effort. I tried contacting people at Unity, and I either get no answer, or the irrelevant answer about using DOTS or Unity.Mathematics.

    Have a nice day
     
  26. Aka_ToolBuddy

    Hi again,
    I have an update about the optimizations discussed in this thread and IL2CPP.
    As you may already know, IL2CPP transforms IL assemblies into C++ code, then builds that code for the selected platform. A reasonable assumption would be that the IL2CPP output for a non-optimized IL assembly is slower than the output for an optimized one. Unfortunately, this is not the case. In my tests, the output for the optimized IL assembly was even slower. Before I explain why, please keep in mind that I have virtually zero experience with C++, so please correct me if my explanation is wrong:

    This is how IL2CPP transforms a non-optimized Vector3 addition:
    Code (CSharp):
    public static Vector3 operator +(Vector3 a, Vector3 b)
    {
        return new Vector3(a.x + b.x, a.y + b.y, a.z + b.z);
    }
    becomes
    Code (CSharp):
    // UnityEngine.Vector3 UnityEngine.Vector3::op_Addition(UnityEngine.Vector3,UnityEngine.Vector3)
    IL2CPP_EXTERN_C IL2CPP_METHOD_ATTR Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720  Vector3_op_Addition_m929F9C17E5D11B94D50B4AFF1D730B70CB59B50E (Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720  ___a0, Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720  ___b1, const RuntimeMethod* method)
    {
        Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720  V_0;
        memset((&V_0), 0, sizeof(V_0));
        {
            Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720  L_0 = ___a0;
            float L_1 = L_0.get_x_0();
            Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720  L_2 = ___b1;
            float L_3 = L_2.get_x_0();
            Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720  L_4 = ___a0;
            float L_5 = L_4.get_y_1();
            Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720  L_6 = ___b1;
            float L_7 = L_6.get_y_1();
            Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720  L_8 = ___a0;
            float L_9 = L_8.get_z_2();
            Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720  L_10 = ___b1;
            float L_11 = L_10.get_z_2();
            Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720  L_12;
            memset((&L_12), 0, sizeof(L_12));
            Vector3__ctor_m08F61F548AA5836D8789843ACB4A81E4963D2EE1((&L_12), ((float)il2cpp_codegen_add((float)L_1, (float)L_3)), ((float)il2cpp_codegen_add((float)L_5, (float)L_7)), ((float)il2cpp_codegen_add((float)L_9, (float)L_11)), /*hidden argument*/NULL);
            V_0 = L_12;
            goto IL_0030;
        }

    IL_0030:
        {
            Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720  L_13 = V_0;
            return L_13;
        }
    }
    As you can see, just like the non-optimized IL, the non-optimized C++ code allocates a Vector3 that is never used before being overwritten later.

    Here is the IL2CPP result when run on the optimized version:
    Code (CSharp):
    public static Vector3 operator +(Vector3 a, Vector3 b)
    {
       a.x += b.x;
       a.y += b.y;
       a.z += b.z;
       return a;
    }
    becomes
    Code (CSharp):
    // UnityEngine.Vector3 UnityEngine.Vector3::op_Addition(UnityEngine.Vector3,UnityEngine.Vector3)
    IL2CPP_EXTERN_C IL2CPP_METHOD_ATTR Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720  Vector3_op_Addition_m929F9C17E5D11B94D50B4AFF1D730B70CB59B50E (Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720  ___a0, Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720  ___b1, const RuntimeMethod* method)
    {
        {
            float* L_0 = (&___a0)->get_address_of_x_0();
            float* L_1 = L_0;
            float L_2 = *((float*)L_1);
            Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720  L_3 = ___b1;
            float L_4 = L_3.get_x_0();
            *((float*)L_1) = (float)((float)il2cpp_codegen_add((float)L_2, (float)L_4));
            float* L_5 = (&___a0)->get_address_of_y_1();
            float* L_6 = L_5;
            float L_7 = *((float*)L_6);
            Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720  L_8 = ___b1;
            float L_9 = L_8.get_y_1();
            *((float*)L_6) = (float)((float)il2cpp_codegen_add((float)L_7, (float)L_9));
            float* L_10 = (&___a0)->get_address_of_z_2();
            float* L_11 = L_10;
            float L_12 = *((float*)L_11);
            Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720  L_13 = ___b1;
            float L_14 = L_13.get_z_2();
            *((float*)L_11) = (float)((float)il2cpp_codegen_add((float)L_12, (float)L_14));
            Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720  L_15 = ___a0;
            return L_15;
        }
    }
    As you can see, the unnecessary memory allocation is gone, but for some reason the vector's x, y and z fields are accessed in a complicated and slow way, which nullifies the optimization done at the IL level.

    So if the IL2CPP-generated code accessed the x, y and z fields the simple and fast way, the IL optimization could also be useful for projects using IL2CPP.

    I am planning on contacting someone at Unity about this. I hope this time they will be responsive. I will keep you updated if I get an answer, and of course I will update my asset to be compatible with IL2CPP once the problem is solved.

    Please share with me your thoughts, and have a nice day
     
    Last edited: Jun 10, 2020
  27. Kamyker

    Kamyker

    Joined:
    May 14, 2013
    Posts:
    1,090
    Btw, Unity.Mathematics uses similar C# code:
    https://github.com/Unity-Technologi...2c4b/src/Unity.Mathematics/float3.gen.cs#L224
    Code (CSharp):
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static float3 operator + (float3 lhs, float3 rhs) { return new float3 (lhs.x + rhs.x, lhs.y + rhs.y, lhs.z + rhs.z); }
    Interesting that it's slower than default, but I guess AggressiveInlining makes it faster anyway in IL2CPP.
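
    For anyone who has not used the attribute before, here is a small self-contained sketch of the same pattern. Float3Like is a hypothetical struct, not the real Unity.Mathematics float3. MethodImplOptions.AggressiveInlining lives in System.Runtime.CompilerServices and is only a hint: the runtime or compiler may still decide not to inline.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical struct written in the style of the float3 snippet above.
public struct Float3Like
{
    public float x, y, z;

    public Float3Like(float x, float y, float z)
    {
        this.x = x;
        this.y = y;
        this.z = z;
    }

    // Asks the JIT/AOT compiler to inline the operator at each call site.
    // This is a request, not a guarantee.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Float3Like operator +(Float3Like lhs, Float3Like rhs)
    {
        return new Float3Like(lhs.x + rhs.x, lhs.y + rhs.y, lhs.z + rhs.z);
    }
}
```

    If the call is inlined, the constructor call can be folded into the caller, which is one possible reason the constructor-based form does not have to be slow there.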
     
  28. Aka_ToolBuddy

    I don't believe that the inlining is what explains the performance of Mathematics.

    When speaking about Vector3 (and similar), the fact that the addition (for example) is implemented as "return new Vector3(a.x + b.x, a.y + b.y, a.z + b.z);" instead of my optimized implementation (that does not call the constructor) is not the problem. The problem is that the compiler is not smart enough to optimize Unity's implementation by skipping the constructor's call. My implementation is just a way to force the compiler to not call the constructor. From my understanding, other C# compilers do that optimization.

    I am not familiar with Unity.Mathematics, but from my understanding it uses a different, specially optimized, compiler. So there is nothing strange in having different performance between the addition of float3 and Vector3 even if they have the same C# implementation.
     
  29. JoshPeterson

    JoshPeterson

    Unity Technologies

    Joined:
    Jul 21, 2014
    Posts:
    6,931
    It is important to note that the two different C# code snippets here do vastly different things.

    Code (CSharp):
    public static Vector3 operator +(Vector3 a, Vector3 b)
    {
        return new Vector3(a.x + b.x, a.y + b.y, a.z + b.z);
    }
    This method creates a new Vector3 that represents the sum of a and b.

    Code (CSharp):
    public static Vector3 operator +(Vector3 a, Vector3 b)
    {
       a.x += b.x;
       a.y += b.y;
       a.z += b.z;
       return a;
    }
    This method is something like a += operator, adding a and b and storing the result in a.

    In this case, it is not a matter of missing optimization, unfortunately. IL2CPP is doing the minimum that needs to be done in both cases.

    Thanks for the investigation though! I'd recommend you have a look at the latest 2020.2 alpha release of Unity. We've made some changes to improve the performance of Vector3 (and similar math operations) recently.
     
  30. Aka_ToolBuddy

    But the end result is the same, right?: you get the sum of A and B. Since Vectors are copied by value and not by reference, it doesn't matter that we create a new vector or += an existing one.
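
    A quick way to check this claim is the sketch below. Vec3 and AddInPlace are hypothetical names standing in for Vector3 and the optimized operator; because structs are passed by value, the method only modifies its own copy of a.

```csharp
using System;

// Hypothetical stand-in for UnityEngine.Vector3.
public struct Vec3
{
    public float x, y, z;

    public Vec3(float x, float y, float z)
    {
        this.x = x;
        this.y = y;
        this.z = z;
    }
}

public static class ValueSemanticsDemo
{
    // Same shape as the optimized operator: reuse parameter a as the result.
    public static Vec3 AddInPlace(Vec3 a, Vec3 b)
    {
        a.x += b.x;
        a.y += b.y;
        a.z += b.z;
        return a;
    }

    public static void Run()
    {
        Vec3 p = new Vec3(1f, 2f, 3f);
        Vec3 q = new Vec3(10f, 20f, 30f);

        Vec3 sum = AddInPlace(p, q);

        // p was copied into the parameter, so the caller's p is unchanged.
        Console.WriteLine(p.x);   // still 1
        Console.WriteLine(sum.x); // 11
    }
}
```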

    Just to state clearly what I think can be improved:
    In the IL2CPP output of the default Vector3 + operator implementation, here is how x values are accessed:
    Code (CSharp):
    Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720  L_0 = ___a0;
    float L_1 = L_0.get_x_0();
    Simple.

    In the IL2CPP output of the optimized Vector3 + operator implementation, here is how x values are accessed:
    Code (CSharp):
    float* L_0 = (&___a0)->get_address_of_x_0();
    float* L_1 = L_0;
    float L_2 = *((float*)L_1);
    We agree that this code is too complicated, right? Can't IL2CPP be enhanced to avoid such complicated and slow access to x?

    I will soon, thanks for the information.
     
  31. JoshPeterson

    Yes, good point. I was thinking of reference types, but Vector3 is a value type. You are correct.

    Still, IL2CPP can only do what the IL code tells it to do, so it is doing the right thing in both cases, although the C# and resulting IL code could be more efficient.

    No, this is not too complicated. If the IL code indicates that the address of each field should be accessed, then IL2CPP must do that.

    Note that IL2CPP is almost never attempting to optimize IL code. It is instead transpiling it to C++ mostly as-is (it does a few optimizations, but not many). Our goal is to use the C++ compiler to do the optimizations. I'd be interested to see the output assembly code in both of these cases - I suspect that it will be similar.
     
  32. Baste

    Baste

    Joined:
    Jan 24, 2013
    Posts:
    6,334
    Yup, @JoshPeterson, @Aka_ToolBuddy is right. Since Vector3 is a struct, modifying a has no side-effects outside the method.

    The whole IL2CPP thing here is a bit annoying as well - the community has been pretty clear about how the Vector3 class could be improved a bunch pretty trivially, and the answer is always "well, it'll be optimized in IL2CPP, which you should build with".

    Which is like... we still use Mono in the editor! Editor performance is important as well!

    Edit: beat me to it. Let me actually check if that performance improvement is as trivial as I seem to remember that it is.
     
  33. JoshPeterson

    Sorry, I was not aware that this is the stance anyone at Unity was taking. At least from the VM team side, this is not the case.

    As I mentioned above, check out the latest Unity 2020.2 alpha releases. We've made some improvements to Vector3 and other math operations. We're open to making more as well. These improvements help across IL2CPP and Mono.
     
  34. Peter77

    The generated IL2CPP code looks quite inefficient. Why all the pointer operations, memory shoveling and function calls?
     
    Last edited: Jun 10, 2020
  35. Aka_ToolBuddy

    Another answer I frequently get, including from a Unity dev, is to use Unity.Mathematics. I never liked that answer because it assumes that changing your whole code base to use that lib is trivial.

    Here is an extract of Frame Rate Booster's description:

    How much frame rate increase should I expect?
    It depends on how heavily your code relies on operations on vectors, quaternions and similar objects. The more such operations there are, the better the optimization will be.
    * On benchmarks, I had a 10% increase.
    * On my other asset, Curvy Splines, I also got a 10% increase for operations like mesh generation and splines cache building.
    * On games doing thousands of geometry operations per frame (like moving a lot of objects), I expect a few percent increase at most. Not too much, but hey, it's free!
    * In the remaining situations, I don't expect any noticeable increase.
     
    Last edited: Sep 13, 2021
  36. Aka_ToolBuddy

    Aka_ToolBuddy

    Joined:
    Feb 25, 2014
    Posts:
    543
    I didn't look at the output assembly code to compare, but from running both of them, the build using the optimized C# implementation was 50% slower than the other in a test build that does only Vector3 additions.
     
  37. Aka_ToolBuddy

    Aka_ToolBuddy

    Joined:
    Feb 25, 2014
    Posts:
    543
    @JoshPeterson @Peter77
    Wouldn't this code work and be faster?
    Code (CSharp):
    Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720  L_1 = ___a0;
    float L_2 = L_1.get_x_0();
    Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720  L_3 = ___b1;
    float L_4 = L_3.get_x_0();
    ___a0.set_x_0((float)il2cpp_codegen_add((float)L_2, (float)L_4));
    instead of

    Code (CSharp):
    float* L_0 = (&___a0)->get_address_of_x_0();
    float* L_1 = L_0;
    float L_2 = *((float*)L_1);
    Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720  L_3 = ___b1;
    float L_4 = L_3.get_x_0();
    *((float*)L_1) = (float)((float)il2cpp_codegen_add((float)L_2, (float)L_4));
    Thanks
     
  38. Baste

    Baste

    Joined:
    Jan 24, 2013
    Posts:
    6,334
    Yeah, sorry, I was being a bit over the top there. Not meaning to make any assumptions! It's just that a lot of the replies we get when we complain about perf are of the form "it's faster with IL2CPP/burst/builds, did you test that?". That's often really annoying, since when we're running in editor, none of those matter. Except maybe burst sometimes?

    About Vector3 specifically, I think Frame Rate Booster linked above shows some pretty clear possible improvements. I cooked up a little test, and when comparing Vector3.LerpUnclamped between Unity's implementation and their implementation, it seems like their implementation takes about 80-90% of the time Unity's does. The "trick" in most of the optimizations is simply reusing the input argument as the output.

    Tests done on 2020.2.0a13:

    Code (csharp):
    // Vector3 copy with Unity's and FrameRateBooster's version:
    using System.Runtime.CompilerServices;
    using System.Runtime.InteropServices;

    [StructLayout(LayoutKind.Sequential)]
    public struct MyVector3 {

        // X component of the vector.
        public float x;
        // Y component of the vector.
        public float y;
        // Z component of the vector.
        public float z;

        // Creates a new vector with given x, y, z components.
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public MyVector3(float x, float y, float z) { this.x = x; this.y = y; this.z = z; }

        // Linearly interpolates between two vectors without clamping the interpolant
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static MyVector3 LerpUnclamped(MyVector3 a, MyVector3 b, float t) // Copied from the C# reference
        {
            return new MyVector3(
                a.x + (b.x - a.x) * t,
                a.y + (b.y - a.y) * t,
                a.z + (b.z - a.z) * t
            );
        }

        // Linearly interpolates between two vectors without clamping the interpolant
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static MyVector3 LerpUnclamped_2(MyVector3 a, MyVector3 b, float t) // Copied from Frame Rate Booster
        {
            a.x += (b.x - a.x) * t;
            a.y += (b.y - a.y) * t;
            a.z += (b.z - a.z) * t;
            return a;
        }
    }

    // Test script (goes in a separate file, since using directives must come first).

    using TMPro;
    using UnityEngine;
    using UnityEngine.Profiling;

    public class TestPerf : MonoBehaviour
    {
        public TextMeshProUGUI result;

        void Update()
        {
            MyVector3 a = new MyVector3(Random.value, Random.value, Random.value);
            MyVector3 b = new MyVector3(Random.value, Random.value, Random.value);
            float t = Random.value;

            Profiler.BeginSample("Unity");
            var result_a = 0f;
            for (int i = 0; i < 10000000; i++)
            {
                var lerped = MyVector3.LerpUnclamped(a, b, t);
                result_a += lerped.x + lerped.y + lerped.z; // paranoid about compiler optimizing away stuff!
            }
            Profiler.EndSample(); // Profile Analyzer: Mean is 229.60

            Profiler.BeginSample("FrameRateBooster");
            var result_b = 0f;
            for (int i = 0; i < 10000000; i++)
            {
                var lerped = MyVector3.LerpUnclamped_2(a, b, t);
                result_b += lerped.x + lerped.y + lerped.z; // paranoid about compiler optimizing away stuff!
            }
            Profiler.EndSample(); // Profile Analyzer: Mean is 186.99

            result.text = $"(do not optimize: {result_a + result_b}!)";
        }
    }

    I'm getting the same kinds of results when not attaching the profiler and just printing the results to screen. The methodology isn't exactly perfect, but it looks like there's a real, tangible difference, so it's for sure worthwhile to look into.
     
    glenneroo and Aka_ToolBuddy like this.
  39. Aka_ToolBuddy

    Aka_ToolBuddy

    Joined:
    Feb 25, 2014
    Posts:
    543
    Thanks, that saved me from doing the tests myself.
     
  40. Aka_ToolBuddy

    Aka_ToolBuddy

    Joined:
    Feb 25, 2014
    Posts:
    543
    I can tell you from user feedback that some of them feel a real difference. I would have loved for users of IL2CPP, which is usually the default choice for people seeking performance, to benefit as well, but they can't use Frame Rate Booster for now.
     
  41. Peter77

    Peter77

    QA Jesus

    Joined:
    Jun 12, 2013
    Posts:
    6,609
    You wouldn't write that code by hand and even for generated code, I wonder what the rational thought behind "let's do it this way" was. :)

    Maybe all this is genius. The C++ compiler can work in mysterious ways and it can differ greatly between platforms. One would have to look at the disassembly to see what instructions it generates. Maybe the IL2CPP generated code gets optimized perfectly by the compiler and that's why it looks like that. ;)

    But I would assume this could unnecessarily move 3 floats around:
    Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720  L_1 = ___a0;


    And this could unnecessarily cause a function call:
    float L_2 = L_1.get_x_0();


    In C++, I would just write something very equivalent to the C# version:
    Code (csharp):
    inline const Vector3 operator+ (const Vector3 &a, const Vector3 &b)
    {
        Vector3 v;
        v.x = a.x + b.x;
        v.y = a.y + b.y;
        v.z = a.z + b.z;
        return v;
        // alternatively: return Vector3(a.x+b.x,a.y+b.y,a.z+b.z);
    }
    This looks less complicated and can probably be better optimized by the C++ toolchain.
     
  42. JoshPeterson

    JoshPeterson

    Unity Technologies

    Joined:
    Jul 21, 2014
    Posts:
    6,931
    I think there might be two different issues here we are discussing:
    1. Is IL2CPP generating the best code possible regarding the current Vector3 implementation?
    2. Can the Vector3 C# implementation be improved to provide better performance?
    TL;DR

    For (1) I believe the answer is yes. We can look at generated IL and assembly code to explore this. For (2), I believe the answer is also yes, but this exploration requires profiling, which is a bit more difficult to do on the forums.

    Let's start with (1). Here is the example code I've looked at. I've tried to get the Unity Vector3 type represented in a standalone .NET executable here:

    Code (CSharp):
    using System;

    namespace ConsoleApp1
    {
        class Program
        {
            public struct Vector3
            {
                public float x;
                public float y;
                public float z;

                public Vector3(float x, float y, float z) { this.x = x; this.y = y; this.z = z; }

                public static Vector3 Add1(Vector3 a, Vector3 b)
                {
                    return new Vector3(a.x + b.x, a.y + b.y, a.z + b.z);
                }

                public static Vector3 Add2(Vector3 a, Vector3 b)
                {
                    a.x += b.x;
                    a.y += b.y;
                    a.z += b.z;
                    return a;
                }

                static readonly Vector3 zeroVector = new Vector3(0F, 0F, 0F);
            }

            static void Main(string[] args)
            {
                var a = new Vector3(1.0f, 2.0f, 3.0f);
                var b = new Vector3(4.0f, 5.0f, 6.0f);

                var result1 = Vector3.Add1(a, b);
                var result2 = Vector3.Add2(a, b);

                Console.WriteLine($"result1: {result1.x}, {result1.y}, {result1.z}");
                Console.WriteLine($"result2: {result2.x}, {result2.y}, {result2.z}");
            }
        }
    }
    The Add1 method is pretty much the current Unity Vector3 implementation. The Add2 method is the better implementation proposed by @Aka_ToolBuddy.

    Let's start with Add1.

    Here is the IL code (using ILSpy):

    Code (CSharp):
    .method public hidebysig static
                valuetype ConsoleApp1.Program/Vector3 Add1 (
                    valuetype ConsoleApp1.Program/Vector3 a,
                    valuetype ConsoleApp1.Program/Vector3 b
                ) cil managed
            {
                // Method begins at RVA 0x2115
                // Code size 45 (0x2d)
                .maxstack 8

                // return new Vector3(a.x + b.x, a.y + b.y, a.z + b.z);
                IL_0000: ldarg.0
                IL_0001: ldfld float32 ConsoleApp1.Program/Vector3::x
                IL_0006: ldarg.1
                IL_0007: ldfld float32 ConsoleApp1.Program/Vector3::x
                IL_000c: add
                IL_000d: ldarg.0
                IL_000e: ldfld float32 ConsoleApp1.Program/Vector3::y
                IL_0013: ldarg.1
                IL_0014: ldfld float32 ConsoleApp1.Program/Vector3::y
                IL_0019: add
                IL_001a: ldarg.0
                IL_001b: ldfld float32 ConsoleApp1.Program/Vector3::z
                IL_0020: ldarg.1
                IL_0021: ldfld float32 ConsoleApp1.Program/Vector3::z
                IL_0026: add
                IL_0027: newobj instance void ConsoleApp1.Program/Vector3::.ctor(float32, float32, float32)
                // (no C# code)
                IL_002c: ret
            } // end of method Vector3::Add1
    Here is the C++ generated by IL2CPP:

    Code (CSharp):
    // ConsoleApp1.Program/Vector3 ConsoleApp1.Program/Vector3::Add1(ConsoleApp1.Program/Vector3,ConsoleApp1.Program/Vector3)
    IL2CPP_EXTERN_C IL2CPP_METHOD_ATTR Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874  Vector3_Add1_m1B0E5B87661EFBBFC00A2B53997A3DE3BD26A88C (Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874  ___a0, Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874  ___b1, const RuntimeMethod* method)
    {
            {
                    // return new Vector3(a.x + b.x, a.y + b.y, a.z + b.z);
                    Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874  L_0 = ___a0;
                    float L_1 = L_0.get_x_0();
                    Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874  L_2 = ___b1;
                    float L_3 = L_2.get_x_0();
                    Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874  L_4 = ___a0;
                    float L_5 = L_4.get_y_1();
                    Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874  L_6 = ___b1;
                    float L_7 = L_6.get_y_1();
                    Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874  L_8 = ___a0;
                    float L_9 = L_8.get_z_2();
                    Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874  L_10 = ___b1;
                    float L_11 = L_10.get_z_2();
                    Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874  L_12;
                    memset((&L_12), 0, sizeof(L_12));
                    Vector3__ctor_mC283B125D085C9C6206E78FF497393B52E64F032((&L_12), ((float)il2cpp_codegen_add((float)L_1, (float)L_3)), ((float)il2cpp_codegen_add((float)L_5, (float)L_7)), ((float)il2cpp_codegen_add((float)L_9, (float)L_11)), /*hidden argument*/NULL);
                    return L_12;
            }
    }
    Here is the x64 assembly generated for a release build with Visual Studio 2019:

    Code (CSharp):
    Vector3_Add1_m1B0E5B87661EFBBFC00A2B53997A3DE3BD26A88C PROC ; COMDAT
    ; File C:\Users\joshu\AppData\Local\Temp\il2cpp\il2cpp__ConsoleApp1\generatedcpp\ConsoleApp1.cpp
    ; Line 397
    $LN18:
            sub     rsp, 56                                 ; 00000038H
    ; Line 408
            mov     eax, DWORD PTR [rdx+8]
    ; File C:\code\il2cpp\libil2cpp\codegen\il2cpp-codegen-common-small.h
    ; Line 56
            movss   xmm0, DWORD PTR [rdx]
            addss   xmm0, DWORD PTR [r8]
            movss   xmm1, DWORD PTR [rdx+4]
            addss   xmm1, DWORD PTR [r8+4]
    ; File C:\Users\joshu\AppData\Local\Temp\il2cpp\il2cpp__ConsoleApp1\generatedcpp\ConsoleApp1.cpp
    ; Line 408
            mov     DWORD PTR L_8$1[rsp+8], eax
    ; Line 410
            mov     eax, DWORD PTR [r8+8]
    ; Line 143
            movss   DWORD PTR [rcx], xmm0
    ; File C:\code\il2cpp\libil2cpp\codegen\il2cpp-codegen-common-small.h
    ; Line 56
            movss   xmm0, DWORD PTR L_8$1[rsp+8]
    ; File C:\Users\joshu\AppData\Local\Temp\il2cpp\il2cpp__ConsoleApp1\generatedcpp\ConsoleApp1.cpp
    ; Line 410
            mov     DWORD PTR L_10$2[rsp+8], eax
    ; Line 415
            mov     rax, rcx
    ; File C:\code\il2cpp\libil2cpp\codegen\il2cpp-codegen-common-small.h
    ; Line 56
            addss   xmm0, DWORD PTR L_10$2[rsp+8]
    ; File C:\Users\joshu\AppData\Local\Temp\il2cpp\il2cpp__ConsoleApp1\generatedcpp\ConsoleApp1.cpp
    ; Line 381
            movss   DWORD PTR [rcx+4], xmm1
    ; Line 384
            movss   DWORD PTR [rcx+8], xmm0
    ; Line 417
            add     rsp, 56                                 ; 00000038H
            ret     0
    Vector3_Add1_m1B0E5B87661EFBBFC00A2B53997A3DE3BD26A88C ENDP
    Now let's look at Add2:

    First the IL code:

    Code (CSharp):
    .method public hidebysig static
                valuetype ConsoleApp1.Program/Vector3 Add2 (
                    valuetype ConsoleApp1.Program/Vector3 a,
                    valuetype ConsoleApp1.Program/Vector3 b
                ) cil managed
            {
                // Method begins at RVA 0x2143
                // Code size 53 (0x35)
                .maxstack 8

                // a.x += b.x;
                IL_0000: ldarga.s a
                IL_0002: ldflda float32 ConsoleApp1.Program/Vector3::x
                IL_0007: dup
                IL_0008: ldind.r4
                IL_0009: ldarg.1
                IL_000a: ldfld float32 ConsoleApp1.Program/Vector3::x
                IL_000f: add
                // (no C# code)
                IL_0010: stind.r4
                // a.y += b.y;
                IL_0011: ldarga.s a
                IL_0013: ldflda float32 ConsoleApp1.Program/Vector3::y
                IL_0018: dup
                IL_0019: ldind.r4
                IL_001a: ldarg.1
                IL_001b: ldfld float32 ConsoleApp1.Program/Vector3::y
                IL_0020: add
                // (no C# code)
                IL_0021: stind.r4
                // a.z += b.z;
                IL_0022: ldarga.s a
                IL_0024: ldflda float32 ConsoleApp1.Program/Vector3::z
                IL_0029: dup
                IL_002a: ldind.r4
                IL_002b: ldarg.1
                IL_002c: ldfld float32 ConsoleApp1.Program/Vector3::z
                IL_0031: add
                // (no C# code)
                IL_0032: stind.r4
                // return a;
                IL_0033: ldarg.0
                // (no C# code)
                IL_0034: ret
            } // end of method Vector3::Add2
    Now the generated C++ code from IL2CPP:

    Code (CSharp):
    // ConsoleApp1.Program/Vector3 ConsoleApp1.Program/Vector3::Add2(ConsoleApp1.Program/Vector3,ConsoleApp1.Program/Vector3)
    IL2CPP_EXTERN_C IL2CPP_METHOD_ATTR Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874  Vector3_Add2_mFC078D81430196FD3B0A2A4675EA50446FE3A0CF (Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874  ___a0, Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874  ___b1, const RuntimeMethod* method)
    {
            {
                    // a.x += b.x;
                    float* L_0 = (&___a0)->get_address_of_x_0();
                    float* L_1 = L_0;
                    float L_2 = *((float*)L_1);
                    Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874  L_3 = ___b1;
                    float L_4 = L_3.get_x_0();
                    *((float*)L_1) = (float)((float)il2cpp_codegen_add((float)L_2, (float)L_4));
                    // a.y += b.y;
                    float* L_5 = (&___a0)->get_address_of_y_1();
                    float* L_6 = L_5;
                    float L_7 = *((float*)L_6);
                    Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874  L_8 = ___b1;
                    float L_9 = L_8.get_y_1();
                    *((float*)L_6) = (float)((float)il2cpp_codegen_add((float)L_7, (float)L_9));
                    // a.z += b.z;
                    float* L_10 = (&___a0)->get_address_of_z_2();
                    float* L_11 = L_10;
                    float L_12 = *((float*)L_11);
                    Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874  L_13 = ___b1;
                    float L_14 = L_13.get_z_2();
                    *((float*)L_11) = (float)((float)il2cpp_codegen_add((float)L_12, (float)L_14));
                    // return a;
                    Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874  L_15 = ___a0;
                    return L_15;
            }
    }
    And finally the assembly code:

    Code (CSharp):
    Vector3_Add2_mFC078D81430196FD3B0A2A4675EA50446FE3A0CF PROC ; COMDAT
    ; File C:\Users\joshu\AppData\Local\Temp\il2cpp\il2cpp__ConsoleApp1\generatedcpp\ConsoleApp1.cpp
    ; Line 420
    $LN12:
            sub     rsp, 24
    ; File C:\code\il2cpp\libil2cpp\codegen\il2cpp-codegen-common-small.h
    ; Line 56
            movss   xmm0, DWORD PTR [rdx]
            addss   xmm0, DWORD PTR [r8]
    ; File C:\Users\joshu\AppData\Local\Temp\il2cpp\il2cpp__ConsoleApp1\generatedcpp\ConsoleApp1.cpp
    ; Line 433
            mov     eax, DWORD PTR [r8+8]
            mov     DWORD PTR L_8$2[rsp+8], eax
    ; Line 440
            mov     DWORD PTR L_13$1[rsp+8], eax
            movss   DWORD PTR [rdx], xmm0
    ; File C:\code\il2cpp\libil2cpp\codegen\il2cpp-codegen-common-small.h
    ; Line 56
            movss   xmm0, DWORD PTR [r8+4]
            addss   xmm0, DWORD PTR [rdx+4]
    ; File C:\Users\joshu\AppData\Local\Temp\il2cpp\il2cpp__ConsoleApp1\generatedcpp\ConsoleApp1.cpp
    ; Line 435
            movss   DWORD PTR [rdx+4], xmm0
    ; File C:\code\il2cpp\libil2cpp\codegen\il2cpp-codegen-common-small.h
    ; Line 56
            movss   xmm0, DWORD PTR L_13$1[rsp+8]
            addss   xmm0, DWORD PTR [rdx+8]
    ; File C:\Users\joshu\AppData\Local\Temp\il2cpp\il2cpp__ConsoleApp1\generatedcpp\ConsoleApp1.cpp
    ; Line 442
            movss   DWORD PTR [rdx+8], xmm0
    ; Line 444
            mov     eax, DWORD PTR [rdx+8]
            movsd   xmm0, QWORD PTR [rdx]
            movsd   QWORD PTR [rcx], xmm0
            mov     DWORD PTR [rcx+8], eax
    ; Line 445
            mov     rax, rcx
    ; Line 447
            add     rsp, 24
            ret     0
    Vector3_Add2_mFC078D81430196FD3B0A2A4675EA50446FE3A0CF ENDP
    Ok, so that is a lot!

    The key thing to understand about IL2CPP is that it is not really making many optimizations. Its job is to translate the IL code into C++ code, then it relies on the C++ compiler to optimize it.

    The generated assembly code for the two implementations looks pretty similar, and this makes me happy - in the end we want to add two vectors, so hopefully the code that ends up running is pretty much the same. Note especially that in neither case does the generated assembly code call any functions - everything is inlined. That likely happens because I'm using a simple example here, with everything in one assembly. Once Vector3 is in a different assembly (as it is in Unity), inlining becomes more complex, and those extra function calls might happen and might matter. This is one of the performance issues we addressed recently. The Vector3 operations should all be inlined now, even when they are in a separate assembly.

    I guess the bottom line point here is that the seemingly complex code that IL2CPP generates is there for a reason - it stems from the IL code. But in the end, it looks like it does not matter too much for the generated assembly code.
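    To make the comparison easier to follow, here is a minimal C++ sketch of the two shapes with the generated names stripped down. The names Vec3, add1_shape, add2_shape and check_shapes are made up for illustration and are not IL2CPP output; the sketch only assumes a plain three-float struct passed by value, as in the listings above.

```cpp
#include <cassert>

// Hypothetical stand-in for the generated Vector3 struct (three floats).
struct Vec3 {
    float x;
    float y;
    float z;
};

// Shape of Add1: read the fields of the by-value arguments and fill in
// a fresh, zero-initialized result (the generated code calls the ctor here).
Vec3 add1_shape(Vec3 a, Vec3 b) {
    Vec3 result{};
    result.x = a.x + b.x;
    result.y = a.y + b.y;
    result.z = a.z + b.z;
    return result;
}

// Shape of Add2: take the address of each field of the by-value copy `a`,
// update it in place, then return that copy.
Vec3 add2_shape(Vec3 a, Vec3 b) {
    float* px = &a.x; *px = *px + b.x;
    float* py = &a.y; *py = *py + b.y;
    float* pz = &a.z; *pz = *pz + b.z;
    return a;
}

// Checks that both shapes agree and that the caller's vector is untouched,
// since `a` inside add2_shape is only a copy of the argument.
void check_shapes() {
    Vec3 a{1.0f, 2.0f, 3.0f};
    Vec3 b{4.0f, 5.0f, 6.0f};
    Vec3 r1 = add1_shape(a, b);
    Vec3 r2 = add2_shape(a, b);
    assert(r1.x == r2.x && r1.y == r2.y && r1.z == r2.z);
    assert(a.x == 1.0f && a.y == 2.0f && a.z == 3.0f);
}
```

    Both functions compute the same vector; whether a given compiler emits the same instructions for them still has to be checked in the disassembly, as done above.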

    Ok, now for point (2).

    All that really matters is the real performance, right? Many of the suggestions here seem to have a real performance benefit (Thanks for the profiling code @Baste!). So we will take some of these suggestions, make the changes, and run them through our performance tests to see what happens.

    I think that we can make some improvements here, but I'm not ready to make definitive statements yet because performance analysis is pretty complex!
     
    EZaca, Bunny83, Noisecrime and 3 others like this.
  43. Aka_ToolBuddy

    Aka_ToolBuddy

    Joined:
    Feb 25, 2014
    Posts:
    543
    Thanks a lot for taking time to answer. Here is my answer to various of your points:

    That's a valid question, and answering it has its benefits, but it is not what I am focused on right now.

    From my analysis of the IL instructions, my personal tests, the tests of other people and users of Frame Rate Booster for a couple of years, I am convinced the answer is yes. I get that changing anything in Unity is a big responsibility, and you need to do your own tests. There is nothing wrong with that. And if you do end up including these changes in Unity, please consider all the other optims of the same kind you can find in Frame Rate Booster.

    For me, the issue I am trying to solve is neither point 1 nor point 2, but a point 3, which is "can IL2CPP translate the custom optimized Vector3 C# implementation in a more efficient way, as suggested here?"

    The IL code you posted is a bit different from mine. When exploring this subject, please also take a look at what Frame Rate Booster produces, and at what has been posted here.

    The limits of my expertise in C++ stop me from contradicting you, but in my tests the two versions did not run at a similar frame rate, one being 50% slower than the other. Maybe you will encounter this once you test within the actual context of Unity, maybe not, I don't know. If you want, I can PM you the WIP version of Frame Rate Booster that is compatible with IL2CPP so you can hopefully see the slowdowns I encountered.

    If the point just above is solved, then I completely agree: the C++ code can be complex, and as long as the C++ compiler handles it, that's OK. But in the other case, do we agree that it is possible to automatically translate the same IL instructions into fewer C++ instructions, as suggested here?

    Thanks again for your time and efforts
     
    Noisecrime likes this.
  44. Peter77

    Peter77

    QA Jesus

    Joined:
    Jun 12, 2013
    Posts:
    6,609
    Isolated tests are often misleading from an optimizations point of view. Josh gave the example with method calls that are inlined in his test. It would be more meaningful if it's tested with an entire game with non trivial complexity.
     
    Aka_ToolBuddy likes this.
  45. JoshPeterson

    JoshPeterson

    Unity Technologies

    Joined:
    Jul 21, 2014
    Posts:
    6,931
    No, IL2CPP cannot translate this code more efficiently. The IL code specifically requires that the address of each field be used.
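    As a rough illustration of where the address comes from, here is one way the IL for `a.x += b.x` could be mapped to C++, one statement per instruction. The names V3, add_x_in_place and check_add_x are hypothetical, and the one-to-one mapping is an assumption about a straightforward translator, not a description of IL2CPP internals.

```cpp
#include <cassert>

// Hypothetical three-float struct standing in for Vector3.
struct V3 {
    float x;
    float y;
    float z;
};

// Sketch of the IL sequence for `a.x += b.x`, followed by `return a`.
// `a` is a by-value parameter, so every address below points into a local copy.
V3 add_x_in_place(V3 a, V3 b) {
    V3*    arg_addr   = &a;            // ldarga.s a : push the address of argument a
    float* field_addr = &arg_addr->x;  // ldflda x   : push the address of field x
    float* dup_addr   = field_addr;    // dup        : duplicate that address
    float  lhs        = *dup_addr;     // ldind.r4   : load the float through one copy
    V3     b_copy     = b;             // ldarg.1    : push a copy of argument b
    float  rhs        = b_copy.x;      // ldfld x    : read field x from that copy
    float  sum        = lhs + rhs;     // add
    *field_addr       = sum;           // stind.r4   : store through the remaining address
    return a;                          // ldarg.0, ret
}

// Only x changes; y and z come back as they were passed in.
void check_add_x() {
    V3 r = add_x_in_place(V3{1.0f, 2.0f, 3.0f}, V3{4.0f, 5.0f, 6.0f});
    assert(r.x == 5.0f);
    assert(r.y == 2.0f && r.z == 3.0f);
}
```

    The load (ldind.r4) and the store (stind.r4) both go through the field address pushed by ldflda, which is why a translation that follows the IL ends up with pointer variables.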
     
  46. JoshPeterson

    JoshPeterson

    Unity Technologies

    Joined:
    Jul 21, 2014
    Posts:
    6,931
    Interesting - I missed the IL code post earlier in this thread. Thanks for pointing it out.

    After looking at it a bit, it is slightly different, but still pretty close in both cases. Anyway, we will make the changes and profile to see what happens.
     
    PraetorBlue and Peter77 like this.
  47. JoshPeterson

    JoshPeterson

    Unity Technologies

    Joined:
    Jul 21, 2014
    Posts:
    6,931
    This is kind of what I was getting at. Prior to Unity 2020.2 (I don't recall the exact version), many of these Vector3 math operations were not inlined, but now they are.
     
    Peter77 likes this.
  48. JoshPeterson

    JoshPeterson

    Unity Technologies

    Joined:
    Jul 21, 2014
    Posts:
    6,931
    Yes, I completely believe this. Performance is complex, and just looking at generated assembly code is not usually enough to understand it. I just wanted to point out that the place for improvement here is in the C# code, not in the way that IL2CPP translates it.

    That said, the performance improvement is the most important part, so that is where we need to focus.
     
    Noisecrime likes this.
  49. Aka_ToolBuddy

    Aka_ToolBuddy

    Joined:
    Feb 25, 2014
    Posts:
    543
    You convinced me. I took more time to dig deeper in this, and I am convinced. Thanks for your patience with me.
     
    atomicjoe and Noisecrime like this.
  50. Aka_ToolBuddy

    Aka_ToolBuddy

    Joined:
    Feb 25, 2014
    Posts:
    543
    To make it easier for everyone, here are the IL instructions, based on Unity 2019.3.6f1, of:

    The default addition implementation
    Code (CSharp):
    .method public hidebysig specialname static
        valuetype UnityEngine.Vector3 op_Addition (
            valuetype UnityEngine.Vector3 a,
            valuetype UnityEngine.Vector3 b
        ) cil managed
    {
        // Method begins at RVA 0x5270
        // Code size 50 (0x32)
        .maxstack 4
        .locals init (
            [0] valuetype UnityEngine.Vector3
        )

        IL_0000: nop
        IL_0001: ldarg.0
        IL_0002: ldfld float32 UnityEngine.Vector3::x
        IL_0007: ldarg.1
        IL_0008: ldfld float32 UnityEngine.Vector3::x
        IL_000d: add
        IL_000e: ldarg.0
        IL_000f: ldfld float32 UnityEngine.Vector3::y
        IL_0014: ldarg.1
        IL_0015: ldfld float32 UnityEngine.Vector3::y
        IL_001a: add
        IL_001b: ldarg.0
        IL_001c: ldfld float32 UnityEngine.Vector3::z
        IL_0021: ldarg.1
        IL_0022: ldfld float32 UnityEngine.Vector3::z
        IL_0027: add
        IL_0028: newobj instance void UnityEngine.Vector3::.ctor(float32, float32, float32)
        IL_002d: stloc.0
        IL_002e: br.s IL_0030

        IL_0030: ldloc.0
        IL_0031: ret
    } // end of method Vector3::op_Addition
    Frame Rate Booster's one
    Code (CSharp):
    .method public hidebysig specialname static
        valuetype UnityEngine.Vector3 op_Addition (
            valuetype UnityEngine.Vector3 a,
            valuetype UnityEngine.Vector3 b
        ) cil managed
    {
        // Method begins at RVA 0x5270
        // Code size 53 (0x35)
        .maxstack 3

        IL_0000: ldarga.s a
        IL_0002: ldflda float32 UnityEngine.Vector3::x
        IL_0007: dup
        IL_0008: ldind.r4
        IL_0009: ldarg.1
        IL_000a: ldfld float32 UnityEngine.Vector3::x
        IL_000f: add
        IL_0010: stind.r4
        IL_0011: ldarga.s a
        IL_0013: ldflda float32 UnityEngine.Vector3::y
        IL_0018: dup
        IL_0019: ldind.r4
        IL_001a: ldarg.1
        IL_001b: ldfld float32 UnityEngine.Vector3::y
        IL_0020: add
        IL_0021: stind.r4
        IL_0022: ldarga.s a
        IL_0024: ldflda float32 UnityEngine.Vector3::z
        IL_0029: dup
        IL_002a: ldind.r4
        IL_002b: ldarg.1
        IL_002c: ldfld float32 UnityEngine.Vector3::z
        IL_0031: add
        IL_0032: stind.r4
        IL_0033: ldarg.0
        IL_0034: ret
    } // end of method Vector3::op_Addition
    Both are very similar to your result.