Vector3 and other structs optimization of operators

Discussion in 'Scripting' started by Aka_ToolBuddy, Jun 17, 2017.

Aka_ToolBuddy

Joined:

Feb 25, 2014

Posts:

543
Hi,
While working on some real-time mesh generation code that did hundreds of thousands of Vector3 operations per frame, I was surprised to find that the operators (*, +, ...) of Vector3 (among other Unity structs) can be easily and massively optimized.

The current implementation of the * operator in Vector3 is:

Code (CSharp):

public static Vector3 operator *(Vector3 a, float d)
{
    return new Vector3(a.x * d, a.y * d, a.z * d);
}

The optimized implementation I suggest is:

Code (CSharp):

public static Vector3 operator *(Vector3 a, float d)
{
    Vector3 result;
    result.x = a.x * d;
    result.y = a.y * d;
    result.z = a.z * d;
    return result;
}

I ran a simple comparison test that you can find here https://dropb.in/ponda.nimrod, and the result was:
When run 50000 times, Unity's current operator took 18.9 ms to execute, while the optimized one took 2.5 ms.
The reason behind this difference is that the optimized version avoids the unnecessary call to the Vector3 constructor.
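
For readers who cannot open the linked test, here is a minimal sketch of how such a comparison could be set up. It is not the original test script: `Vec3`, `OperatorBenchmark`, and the method names are hypothetical, and `Vec3` is a stand-in for UnityEngine.Vector3 so the code runs outside Unity. Timings depend heavily on the runtime, build configuration, and hardware.

```csharp
using System;
using System.Diagnostics;

// Hypothetical stand-in for UnityEngine.Vector3 (assumption, not Unity's type).
public struct Vec3
{
    public float x, y, z;

    public Vec3(float x, float y, float z)
    {
        this.x = x;
        this.y = y;
        this.z = z;
    }
}

public static class OperatorBenchmark
{
    // Constructor-based form, mirroring the current operator.
    public static Vec3 MulWithCtor(Vec3 a, float d)
    {
        return new Vec3(a.x * d, a.y * d, a.z * d);
    }

    // Field-assignment form, mirroring the suggested operator.
    public static Vec3 MulWithFields(Vec3 a, float d)
    {
        Vec3 result;
        result.x = a.x * d;
        result.y = a.y * d;
        result.z = a.z * d;
        return result;
    }

    // Elapsed milliseconds for `iterations` calls of the constructor-based form.
    public static double TimeCtorMs(int iterations)
    {
        Vec3 v = new Vec3(1f, 2f, 3f);
        Vec3 sink = default(Vec3);
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) sink = MulWithCtor(v, 1.0001f);
        sw.Stop();
        // Read the result so the loop is not trivially dead code.
        if (float.IsNaN(sink.x)) throw new InvalidOperationException();
        return sw.Elapsed.TotalMilliseconds;
    }

    // Elapsed milliseconds for `iterations` calls of the field-assignment form.
    public static double TimeFieldsMs(int iterations)
    {
        Vec3 v = new Vec3(1f, 2f, 3f);
        Vec3 sink = default(Vec3);
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) sink = MulWithFields(v, 1.0001f);
        sw.Stop();
        if (float.IsNaN(sink.x)) throw new InvalidOperationException();
        return sw.Elapsed.TotalMilliseconds;
    }
}
```

Calling each timing method once as a warm-up before measuring keeps JIT compilation out of the numbers.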

I opened a suggestion at Unity's feedback site, so please support it by voting for it so we can see this optimization integrated in Unity some day
https://feedback.unity3d.com/suggestions/vector3-and-other-structs-optimization-of-operators

Edit: The Unity feedback website is no more. You can find these optimizations implemented in my asset called Frame Rate Booster

Thanks and have a nice day.
Last edited: Jun 9, 2020

Aka_ToolBuddy, Jun 17, 2017

#1

KyryloKuzyk, Apeles, AndreIvankio and 14 others like this.
ThomasTrenkwalder

Joined:

Jun 18, 2017

Posts:

10

Interesting, I never thought about checking the performance of Unity's math library.
I just tried out the operator you mentioned, and I do indeed get better performance when I roll my own struct and implement the operator as you suggest.
Unity's Vector3 seems to take 30% more time for me, both inside the editor and inside a build (on 5.6.1f1).

ThomasTrenkwalder, Jun 18, 2017

#2

Aka_ToolBuddy likes this.
Aka_ToolBuddy

Joined:

Feb 25, 2014

Posts:

543

Thanks a lot ThomasTrenkwalder for your tests, and thanks for voting for my suggestion, I hope it will make the Unity team consider the suggestion.

Aka_ToolBuddy, Jun 18, 2017

#3
CrystalConflux

Joined:

May 25, 2017

Posts:

107

Interesting. Considering that the current implementation is less verbose, and all that the constructor does is assign the corresponding fields, I wonder why the C# compiler doesn't optimize this?

Have you tested this in standalone release mode? Maybe it only affects debug mode?

ThomasTrenkwalder said: ↑

Unity's Vector3 seems to take 30% more time for me, both inside the editor and inside a build (on 5.6.1f1).

When you tested it in standalone did you disable development build/script debugging?

Last edited: Jun 18, 2017

CrystalConflux, Jun 18, 2017

#4

Bunny83 likes this.
Aka_ToolBuddy

Joined:

Feb 25, 2014

Posts:

543

CrystalConflux said: ↑

I wonder why the C# compiler doesn't optimize this?

I wonder the same thing. It seems to me to be something the compiler can handle, but I suppose things are more complicated than what I imagine.

CrystalConflux said: ↑

When you tested it in standalone did you disable development build/script debugging?

I confirm the optimization works in those conditions as well. I used a heavier version of the script above, and used Fraps to get the FPS count (to rule out any possible issue with Unity's profiler), and here are the results:
- Optimized version: 53 FPS
- Unoptimized version: 42 FPS

Aka_ToolBuddy, Jun 18, 2017

#5
Rick-Gamez

Joined:

Mar 23, 2015

Posts:

218

Wow, I didn't realize that running the constructor in this case would make that big of a difference. (I'm self-taught, BTW.) Thanks for this insight, I will keep this in mind when developing my stuff. Thanks for the info!

Rick-Gamez, Jun 18, 2017

#6
lordofduct

Joined:

Oct 3, 2011

Posts:

8,531

yep, a constructor function is just that... a function.

So it allocates a stack frame to call it.

If you don't call the constructor though, it just allocates the memory needed for the struct with empty values.

This is why structs don't allow field initializers: they MUST start with empty values. Whereas classes always have a constructor phase, so they don't have this restriction.

...

I find this a minor optimization, probably resulting from early Unity. I bet it came about because the Unity devs were all C++ programmers first and foremost, and so didn't really consider the inner workings of the Mono CLR. But it is an area of optimization that could potentially give a little oomph since vector construction is very common.
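
To make the two ways of producing a struct value concrete, here is a minimal sketch. `Vec3` and `StructInit` are hypothetical names used only for illustration. It relies on C#'s definite assignment rule: a struct local declared without `new` can be read as a whole once every field has been assigned.

```csharp
// Hypothetical stand-in for a Unity-style vector struct (assumption).
public struct Vec3
{
    public float x, y, z;

    public Vec3(float x, float y, float z)
    {
        this.x = x;
        this.y = y;
        this.z = z;
    }
}

public static class StructInit
{
    // Goes through the user-defined constructor.
    public static Vec3 ViaConstructor()
    {
        return new Vec3(1f, 2f, 3f);
    }

    // No constructor call: the compiler accepts `return v` only because
    // all three fields were assigned first (definite assignment).
    public static Vec3 ViaFieldAssignment()
    {
        Vec3 v;
        v.x = 1f;
        v.y = 2f;
        v.z = 3f;
        return v;
    }
}
```

Both methods return the same value; removing any one of the three field assignments in the second method turns `return v` into a compile error.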

lordofduct, Jun 18, 2017

#7

Rick-Gamez likes this.
Rick-Gamez

Joined:

Mar 23, 2015

Posts:

218

Yeah, I knew that the constructor is basically a glorified method, but that sheds some light on how C# allocates its stack frames, so thank you for that info!

Rick-Gamez, Jun 18, 2017

#8
ThomasTrenkwalder

Joined:

Jun 18, 2017

Posts:

10

CrystalConflux said: ↑

When you tested it in standalone did you disable development build/script debugging?

Yup, no development build here. I measured the times using the .NET Stopwatch class.
One would think that the compilers (either the C# one or the JIT) should be able to inline this constructor call, but apparently they just don't.

Considering that working with Vector3s and other math structs is quite common in many games, optimizing these operators would provide a nice benefit, and it doesn't even look like a lot of work ^^

ThomasTrenkwalder, Jun 18, 2017

#9
Aka_ToolBuddy

Joined:

Feb 25, 2014

Posts:

543

ThomasTrenkwalder said: ↑

optimizing these operators would provide a nice benefit, and it doesn't even look like a lot of work ^^

I completely agree. I think that implementing these optimizations could be done in less than a man-day.

Rick-Gamez said: ↑

Thanks for the info!

You are welcome. And please consider voting for the suggestion to hopefully make Unity's team implement it.
https://feedback.unity3d.com/suggestions/vector3-and-other-structs-optimization-of-operators

Aka_ToolBuddy, Jun 19, 2017

#10
Aka_ToolBuddy

Joined:

Feb 25, 2014

Posts:

543
For those who are interested, here are the IL instructions for the optimized Vector3 multiplication

Code (CSharp):

.method public hidebysig static
    valuetype [UnityEngine]UnityEngine.Vector3 Optimized_Multiplication (
        valuetype [UnityEngine]UnityEngine.Vector3 a,
        float32 d
    ) cil managed
{
    // Method begins at RVA 0x20e0
    // Code size 58 (0x3a)
    .maxstack 3
    .locals init (
        [0] valuetype [UnityEngine]UnityEngine.Vector3,
        [1] valuetype [UnityEngine]UnityEngine.Vector3
    )

    IL_0000: nop
    IL_0001: ldloca.s 0
    IL_0003: ldarga.s a
    IL_0005: ldfld float32 [UnityEngine]UnityEngine.Vector3::x
    IL_000a: ldarg.1
    IL_000b: mul
    IL_000c: stfld float32 [UnityEngine]UnityEngine.Vector3::x
    IL_0011: ldloca.s 0
    IL_0013: ldarga.s a
    IL_0015: ldfld float32 [UnityEngine]UnityEngine.Vector3::y
    IL_001a: ldarg.1
    IL_001b: mul
    IL_001c: stfld float32 [UnityEngine]UnityEngine.Vector3::y
    IL_0021: ldloca.s 0
    IL_0023: ldarga.s a
    IL_0025: ldfld float32 [UnityEngine]UnityEngine.Vector3::z
    IL_002a: ldarg.1
    IL_002b: mul
    IL_002c: stfld float32 [UnityEngine]UnityEngine.Vector3::z
    IL_0031: ldloc.0
    IL_0032: stloc.1
    IL_0033: br IL_0038
    IL_0038: ldloc.1
    IL_0039: ret
} // end of method test::Optimized_Multiplication

and those for the unoptimized one

Code (CSharp):

.method public hidebysig specialname static
    valuetype UnityEngine.Vector3 op_Multiply (
        valuetype UnityEngine.Vector3 a,
        float32 d
    ) cil managed
{
    // Method begins at RVA 0xb5b8
    // Code size 41 (0x29)
    .maxstack 4
    .locals init (
        [0] valuetype UnityEngine.Vector3
    )

    IL_0000: nop
    IL_0001: ldarga.s a
    IL_0003: ldfld float32 UnityEngine.Vector3::x
    IL_0008: ldarg.1
    IL_0009: mul
    IL_000a: ldarga.s a
    IL_000c: ldfld float32 UnityEngine.Vector3::y
    IL_0011: ldarg.1
    IL_0012: mul
    IL_0013: ldarga.s a
    IL_0015: ldfld float32 UnityEngine.Vector3::z
    IL_001a: ldarg.1
    IL_001b: mul
    IL_001c: newobj instance void UnityEngine.Vector3::.ctor(float32, float32, float32)
    IL_0021: stloc.0
    IL_0022: br IL_0027
    IL_0027: ldloc.0
    IL_0028: ret
} // end of method Vector3::op_Multiply
Aka_ToolBuddy, Jun 19, 2017

#11

glenneroo, CrystalConflux and Rick-Gamez like this.
Invertex

Joined:

Nov 7, 2013

Posts:

1,550
Did a test because I was curious if the same issue would happen with the object initializer {} feature.

Code (CSharp):

public static Vector3 GetSomeVector3()
{
    Vector3 vec;
    vec.x = 3.4f; vec.y = 2.3f; vec.z = 55.5f;
    return vec;
}

IL_0000 nop
IL_0001 ldloca.s vec
IL_0003 ldc.r4 3.4
IL_0008 stfld System.Single UnityEngine.Vector3::x
IL_000D ldloca.s vec
IL_000F ldc.r4 2.3
IL_0014 stfld System.Single UnityEngine.Vector3::y
IL_0019 ldloca.s vec
IL_001B ldc.r4 55.5
IL_0020 stfld System.Single UnityEngine.Vector3::z
IL_0025 ldloc.0
IL_0026 stloc.1
IL_0027 br.s IL_0029
IL_0029 ldloc.1
IL_002A ret

public static Vector3 SomeNewVector3()
{
    return new Vector3 {x = 3.4f, y = 2.3f, z = 55.5f };
}

IL_0000 nop
IL_0001 ldloca.s V_0 //Extra Instruction
IL_0003 initobj UnityEngine.Vector3 //Extra Instruction
IL_0009 ldloca.s V_0
IL_000B ldc.r4 3.4
IL_0010 stfld System.Single UnityEngine.Vector3::x
IL_0015 ldloca.s V_0
IL_0017 ldc.r4 2.3
IL_001C stfld System.Single UnityEngine.Vector3::y
IL_0021 ldloca.s V_0
IL_0023 ldc.r4 55.5
IL_0028 stfld System.Single UnityEngine.Vector3::z
IL_002D ldloc.0
IL_002E stloc.1
IL_002F br.s IL_0031
IL_0031 ldloc.1
IL_0032 ret


The object initializer method does also avoid the call to the constructor, but it still has two extra instructions, the important one being an initobj call, which is going to cause a bit of extra work to be done in the form of it initializing all the values of the struct to zero or null. So while that should still be a lot better than the call to the constructor, the local declaration and assignment still wins out.

I'm really surprised the CLR doesn't optimize this initobj call out if it detects you're assigning to every value in the struct.
Invertex, May 31, 2018

#12

glenneroo, bobisgod234 and Peter77 like this.
Aka_ToolBuddy

Joined:

Feb 25, 2014

Posts:

543

Thanks for that extra information. I didn't think to test that as well.

Aka_ToolBuddy, May 31, 2018

#13
TJHeuvel-net

Joined:

Jul 31, 2012

Posts:

838

https://forum.unity.com/threads/unity-mathematics-available-on-github.526100

They are also making a whole new library.

When working with huge amounts of vectors, it can be beneficial to avoid them altogether. Just define three floats; a Vector3 is still a new object that is being created.
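
As an illustration of that suggestion, here is a minimal sketch that keeps each component in its own float array and scales them in place, so no vector value is constructed at all. The class name `FloatArrays` and the array layout are assumptions made for this example, not something from a Unity API.

```csharp
using System;

public static class FloatArrays
{
    // Scales every point in place; xs[i], ys[i], zs[i] together form point i.
    public static void Scale(float[] xs, float[] ys, float[] zs, float d)
    {
        if (xs.Length != ys.Length || xs.Length != zs.Length)
            throw new ArgumentException("Component arrays must have the same length.");

        for (int i = 0; i < xs.Length; i++)
        {
            xs[i] *= d;
            ys[i] *= d;
            zs[i] *= d;
        }
    }
}
```

The trade-off is readability: the data no longer travels as one value, so this layout is mostly worth considering for hot loops over large point sets.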

TJHeuvel-net, May 31, 2018

#14
Doug_B

Joined:

Jun 4, 2017

Posts:

1,596

Aka_ToolBuddy said: ↑

I opened a suggestion at Unity's feedback site, so please support it by voting for it so we can see this optimization integrated in Unity some day

I linked to it over on this other thread earlier on. Vote count has gone from 152 to 165 in three hours.

TJHeuvel-net said: ↑

They are also making a whole new library.

I wonder why they cannot just fix the aforementioned request rather than create a whole new library that you have to know to get and integrate? I appreciate that a release of Unity (which is presumably what would be required) is no small matter. However, this does seem to be such a fundamental part of a 3D platform to reasonably have expectations of an efficient implementation.

But then maybe I am simply missing something here.

Doug_B, May 31, 2018

#15
Invertex

Joined:

Nov 7, 2013

Posts:

1,550

Doug_B said: ↑

I wonder why they cannot just fix the aforementioned request rather than create a whole new library that you have to know to get and integrate? I appreciate that a release of Unity (which is presumably what would be required) is no small matter. However, this does seem to be such a fundamental part of a 3D platform to reasonably have expectations of an efficient implementation.

But then maybe I am simply missing something here.

You are missing something
That mathematics library isn't the "solution" to this tiny little problem here, it's completely unrelated to it. That mathematics library is designed to help ensure highly efficient compilation of your complex vector/matrix/etc. math in general, helping it be tightly packed and memory efficient in the Burst compiler.
That mathematics library will be integrated in Unity... It's just that it's quite beta right now so people who want to mess with it right now can do so through the repository and also help find bugs or contribute improvements (at some point potentially).

Invertex, May 31, 2018

#16
Doug_B

Joined:

Jun 4, 2017

Posts:

1,596

Invertex said: ↑

That mathematics library isn't the "solution" to this tiny little problem here, it's completely unrelated to it.

Ah, ok. I've got my wires crossed. That means my vote for improved struct performance may not have been wasted then - assuming that ever gets looked at.

Doug_B, May 31, 2018

#17
Peter77

QA Jesus

Joined:

Jun 12, 2013

Posts:

6,609

I rewrote the IL of some of Unity's DLLs and measured the performance of a few applications. My conclusion was that Unity Technologies can achieve quite some performance improvements with very little work and only trivial changes, without changing anything in user code.

Yes, they do provide a new math lib, but to make use of it, you need to change your project. This probably gives better performance, but it might also not be a trivial change. Therefore, if Unity would just change some simple code in their Vector classes, every existing Unity project would benefit from those changes automagically.

Here are my findings:
https://forum.unity.com/threads/wip...faster-without-any-changes-in-seconds.531169/

Peter77, May 31, 2018

#18

glenneroo and Noisecrime like this.
Doug_B

Joined:

Jun 4, 2017

Posts:

1,596

Peter77 said: ↑

Here are my findings:

Interesting video. Thumbs up from me.

Doug_B, May 31, 2018

#19

Peter77 likes this.
Aka_ToolBuddy

Joined:

Feb 25, 2014

Posts:

543

TJHeuvel-net said: ↑

https://forum.unity.com/threads/unity-mathematics-available-on-github.526100

They are also making a whole new library.

When working with huge amounts of vectors, it can be beneficial to avoid them altogether. Just define three floats; a Vector3 is still a new object that is being created.

The new Unity mathematics library definitely has its benefits, which are greater than those of the optimizations this thread is about. But using that library means you have to modify or rewrite parts of your code. The Vector3 (and similar) optimization works with zero modifications to your code.

What kills me the most is knowing that this optimization should take Unity's developers hardly more than a man-day to implement, which is peanuts given the performance increase it brings (Peter77 spoke here about a 4% increase in his game). Since people at Unity are aware of this optimization (suggestion ticket, plus me writing to them), the most probable explanation I see is that the internal organization of the company has become so complicated that making such simple, useful modifications has become a daunting task.

Aka_ToolBuddy, Jun 4, 2018

#20
Aka_ToolBuddy

Joined:

Feb 25, 2014

Posts:

543

Peter77 said: ↑

I rewrote the IL of some of Unity's DLLs and measured the performance of a few applications. My conclusion was that Unity Technologies can achieve quite some performance improvements with very little work and only trivial changes, without changing anything in user code.

Yes, they do provide a new math lib, but to make use of it, you need to change your project. This probably gives better performance, but it might also not be a trivial change. Therefore, if Unity would just change some simple code in their Vector classes, every existing Unity project would benefit from those changes automagically.

Here are my findings:
https://forum.unity.com/threads/wip...faster-without-any-changes-in-seconds.531169/

Wow, that's some great tooling there. Thanks a lot Peter for making this, and pushing the idea beyond where I stopped.

Last edited: Jun 5, 2018

Aka_ToolBuddy, Jun 4, 2018

#21

Doug_B and Peter77 like this.
Aka_ToolBuddy

Joined:

Feb 25, 2014

Posts:

543

Doug_B said: ↑

I linked to it over on this other thread earlier on. Vote count has gone from 152 to 165 in three hours.

Thanks a lot for spreading the word.

Aka_ToolBuddy, Jun 4, 2018

#22
Aka_ToolBuddy

Joined:

Feb 25, 2014

Posts:

543

Hi again,
I implemented the optimizations in a free asset called Frame Rate Booster. Here is its link https://assetstore.unity.com/packages/tools/utilities/frame-rate-booster-120660?aid=1101l3N9P
It is easy to use, just import it and rebuild your game.
To be completely aware of its limitations, please read the asset description.
Thanks for your interest everyone and have a nice day.

Last edited: May 4, 2020

Aka_ToolBuddy, Aug 30, 2018

#23

glenneroo, Joe-Censored, recursive and 1 other person like this.
scsc

Joined:

Oct 22, 2016

Posts:

3

I'm resurrecting the thread after almost 2 years, because it's still a top google result of phrases like "unity vector3 operator performance", and there have been no official updates yet. The suggestion at Unity's feedback site was also removed without any redirection, as they moved the feedback solely to the forums.

Is there a reason why this solution cannot be integrated officially by Unity in classes like Vector3? It doesn't require any effort, and could easily be back-integrated even to LTS Unity versions like 2018. Even static methods like Vector3.Distance(a, b) internally use a constructor for no good reason, and Vector3.SqrMagnitude(a - b) instead requires users to pass a single Vector3 parameter, which invokes a constructor as part of the minus operator. This produces a huge visible performance difference within the Unity Profiler, affecting all of your libraries, for example if you use Mirror as a multiplayer solution. The optimization asset from the last post isn't on the Asset Store anymore, and I've noticed there's some ILOptimizer solution at https://forum.unity.com/threads/wip...faster-without-any-changes-in-seconds.531169/ , but I don't see any official recommendations. Of course, Unity currently tries to transition towards the DOTS approach with the Unity.Mathematics library, but that doesn't affect any existing projects whatsoever. So what should we do to increase the performance of our libraries? Also, because it's not documented anywhere: are there some compilers that already take care of this issue, such that people who trust their profiler waste their time chasing a problem that doesn't exist, or was it simply ignored by Unity for over 2 years?
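
As a sketch of the workaround this implies for user code, a squared distance can be computed directly from the components, so neither the minus operator nor a temporary vector is involved. `Vec3` and `VecMath.SqrDistance` are hypothetical names; this is not how Unity implements Vector3.Distance or Vector3.SqrMagnitude.

```csharp
// Hypothetical stand-in for UnityEngine.Vector3 (assumption).
public struct Vec3
{
    public float x, y, z;

    public Vec3(float x, float y, float z)
    {
        this.x = x;
        this.y = y;
        this.z = z;
    }
}

public static class VecMath
{
    // Squared distance between a and b using only float locals.
    public static float SqrDistance(Vec3 a, Vec3 b)
    {
        float dx = a.x - b.x;
        float dy = a.y - b.y;
        float dz = a.z - b.z;
        return dx * dx + dy * dy + dz * dz;
    }
}
```

Comparing squared distances against a squared threshold also avoids the square root that a true distance needs.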

Furthermore, calling a function seems to have its own overhead (because it's not inlined), but I've observed that using the "in" (readonly ref) modifier on the parameter, such as "in Vector3 a" instead of "Vector3 a", seems to reduce this overhead by probably 10% - 50%, as the struct isn't unnecessarily copied. I think this should be simple for compilers to optimize automatically in static methods such as those in the Vector3 class, but that doesn't seem to be the case either. Can anyone comment on that, so that people who also google for a solution can find some answers?

Here are some further available links, some even from before 2018, as I wasn't able to find any new official information:
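
A minimal sketch of what such an "in" signature could look like, assuming C# 7.2 or later. `Vec3` and `VecOps.Add` are hypothetical names. The method only reads fields of its parameters; calling instance methods or properties on an "in" parameter of a non-readonly struct can make the compiler insert defensive copies, which would undo part of the benefit.

```csharp
// Hypothetical stand-in for UnityEngine.Vector3 (assumption).
public struct Vec3
{
    public float x, y, z;

    public Vec3(float x, float y, float z)
    {
        this.x = x;
        this.y = y;
        this.z = z;
    }
}

public static class VecOps
{
    // Both operands are passed as readonly references instead of being copied.
    public static Vec3 Add(in Vec3 a, in Vec3 b)
    {
        Vec3 result;
        result.x = a.x + b.x;
        result.y = a.y + b.y;
        result.z = a.z + b.z;
        return result;
    }
}
```

At the call site the modifier is optional for variables: both `VecOps.Add(in p, in q)` and `VecOps.Add(p, q)` compile.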
2011 https://forum.unity.com/threads/vector3-operations-performance.103575/
2015 https://answers.unity.com/questions/1033383/code-performance-when-to-use-new-on-vector3.html (a comment mentions that the new operator isn't a real "new", which doesn't seem to be the case based on our benchmarks)
2018 https://www.reddit.com/r/Unity3D/comments/7w0dvm/dear_unity_why_vector3_isnt_optimized/
2018 https://answers.unity.com/questions/1524021/performance-of-vector-addition-vs-component-additi.html
2020 https://answers.unity.com/questions/1698286/does-making-a-new-vector3-have-an-impact-on-perfor.html

scsc, May 4, 2020

#24

glenneroo and Lesnikus5 like this.
Aka_ToolBuddy

Joined:

Feb 25, 2014

Posts:

543

Hi,

Thanks for your interest in my asset, and for sharing the additional information about this subject.

Unity recently discontinued its asset store's old domain name, that's why the link from my post to Frame Rate Booster wasn't working anymore. I updated that link in my post above. The links on my website toolbuddy.net should always be up to date.

When it comes to why this isn't part of Unity yet, I am as annoyed as you, because like you said, it doesn't require any effort. I tried contacting people at Unity, and I either get no answer, or the irrelevant answer about using DOTS or Unity.Mathematics.

Have a nice day

Aka_ToolBuddy, May 4, 2020

#25

scsc likes this.
Aka_ToolBuddy

Joined:

Feb 25, 2014

Posts:

543
Hi again,
I have an update about the optimizations discussed in this thread and IL2CPP.
As you might already know, IL2CPP transforms IL assemblies to C++ code, then builds that code targeting the selected platform. A reasonable assumption would be that the output of IL2CPP when using a non-optimized IL assembly should be slower than the output when using optimized IL. Unfortunately this is not the case. In my tests the output using the optimized IL assembly was even slower. Before I explain why, please keep in mind that I have virtually zero experience with C++, so please correct me if I am wrong in my explanation:

This is how IL2CPP transforms a non optimized vector3 addition:

Code (CSharp):

public static Vector3 operator +(Vector3 a, Vector3 b)
{
    return new Vector3(a.x + b.x, a.y + b.y, a.z + b.z);
}

becomes

Code (CSharp):

// UnityEngine.Vector3 UnityEngine.Vector3::op_Addition(UnityEngine.Vector3,UnityEngine.Vector3)
IL2CPP_EXTERN_C IL2CPP_METHOD_ATTR Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 Vector3_op_Addition_m929F9C17E5D11B94D50B4AFF1D730B70CB59B50E (Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 ___a0, Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 ___b1, const RuntimeMethod* method)
{
    Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 V_0;
    memset((&V_0), 0, sizeof(V_0));
    {
        Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 L_0 = ___a0;
        float L_1 = L_0.get_x_0();
        Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 L_2 = ___b1;
        float L_3 = L_2.get_x_0();
        Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 L_4 = ___a0;
        float L_5 = L_4.get_y_1();
        Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 L_6 = ___b1;
        float L_7 = L_6.get_y_1();
        Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 L_8 = ___a0;
        float L_9 = L_8.get_z_2();
        Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 L_10 = ___b1;
        float L_11 = L_10.get_z_2();
        Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 L_12;
        memset((&L_12), 0, sizeof(L_12));
        Vector3__ctor_m08F61F548AA5836D8789843ACB4A81E4963D2EE1((&L_12), ((float)il2cpp_codegen_add((float)L_1, (float)L_3)), ((float)il2cpp_codegen_add((float)L_5, (float)L_7)), ((float)il2cpp_codegen_add((float)L_9, (float)L_11)), /*hidden argument*/NULL);
        V_0 = L_12;
        goto IL_0030;
    }

IL_0030:
    {
        Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 L_13 = V_0;
        return L_13;
    }
}

As you can see, just like the non-optimized IL, the non-optimized C++ code allocates a Vector3 that is never used before being overwritten.

Here is the IL2CPP result when run on the optimized version

Code (CSharp):

public static Vector3 operator +(Vector3 a, Vector3 b)
{
    a.x += b.x;
    a.y += b.y;
    a.z += b.z;
    return a;
}

becomes

Code (CSharp):

// UnityEngine.Vector3 UnityEngine.Vector3::op_Addition(UnityEngine.Vector3,UnityEngine.Vector3)
IL2CPP_EXTERN_C IL2CPP_METHOD_ATTR Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 Vector3_op_Addition_m929F9C17E5D11B94D50B4AFF1D730B70CB59B50E (Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 ___a0, Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 ___b1, const RuntimeMethod* method)
{
    {
        float* L_0 = (&___a0)->get_address_of_x_0();
        float* L_1 = L_0;
        float L_2 = *((float*)L_1);
        Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 L_3 = ___b1;
        float L_4 = L_3.get_x_0();
        *((float*)L_1) = (float)((float)il2cpp_codegen_add((float)L_2, (float)L_4));
        float* L_5 = (&___a0)->get_address_of_y_1();
        float* L_6 = L_5;
        float L_7 = *((float*)L_6);
        Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 L_8 = ___b1;
        float L_9 = L_8.get_y_1();
        *((float*)L_6) = (float)((float)il2cpp_codegen_add((float)L_7, (float)L_9));
        float* L_10 = (&___a0)->get_address_of_z_2();
        float* L_11 = L_10;
        float L_12 = *((float*)L_11);
        Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 L_13 = ___b1;
        float L_14 = L_13.get_z_2();
        *((float*)L_11) = (float)((float)il2cpp_codegen_add((float)L_12, (float)L_14));
        Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 L_15 = ___a0;
        return L_15;
    }
}

As you can see, the unnecessary memory allocation is gone, but for some reason the access to the vector's x, y and z fields is done in a complicated and slow way, which nullifies the optimization done at the IL level.

So if the IL2CPP generated code accessed the x, y and z fields the simple and fast way, the optimization of the IL could also be useful for projects using IL2CPP.

I am planning on contacting someone at Unity about this. I hope this time they will be responsive. I will keep you updated if I get an answer, and of course I will update my asset to be compatible with IL2CPP once the problem is solved.

Please share with me your thoughts, and have a nice day
Last edited: Jun 10, 2020

Aka_ToolBuddy, Jun 9, 2020

#26

Lesnikus5, Vincenzo, Acissathar and 3 others like this.
Kamyker

Joined:

May 14, 2013

Posts:

1,090
Btw Unity.Mathematics uses similar C# code:
https://github.com/Unity-Technologi...2c4b/src/Unity.Mathematics/float3.gen.cs#L224

Code (CSharp):

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static float3 operator + (float3 lhs, float3 rhs) { return new float3 (lhs.x + rhs.x, lhs.y + rhs.y, lhs.z + rhs.z); }

Interesting that it's slower than default, but I guess AggressiveInlining makes it faster anyway in IL2CPP
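
For readers unfamiliar with the attribute, here is a minimal sketch of the same pattern on a hypothetical `Float3Like` struct (not the real Unity.Mathematics float3). The attribute is only a request to the compiler or JIT to inline the method where possible; whether it is honored, and how much it helps, depends on the runtime and backend.

```csharp
using System.Runtime.CompilerServices;

// Hypothetical struct for illustration; not Unity.Mathematics.float3.
public struct Float3Like
{
    public float x, y, z;

    public Float3Like(float x, float y, float z)
    {
        this.x = x;
        this.y = y;
        this.z = z;
    }

    // Asks for the operator body to be inlined at call sites where possible.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Float3Like operator +(Float3Like lhs, Float3Like rhs)
    {
        return new Float3Like(lhs.x + rhs.x, lhs.y + rhs.y, lhs.z + rhs.z);
    }
}
```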
Kamyker, Jun 9, 2020

#27
Aka_ToolBuddy

Joined:

Feb 25, 2014

Posts:

543
Kamyker said: ↑

Btw Unity.Mathematics uses similar c# code:
https://github.com/Unity-Technologi...2c4b/src/Unity.Mathematics/float3.gen.cs#L224

Code (CSharp):

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static float3 operator + (float3 lhs, float3 rhs) { return new float3 (lhs.x + rhs.x, lhs.y + rhs.y, lhs.z + rhs.z); }

Interesting that it's slower than default, but I guess AggressiveInlining makes it faster anyway in IL2CPP

I don't believe that the inlining is what explains the performance of Mathematics.

When speaking about Vector3 (and similar), the fact that the addition (for example) is implemented as "return new Vector3(a.x + b.x, a.y + b.y, a.z + b.z);" instead of my optimized implementation (that does not call the constructor) is not the problem. The problem is that the compiler is not smart enough to optimize Unity's implementation by skipping the constructor's call. My implementation is just a way to force the compiler to not call the constructor. From my understanding, other C# compilers do that optimization.

I am not familiar with Unity.Mathematics, but from my understanding it uses a different, specially optimized, compiler. So there is nothing strange in having different performance between the addition of float3 and Vector3 even if they have the same C# implementation.
Aka_ToolBuddy, Jun 10, 2020

#28
JoshPeterson

Unity Technologies

Joined:

Jul 21, 2014

Posts:

6,931
Aka_ToolBuddy said: ↑

Hi again,
I have an update about the optimizations discussed in this thread and IL2CPP.
So like you might already know, IL2CPP transforms IL assemblies to C++ code, then builds that code targeting the selected platform. A reasonable assumption would be that the output of IL2CPP when using a non optimized IL assembly should be slower than the output when using optimized IL. Unfortunately this is not the case. In my tests the output using optimized IL assembly was even slower. Before explaining to you why, please keep in mind that I have virtually zero experience with c++, so please correct me if I am wrong in my explanation:

This is how IL2CPP transforms a non optimized vector3 addition:

Code (CSharp):

public static Vector3 operator +(Vector3 a, Vector3 b)

{

return new Vector3(a.x + b.x, a.y + b.y, a.z + b.z);

}

becomes

Code (CSharp):

// UnityEngine.Vector3 UnityEngine.Vector3::op_Addition(UnityEngine.Vector3,UnityEngine.Vector3)

IL2CPP_EXTERN_C IL2CPP_METHOD_ATTR Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 Vector3_op_Addition_m929F9C17E5D11B94D50B4AFF1D730B70CB59B50E (Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 ___a0, Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 ___b1, const RuntimeMethod* method)

{

Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 V_0;

memset((&V_0), 0, sizeof(V_0));

{

Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 L_0 = ___a0;

float L_1 = L_0.get_x_0();

Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 L_2 = ___b1;

float L_3 = L_2.get_x_0();

Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 L_4 = ___a0;

float L_5 = L_4.get_y_1();

Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 L_6 = ___b1;

float L_7 = L_6.get_y_1();

Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 L_8 = ___a0;

float L_9 = L_8.get_z_2();

Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 L_10 = ___b1;

float L_11 = L_10.get_z_2();

Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 L_12;

memset((&L_12), 0, sizeof(L_12));

Vector3__ctor_m08F61F548AA5836D8789843ACB4A81E4963D2EE1((&L_12), ((float)il2cpp_codegen_add((float)L_1, (float)L_3)), ((float)il2cpp_codegen_add((float)L_5, (float)L_7)), ((float)il2cpp_codegen_add((float)L_9, (float)L_11)), /*hidden argument*/NULL);

V_0 = L_12;

goto IL_0030;

}

IL_0030:

{

Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 L_13 = V_0;

return L_13;

}

}

As you can see, and as for the non optimized IL, the non optimized C++ code allocates a vector3 that is not used and overridden further.

Here is the IL2CPP result when run on the optimized version

Code (CSharp):

public static Vector3 operator +(Vector3 a, Vector3 b)

{

a.x += b.x;

a.y += b.y;

a.z += b.z;

return a;

}

becomes

Code (CSharp):

// UnityEngine.Vector3 UnityEngine.Vector3::op_Addition(UnityEngine.Vector3,UnityEngine.Vector3)

IL2CPP_EXTERN_C IL2CPP_METHOD_ATTR Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 Vector3_op_Addition_m929F9C17E5D11B94D50B4AFF1D730B70CB59B50E (Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 ___a0, Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 ___b1, const RuntimeMethod* method)

{

{

float* L_0 = (&___a0)->get_address_of_x_0();

float* L_1 = L_0;

float L_2 = *((float*)L_1);

Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 L_3 = ___b1;

float L_4 = L_3.get_x_0();

*((float*)L_1) = (float)((float)il2cpp_codegen_add((float)L_2, (float)L_4));

float* L_5 = (&___a0)->get_address_of_y_1();

float* L_6 = L_5;

float L_7 = *((float*)L_6);

Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 L_8 = ___b1;

float L_9 = L_8.get_y_1();

*((float*)L_6) = (float)((float)il2cpp_codegen_add((float)L_7, (float)L_9));

float* L_10 = (&___a0)->get_address_of_z_2();

float* L_11 = L_10;

float L_12 = *((float*)L_11);

Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 L_13 = ___b1;

float L_14 = L_13.get_z_2();

*((float*)L_11) = (float)((float)il2cpp_codegen_add((float)L_12, (float)L_14));

Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 L_15 = ___a0;

return L_15;

}

}

As you can see, the unnecessary memory allocation is no more, but for some reason the access to the vector x,y and z fields is done in a complicated and slow way, which nullifies the optimization done at the IL level.

So if the IL2CPP generated code accessed the x,y and z fields the simple and fast way, the optimization of the IL could be useful also for projects using IL2CPP.

I am planning on contacting someone at Unity about this. I hope this time they will be responsive. I will keep you updated if I get an answer, and of course I will update my asset to be compatible with IL2CPP once the problem is solved.

Please share with me your thoughts, and have a nice day
Click to expand...

It is important to note that the two different C# code snippets here do two vastly different things.

Code (CSharp):

public static Vector3 operator +(Vector3 a, Vector3 b)

{

return new Vector3(a.x + b.x, a.y + b.y, a.z + b.z);

}

This method creates a new Vector3 that represents the sum of a and b.

Code (CSharp):

public static Vector3 operator +(Vector3 a, Vector3 b)

{

a.x += b.x;

a.y += b.y;

a.z += b.z;

return a;

}

This method is something like a += operator, adding a and b and storing the result in a.

Aka_ToolBuddy said: ↑

When speaking about Vector3 (and similar), the fact that the addition (for example) is implemented as "return new Vector3(a.x + b.x, a.y + b.y, a.z + b.z);" instead of my optimized implementation (that does not call the constructor) is not the problem. The problem is that the compiler is not smart enough to optimize Unity's implementation by skipping the constructor's call.
Click to expand...

In this case, it is not a matter of missing optimization, unfortunately. IL2CPP is doing the minimum that needs to be done in both cases.

Thanks for the investigation though! I'd recommend you have a look at the latest 2020.2 alpha release of Unity. We've made some changes to improve the performance of Vector3 (and similar math operations) recently.
JoshPeterson, Jun 10, 2020

#29
Aka_ToolBuddy

Joined:

Feb 25, 2014

Posts:

543
JoshPeterson said: ↑

It is important to note that the two different C# code snippets here do two vastly different things.
Click to expand...

But the end result is the same, right?: you get the sum of A and B. Since Vectors are copied by value and not by reference, it doesn't matter that we create a new vector or += an existing one.
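A minimal sketch of that argument, written in C++ because that is what IL2CPP emits, and the generated op_Addition shown above also receives ___a0 by value. Vec3, add_new, add_in_place and demo are hypothetical names used only for illustration; they are not Unity or IL2CPP code:

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical stand-in for Unity's Vector3: a plain struct of three floats.
struct Vec3 {
    float x, y, z;
};

// Mirrors "return new Vector3(...)": builds a fresh value from the components.
Vec3 add_new(Vec3 a, Vec3 b) {
    return Vec3{a.x + b.x, a.y + b.y, a.z + b.z};
}

// Mirrors "a.x += b.x; ... return a;": mutates the callee's own copy of a.
Vec3 add_in_place(Vec3 a, Vec3 b) {
    a.x += b.x;
    a.y += b.y;
    a.z += b.z;
    return a;
}

// Checks that both versions agree and that the caller's a is untouched.
void demo() {
    Vec3 a{1.0f, 2.0f, 3.0f};
    Vec3 b{4.0f, 5.0f, 6.0f};
    Vec3 r1 = add_new(a, b);
    Vec3 r2 = add_in_place(a, b);
    assert(r1.x == r2.x && r1.y == r2.y && r1.z == r2.z);
    assert(a.x == 1.0f && a.y == 2.0f && a.z == 3.0f);
    std::printf("sum = (%g, %g, %g), a is still (%g, %g, %g)\n",
                r2.x, r2.y, r2.z, a.x, a.y, a.z);
}
```

Calling demo() from a main function prints the sum and shows that a still holds its original values, because the in-place version only ever touched its own copy.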

JoshPeterson said: ↑

In this case, it is not a matter of missing optimization, unfortunately. IL2CPP is doing the minimum that needs to be done in both cases.
Click to expand...

Just to state clearly what I think can be improved:
In the IL2CPP output of the default Vector3 + operator implementation, here is how x values are accessed:

Code (CSharp):

Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 L_0 = ___a0;

float L_1 = L_0.get_x_0();

Simple.

In the IL2CPP output of the optimized Vector3 + operator implementation, here is how x values are accessed:

Code (CSharp):

float* L_0 = (&___a0)->get_address_of_x_0();

float* L_1 = L_0;

float L_2 = *((float*)L_1);

We agree that this code is too complicated, right? Can't IL2CPP be enhanced to avoid such complicated and slow access to x?

JoshPeterson said: ↑

Thanks for the investigation though! I'd recommend you have a look at the latest 2020.2 alpha release of Unity. We've made some changes to improve the performance of Vector3 (and similar math operations) recently.
Click to expand...

I will soon, thanks for the information.
Aka_ToolBuddy, Jun 10, 2020

#30

SAMYTHEBIGJUICY, glenneroo and siberhecy like this.
JoshPeterson

Unity Technologies

Joined:

Jul 21, 2014

Posts:

6,931

Aka_ToolBuddy said: ↑

But the end result is the same, right?: you get the sum of A and B. Since Vectors are copied by value and not by reference, it doesn't matter that we create a new vector or += an existing one.
Click to expand...

Yes, good point. I was thinking of reference types, but Vector3 is a value type. You are correct.

Still, IL2CPP can only do what the IL code tells it to do, so it is doing the right thing in both cases, although the C# and resulting IL code could be more efficient.

Aka_ToolBuddy said: ↑

Just to state clearly what I think can be improved:
In the IL2CPP output of the default Vector3 + operator implementation, here is how x values are accessed:
Simple.

In the IL2CPP output of the optimized Vector3 + operator implementation, here is how x values are accessed:
We agree that this code is too complicated, right? Can't IL2CPP be enhanced to avoid such complicated and slow access to x?
Click to expand...

No, this is not too complicated. If the IL code indicates that the address of each field should be accessed, then IL2CPP must do that.

Note that IL2CPP is almost never attempting to optimize IL code. It is instead transpiling it to C++ mostly as-is (it does a few optimizations, but not many). Our goal is to use the C++ compiler to do the optimizations. I'd be interested to see the output assembly code in both of these cases - I suspect that it will be similar.
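To make the comparison concrete, here is a hand-written sketch of the two access patterns in plain C++. The struct and function names are hypothetical simplifications of the generated code quoted in this thread, not actual IL2CPP output, and whether a given compiler reduces both to the same instructions would still have to be confirmed in the disassembly:

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for the generated Vector3 type.
struct Vec3 {
    float x, y, z;
    float get_x() const { return x; }
    float* get_address_of_x() { return &x; }
};

// Pattern from the "new Vector3(...)" version: copy the struct,
// then read the field through a getter (ldarg + ldfld in IL).
float add_x_by_value(Vec3 a, Vec3 b) {
    Vec3 t0 = a;
    float l1 = t0.get_x();
    Vec3 t2 = b;
    float l3 = t2.get_x();
    return l1 + l3;
}

// Pattern from the "a.x += b.x" version: take the field's address,
// duplicate the pointer, load through it, then store through it
// (ldarga + ldflda + dup + ldind.r4 ... stind.r4 in IL).
float add_x_by_address(Vec3 a, Vec3 b) {
    float* p0 = a.get_address_of_x();
    float* p1 = p0;
    float l2 = *p1;
    Vec3 t3 = b;
    float l4 = t3.get_x();
    *p1 = l2 + l4;
    return a.x;
}
```

Both functions compute the same value; they differ only in how the field is reached, which is exactly the kind of difference an optimizing C++ compiler is allowed to erase.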

JoshPeterson, Jun 10, 2020

#31
Baste

Joined:

Jan 24, 2013

Posts:

6,334

Yup, @JoshPeterson, @Aka_ToolBuddy is right. Since Vector3 is a struct, modifying a has no side-effects outside the method.

The whole IL2CPP thing here is a bit annoying as well - the community has been pretty clear about how the Vector3 class could be improved a bunch pretty trivially, and the answer is always "well, it'll be optimized in IL2CPP, which you should build with".

Which is like... we still use Mono in the editor! Editor performance is important as well!

Edit: beat me to it. Let me actually check if that performance improvement is as trivial as I seem to remember that it is.

Baste, Jun 10, 2020

#32
JoshPeterson

Unity Technologies

Joined:

Jul 21, 2014

Posts:

6,931

Baste said: ↑

The whole IL2CPP thing here is a bit annoying as well - the community has been pretty clear about how the Vector3 class could be improved a bunch pretty trivially, and the answer is always "well, it'll be optimized in IL2CPP, which you should build with".
Click to expand...

Sorry, I was not aware that this is the stance anyone at Unity was taking. At least from the VM team side, this is not the case.

As I mentioned above, check out the latest Unity 2020.2 alpha releases. We've made some improvements to Vector3 and other math operations. We're open to making more as well. These improvements help across IL2CPP and Mono.

JoshPeterson, Jun 10, 2020

#33

Baste likes this.
Peter77

QA Jesus

Joined:

Jun 12, 2013

Posts:

6,609

The generated IL2CPP code looks quite inefficient. Why all the pointer operations, memory shoveling and function calls?

Last edited: Jun 10, 2020

Peter77, Jun 10, 2020

#34

Aka_ToolBuddy likes this.
Aka_ToolBuddy

Joined:

Feb 25, 2014

Posts:

543

Baste said: ↑

The whole IL2CPP thing here is a bit annoying as well - the community has been pretty clear about how the Vector3 class could be improved a bunch pretty trivially, and the answer is always "well, it'll be optimized in IL2CPP, which you should build with".
Click to expand...

Also another answer I frequently get, including from a Unity dev, is to use Unity.Mathematics. I never liked that answer because it assumes that changing your whole code base to use that lib is something trivial

Baste said: ↑

Edit: beat me to it. Let me actually check if that performance improvement is as trivial as I seem to remember that it is.
Click to expand...

Here is an extract of Frame Rate Booster's description:

How much frame rate increase should I expect?
It depends on how heavily your code relies on operations on vectors, quaternions and similar objects. The more such operations there are, the better the optimization will be.
* On benchmarks, I had a 10% increase.
* On my other asset, Curvy Splines, I got also a 10% increase for operations like mesh generation and splines cache building.
* On games doing thousands of geometry operations per frame (like moving a lot of objects), I expect a few percent increase at most. Not too much, but hey, it's free!
* On the remaining situations, I don't expect any noticeable increase.

Last edited: Sep 13, 2021

Aka_ToolBuddy, Jun 10, 2020

#35
Aka_ToolBuddy

Joined:

Feb 25, 2014

Posts:

543

JoshPeterson said: ↑

Note that IL2CPP is almost never attempting to optimize IL code. It is instead transpiling it to C++ mostly as-is (it does a few optimizations, but not many). Our goal is to use the C++ compiler to do the optimizations. I'd be interested to see the output assembly code in both of these cases - I suspect that it will be similar.
Click to expand...

I didn't look at the output assembly code to compare, but from running both of them, the build using the optimized C# implementation was 50% slower than the other in a test build that does only Vector3 additions.

Aka_ToolBuddy, Jun 10, 2020

#36
Aka_ToolBuddy

Joined:

Feb 25, 2014

Posts:

543
JoshPeterson said: ↑

No, this is not too complicated. If the IL code indicates that the address of each field should be accessed, then IL2CPP must do that.
Click to expand...

@JoshPeterson @Peter77
Wouldn't this code work and be faster?

Code (CSharp):

Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 L_1 = ___a0;

float L_2 = L_1.get_x_0();

Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 L_3 = ___b1;

float L_4 = L_3.get_x_0();

___a0.set_x_0((float)il2cpp_codegen_add((float)L_2, (float)L_4));

instead of

Code (CSharp):

float* L_0 = (&___a0)->get_address_of_x_0();

float* L_1 = L_0;

float L_2 = *((float*)L_1);

Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 L_3 = ___b1;

float L_4 = L_3.get_x_0();

*((float*)L_1) = (float)((float)il2cpp_codegen_add((float)L_2, (float)L_4));

Thanks
Aka_ToolBuddy, Jun 10, 2020

#37
Baste

Joined:

Jan 24, 2013

Posts:

6,334
JoshPeterson said: ↑

Sorry, I was not aware that this is the stance anyone at Unity was taking. At least from the VM team side, this is not the case.

As I mentioned above, check out the latest Unity 2020.2 alpha releases. We've made some improvements to Vector3 and other math operations. We're open to making more as well. These improvements help across IL2CPP and Mono.
Click to expand...

Yeah, sorry, I was being a bit over the top there. Not meaning to make any assumptions! It's just that a lot of the replies we get when we complain about perf are of the form "it's faster with IL2CPP/burst/builds, did you test that?". That's often really annoying, since when we're running in editor, none of those matter. Except maybe burst sometimes?

About Vector3 specifically, I think Frame Rate Booster linked above shows some pretty clear possible improvements. I cooked up a little test, and when comparing Vector3.LerpUnclamped between Unity's implementation and their implementation, it seems like their implementation takes about 80-90% of the time Unity's does. The "trick" in most of the optimizations is simply reusing the input argument as the output.

Tests done on 2020.2.0a13:

Code (csharp):

// Vector3 copy with Unity's and FrameRateBooster's version:

using System.Runtime.CompilerServices;

using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]

public struct MyVector3 {

// X component of the vector.

public float x;

// Y component of the vector.

public float y;

// Z component of the vector.

public float z;

// Creates a new vector with given x, y, z components.

[MethodImpl(MethodImplOptions.AggressiveInlining)]

public MyVector3(float x, float y, float z) { this.x = x; this.y = y; this.z = z; }

// Linearly interpolates between two vectors without clamping the interpolant

[MethodImpl(MethodImplOptions.AggressiveInlining)]

public static MyVector3 LerpUnclamped(MyVector3 a, MyVector3 b, float t) // Copied from the C# reference

{

return new MyVector3(

a.x + (b.x - a.x) * t,

a.y + (b.y - a.y) * t,

a.z + (b.z - a.z) * t

);

}

// Linearly interpolates between two vectors without clamping the interpolant

[MethodImpl(MethodImplOptions.AggressiveInlining)]

public static MyVector3 LerpUnclamped_2(MyVector3 a, MyVector3 b, float t) // Copied from Frame Rate Booster

{

a.x += (b.x - a.x) * t;

a.y += (b.y - a.y) * t;

a.z += (b.z - a.z) * t;

return a;

}

}

/// Test script.

using TMPro;

using UnityEngine;

using UnityEngine.Profiling;

public class TestPerf : MonoBehaviour

{

public TextMeshProUGUI result;

void Update()

{

MyVector3 a = new MyVector3(Random.value, Random.value, Random.value);

MyVector3 b = new MyVector3(Random.value, Random.value, Random.value);

float t = Random.value;

Profiler.BeginSample("Unity");

var result_a = 0f;

for (int i = 0; i < 10000000; i++)

{

var lerped = MyVector3.LerpUnclamped(a, b, t);

result_a += lerped.x + lerped.y + lerped.z; // paranoid about compiler optimizing away stuff!

}

Profiler.EndSample(); // Profile Analyzer: Mean is 229.60

Profiler.BeginSample("FrameRateBooster");

var result_b = 0f;

for (int i = 0; i < 10000000; i++)

{

var lerped = MyVector3.LerpUnclamped_2(a, b, t);

result_b += lerped.x + lerped.y + lerped.z; // paranoid about compiler optimizing away stuff!

}

Profiler.EndSample(); // Profile Analyzer: Mean is 186.99

result.text = $"(do not optimize: {result_a + result_b}!)";

}

}

I'm getting the same kinds of results when not attaching the profiler and just printing the results to screen. The methodology isn't exactly perfect, but it looks like there's a real, tangible difference, so it's for sure worthwhile to look into.
Baste, Jun 10, 2020

#38

glenneroo and Aka_ToolBuddy like this.
Aka_ToolBuddy

Joined:

Feb 25, 2014

Posts:

543

Baste said: ↑

Tests done on 2020.2.0a13
Click to expand...

Thanks, that saved me from doing the tests myself.

Aka_ToolBuddy, Jun 10, 2020

#39
Aka_ToolBuddy

Joined:

Feb 25, 2014

Posts:

543

Baste said: ↑

The methodology isn't exactly perfect, but it looks like there's a real, tangible difference, so it's for sure worthwhile to look into.
Click to expand...

I can tell you from user feedback that some of them feel a real difference. Unfortunately, users building with IL2CPP, which is usually the default choice for people seeking performance, can't use Frame Rate Booster for now.

Aka_ToolBuddy, Jun 10, 2020

#40
Peter77

QA Jesus

Joined:

Jun 12, 2013

Posts:

6,609
Aka_ToolBuddy said: ↑

Code (CSharp):

Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 L_1 = ___a0;

float L_2 = L_1.get_x_0();

Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 L_3 = ___b1;

float L_4 = L_3.get_x_0();

___a0.set_x_0((float)il2cpp_codegen_add((float)L_2, (float)L_4));

Click to expand...

You wouldn't write that code by hand and even for generated code, I wonder what the rational thought behind "let's do it this way" was.

Maybe all this is genius. The C++ compiler can work in mysterious ways and it can differ greatly between platforms. One would have to look at the disassembly to see what instructions it generates. Maybe the IL2CPP generated code gets optimized perfectly by the compiler and that's why it looks like that.

But I would assume this could unnecessarily move 3 floats around:

Vector3_tDCF05E21F632FE2BA260C06E0D10CA81513E6720 L_1 = ___a0;

And this could unnecessarily cause a function call:

float L_2 = L_1.get_x_0();

In C++, I would just write something very equivalent to the C# version:

Code (csharp):

inline const Vector3 operator+ (const Vector3 &a, const Vector3 &b)

{

Vector3 v;

v.x = a.x + b.x;

v.y = a.y + b.y;

v.z = a.z + b.z;

return v;

// alternatively: return Vector3(a.x+b.x,a.y+b.y,a.z+b.z);

}

This looks less complicated and can probably be better optimized by the C++ toolchain.
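For anyone who wants to compile it, a self-contained version of the operator above might look like this. The Vector3 struct here is a hypothetical stand-in with three public floats, and the return type drops the const qualifier, which is a choice of this sketch rather than part of the original snippet:

```cpp
#include <cassert>

// Hypothetical minimal Vector3: three public floats, no constructors.
struct Vector3 {
    float x;
    float y;
    float z;
};

// Component-wise addition; inputs are taken by const reference,
// the result is a new value returned by copy.
inline Vector3 operator+(const Vector3& a, const Vector3& b) {
    Vector3 v;
    v.x = a.x + b.x;
    v.y = a.y + b.y;
    v.z = a.z + b.z;
    return v;
}

// Small self-check that exercises the operator.
inline void check_operator_plus() {
    const Vector3 a{1.0f, 2.0f, 3.0f};
    const Vector3 b{4.0f, 5.0f, 6.0f};
    const Vector3 sum = a + b;
    assert(sum.x == 5.0f && sum.y == 7.0f && sum.z == 9.0f);
}
```

Because the struct is a plain aggregate and the operator is inline, the compiler sees the whole computation in one place, which is the situation the hand-written version is betting on.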
Peter77, Jun 10, 2020

#41
JoshPeterson

Unity Technologies

Joined:

Jul 21, 2014

Posts:

6,931
I think there might be two different issues here we are discussing:

1. Is IL2CPP generating the best code possible regarding the current Vector3 implementation?

2. Can the Vector3 C# implementation be improved to provide better performance?

TL;DR

For (1) I believe the answer is yes. We can look at generated IL and assembly code to explore this. For (2), I believe the answer is also yes, but this exploration requires profiling, which is a bit more difficult to do on the forums.

Let's start with (1). Here is the example code I've looked at. I've tried to get the Unity Vector3 type represented in a standalone .NET executable here:

Code (CSharp):

using System;

namespace ConsoleApp1

{

class Program

{

public struct Vector3

{

public float x;

public float y;

public float z;

public Vector3(float x, float y, float z) { this.x = x; this.y = y; this.z = z; }

public static Vector3 Add1(Vector3 a, Vector3 b)

{

return new Vector3(a.x + b.x, a.y + b.y, a.z + b.z);

}

public static Vector3 Add2(Vector3 a, Vector3 b)

{

a.x += b.x;

a.y += b.y;

a.z += b.z;

return a;

}

static readonly Vector3 zeroVector = new Vector3(0F, 0F, 0F);

}

static void Main(string[] args)

{

var a = new Vector3(1.0f, 2.0f, 3.0f);

var b = new Vector3(4.0f, 5.0f, 6.0f);

var result1 = Vector3.Add1(a, b);

var result2 = Vector3.Add2(a, b);

Console.WriteLine($"result1: {result1.x}, {result1.y}, {result1.z}");

Console.WriteLine($"result2: {result2.x}, {result2.y}, {result2.z}");

}

}

}

The Add1 method is pretty much the current Unity Vector3 implementation. The Add2 method is the better implementation proposed by @Aka_ToolBuddy.

Let's start with Add1.

Here is the IL code (using ILSpy):

Code (CSharp):

.method public hidebysig static

valuetype ConsoleApp1.Program/Vector3 Add1 (

valuetype ConsoleApp1.Program/Vector3 a,

valuetype ConsoleApp1.Program/Vector3 b

) cil managed

{

// Method begins at RVA 0x2115

// Code size 45 (0x2d)

.maxstack 8

// return new Vector3(a.x + b.x, a.y + b.y, a.z + b.z);

IL_0000: ldarg.0

IL_0001: ldfld float32 ConsoleApp1.Program/Vector3::x

IL_0006: ldarg.1

IL_0007: ldfld float32 ConsoleApp1.Program/Vector3::x

IL_000c: add

IL_000d: ldarg.0

IL_000e: ldfld float32 ConsoleApp1.Program/Vector3::y

IL_0013: ldarg.1

IL_0014: ldfld float32 ConsoleApp1.Program/Vector3::y

IL_0019: add

IL_001a: ldarg.0

IL_001b: ldfld float32 ConsoleApp1.Program/Vector3::z

IL_0020: ldarg.1

IL_0021: ldfld float32 ConsoleApp1.Program/Vector3::z

IL_0026: add

IL_0027: newobj instance void ConsoleApp1.Program/Vector3::.ctor(float32, float32, float32)

// (no C# code)

IL_002c: ret

} // end of method Vector3::Add1

Here is the C++ generated by IL2CPP:

Code (CSharp):

// ConsoleApp1.Program/Vector3 ConsoleApp1.Program/Vector3::Add1(ConsoleApp1.Program/Vector3,ConsoleApp1.Program/Vector3)

IL2CPP_EXTERN_C IL2CPP_METHOD_ATTR Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874 Vector3_Add1_m1B0E5B87661EFBBFC00A2B53997A3DE3BD26A88C (Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874 ___a0, Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874 ___b1, const RuntimeMethod* method)

{

{

// return new Vector3(a.x + b.x, a.y + b.y, a.z + b.z);

Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874 L_0 = ___a0;

float L_1 = L_0.get_x_0();

Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874 L_2 = ___b1;

float L_3 = L_2.get_x_0();

Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874 L_4 = ___a0;

float L_5 = L_4.get_y_1();

Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874 L_6 = ___b1;

float L_7 = L_6.get_y_1();

Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874 L_8 = ___a0;

float L_9 = L_8.get_z_2();

Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874 L_10 = ___b1;

float L_11 = L_10.get_z_2();

Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874 L_12;

memset((&L_12), 0, sizeof(L_12));

Vector3__ctor_mC283B125D085C9C6206E78FF497393B52E64F032((&L_12), ((float)il2cpp_codegen_add((float)L_1, (float)L_3)), ((float)il2cpp_codegen_add((float)L_5, (float)L_7)), ((float)il2cpp_codegen_add((float)L_9, (float)L_11)), /*hidden argument*/NULL);

return L_12;

}

}

Here is the x64 assembly generated for a release build with Visual Studio 2019:

Code (CSharp):

Vector3_Add1_m1B0E5B87661EFBBFC00A2B53997A3DE3BD26A88C PROC ; COMDAT

; File C:\Users\joshu\AppData\Local\Temp\il2cpp\il2cpp__ConsoleApp1\generatedcpp\ConsoleApp1.cpp

; Line 397

$LN18:

sub rsp, 56 ; 00000038H

; Line 408

mov eax, DWORD PTR [rdx+8]

; File C:\code\il2cpp\libil2cpp\codegen\il2cpp-codegen-common-small.h

; Line 56

movss xmm0, DWORD PTR [rdx]

addss xmm0, DWORD PTR [r8]

movss xmm1, DWORD PTR [rdx+4]

addss xmm1, DWORD PTR [r8+4]

; File C:\Users\joshu\AppData\Local\Temp\il2cpp\il2cpp__ConsoleApp1\generatedcpp\ConsoleApp1.cpp

; Line 408

mov DWORD PTR L_8$1[rsp+8], eax

; Line 410

mov eax, DWORD PTR [r8+8]

; Line 143

movss DWORD PTR [rcx], xmm0

; File C:\code\il2cpp\libil2cpp\codegen\il2cpp-codegen-common-small.h

; Line 56

movss xmm0, DWORD PTR L_8$1[rsp+8]

; File C:\Users\joshu\AppData\Local\Temp\il2cpp\il2cpp__ConsoleApp1\generatedcpp\ConsoleApp1.cpp

; Line 410

mov DWORD PTR L_10$2[rsp+8], eax

; Line 415

mov rax, rcx

; File C:\code\il2cpp\libil2cpp\codegen\il2cpp-codegen-common-small.h

; Line 56

addss xmm0, DWORD PTR L_10$2[rsp+8]

; File C:\Users\joshu\AppData\Local\Temp\il2cpp\il2cpp__ConsoleApp1\generatedcpp\ConsoleApp1.cpp

; Line 381

movss DWORD PTR [rcx+4], xmm1

; Line 384

movss DWORD PTR [rcx+8], xmm0

; Line 417

add rsp, 56 ; 00000038H

ret 0

Vector3_Add1_m1B0E5B87661EFBBFC00A2B53997A3DE3BD26A88C ENDP

Now let's look at Add2:

First the IL code:

Code (CSharp):

.method public hidebysig static

valuetype ConsoleApp1.Program/Vector3 Add2 (

valuetype ConsoleApp1.Program/Vector3 a,

valuetype ConsoleApp1.Program/Vector3 b

) cil managed

{

// Method begins at RVA 0x2143

// Code size 53 (0x35)

.maxstack 8

// a.x += b.x;

IL_0000: ldarga.s a

IL_0002: ldflda float32 ConsoleApp1.Program/Vector3::x

IL_0007: dup

IL_0008: ldind.r4

IL_0009: ldarg.1

IL_000a: ldfld float32 ConsoleApp1.Program/Vector3::x

IL_000f: add

// (no C# code)

IL_0010: stind.r4

// a.y += b.y;

IL_0011: ldarga.s a

IL_0013: ldflda float32 ConsoleApp1.Program/Vector3::y

IL_0018: dup

IL_0019: ldind.r4

IL_001a: ldarg.1

IL_001b: ldfld float32 ConsoleApp1.Program/Vector3::y

IL_0020: add

// (no C# code)

IL_0021: stind.r4

// a.z += b.z;

IL_0022: ldarga.s a

IL_0024: ldflda float32 ConsoleApp1.Program/Vector3::z

IL_0029: dup

IL_002a: ldind.r4

IL_002b: ldarg.1

IL_002c: ldfld float32 ConsoleApp1.Program/Vector3::z

IL_0031: add

// (no C# code)

IL_0032: stind.r4

// return a;

IL_0033: ldarg.0

// (no C# code)

IL_0034: ret

} // end of method Vector3::Add2

Now the generated C++ code from IL2CPP:

Code (CSharp):

// ConsoleApp1.Program/Vector3 ConsoleApp1.Program/Vector3::Add2(ConsoleApp1.Program/Vector3,ConsoleApp1.Program/Vector3)

IL2CPP_EXTERN_C IL2CPP_METHOD_ATTR Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874 Vector3_Add2_mFC078D81430196FD3B0A2A4675EA50446FE3A0CF (Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874 ___a0, Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874 ___b1, const RuntimeMethod* method)

{

{

// a.x += b.x;

float* L_0 = (&___a0)->get_address_of_x_0();

float* L_1 = L_0;

float L_2 = *((float*)L_1);

Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874 L_3 = ___b1;

float L_4 = L_3.get_x_0();

*((float*)L_1) = (float)((float)il2cpp_codegen_add((float)L_2, (float)L_4));

// a.y += b.y;

float* L_5 = (&___a0)->get_address_of_y_1();

float* L_6 = L_5;

float L_7 = *((float*)L_6);

Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874 L_8 = ___b1;

float L_9 = L_8.get_y_1();

*((float*)L_6) = (float)((float)il2cpp_codegen_add((float)L_7, (float)L_9));

// a.z += b.z;

float* L_10 = (&___a0)->get_address_of_z_2();

float* L_11 = L_10;

float L_12 = *((float*)L_11);

Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874 L_13 = ___b1;

float L_14 = L_13.get_z_2();

*((float*)L_11) = (float)((float)il2cpp_codegen_add((float)L_12, (float)L_14));

// return a;

Vector3_tFFCBF7F002E45CB5A532CF44BF063C204B6FF874 L_15 = ___a0;

return L_15;

}

}

And finally the assembly code:

Code (CSharp):

Vector3_Add2_mFC078D81430196FD3B0A2A4675EA50446FE3A0CF PROC ; COMDAT

; File C:\Users\joshu\AppData\Local\Temp\il2cpp\il2cpp__ConsoleApp1\generatedcpp\ConsoleApp1.cpp

; Line 420

$LN12:

sub rsp, 24

; File C:\code\il2cpp\libil2cpp\codegen\il2cpp-codegen-common-small.h

; Line 56

movss xmm0, DWORD PTR [rdx]

addss xmm0, DWORD PTR [r8]

; File C:\Users\joshu\AppData\Local\Temp\il2cpp\il2cpp__ConsoleApp1\generatedcpp\ConsoleApp1.cpp

; Line 433

mov eax, DWORD PTR [r8+8]

mov DWORD PTR L_8$2[rsp+8], eax

; Line 440

mov DWORD PTR L_13$1[rsp+8], eax

movss DWORD PTR [rdx], xmm0

; File C:\code\il2cpp\libil2cpp\codegen\il2cpp-codegen-common-small.h

; Line 56

movss xmm0, DWORD PTR [r8+4]

addss xmm0, DWORD PTR [rdx+4]

; File C:\Users\joshu\AppData\Local\Temp\il2cpp\il2cpp__ConsoleApp1\generatedcpp\ConsoleApp1.cpp

; Line 435

movss DWORD PTR [rdx+4], xmm0

; File C:\code\il2cpp\libil2cpp\codegen\il2cpp-codegen-common-small.h

; Line 56

movss xmm0, DWORD PTR L_13$1[rsp+8]

addss xmm0, DWORD PTR [rdx+8]

; File C:\Users\joshu\AppData\Local\Temp\il2cpp\il2cpp__ConsoleApp1\generatedcpp\ConsoleApp1.cpp

; Line 442

movss DWORD PTR [rdx+8], xmm0

; Line 444

mov eax, DWORD PTR [rdx+8]

movsd xmm0, QWORD PTR [rdx]

movsd QWORD PTR [rcx], xmm0

mov DWORD PTR [rcx+8], eax

; Line 445

mov rax, rcx

; Line 447

add rsp, 24

ret 0

Vector3_Add2_mFC078D81430196FD3B0A2A4675EA50446FE3A0CF ENDP

Ok, so that is a lot!

The key thing to understand about IL2CPP is that it is not really making many optimizations. Its job is to translate the IL code into C++ code, then it relies on the C++ compiler to optimize it.

The generated assembly code for the two implementations looks pretty similar, and this makes me happy - in the end we want to add two vectors, so hopefully the code that ends up running is pretty much the same. Note especially that in neither case does the generated assembly code call any functions - everything is inlined. That likely happens because I'm using a simple example here, with everything in one assembly. Once Vector3 is in a different assembly (as it is in Unity), inlining becomes more complex, and those extra function calls might happen and might matter. This is one of the performance issues we addressed recently. The Vector3 operations should all be inlined now, even when they are in a separate assembly.

I guess the bottom line point here is that the seemingly complex code that IL2CPP generates is there for a reason - it stems from the IL code. But in the end, it looks like it does not matter too much for the generated assembly code.

Ok, now for point (2).

All that really matters is the real performance, right? Many of the suggestions here seem to have a real performance benefit (Thanks for the profiling code @Baste!). So we will take some of these suggestions, make the changes, and run them through our performance tests to see what happens.

I think that we can make some improvements here, but I'm not ready to make definitive statements yet because performance analysis is pretty complex!
JoshPeterson, Jun 10, 2020

#42

EZaca, Bunny83, Noisecrime and 3 others like this.
Aka_ToolBuddy

Joined:

Feb 25, 2014

Posts:

543

Thanks a lot for taking time to answer. Here is my answer to various of your points:

JoshPeterson said: ↑

1. Is IL2CPP generating the best code possible regarding the current Vector3 implementation?
Click to expand...

That's a valid question, and answering it has its benefits, but it is not what I am focused on right now.

JoshPeterson said: ↑

2. Can the Vector3 C# implementation be improved to provide better performance?
Click to expand...

JoshPeterson said: ↑

For (2), I believe the answer is also yes, but this exploration requires profiling, which is a bit more difficult to do on the forums
Click to expand...

JoshPeterson said: ↑

I think that we can make some improvements here, but I'm not ready to make definitive statements yet because performance analysis is pretty complex!
Click to expand...

From my analysis of the IL instructions, my personal tests, the tests of other people and users of Frame Rate Booster for a couple of years, I am convinced the answer is yes. I get that changing anything in Unity is a big responsibility, and you need to do your own tests. There is nothing wrong with that. And if you do end up including these changes in Unity, please consider all the other optimizations of the same kind you can find in Frame Rate Booster.

For me, the issue I am trying to solve is neither point 1 nor 2, but a point 3, which is "can IL2CPP translate the custom optimized Vector3 C# implementation in a more efficient way? like suggested here"

JoshPeterson said: ↑

Here is the IL code (using ILSpy)
Click to expand...

The IL code you posted is a bit different than mine. Please when exploring this subject, take a look also at what Frame Rate Booster produces, and what has been posted here

JoshPeterson said: ↑

The generated assembly code for the two implementations looks pretty similar, and this makes me happy - in the end we want to add two vectors, so hopefully the code that ends up running is pretty much the same.
Click to expand...

The limits of my expertise in C++ stop me from contradicting you, but in my tests the two versions did not run at a similar frame rate, one being 50% slower than the other. Maybe you will encounter this once you test within the actual context of Unity, maybe not, I don't know. If you want, I can PM you the WIP version of Frame Rate Booster that is compatible with IL2CPP so you can hopefully see the slowdowns I encountered.

JoshPeterson said: ↑

I guess the bottom line point here is that the seemingly complex code that IL2CPP generates is there for a reason - it stems from the IL code.
Click to expand...

JoshPeterson said: ↑

The key thing to understand about IL2CPP is that it is not really making many optimizations. Its job is to translate the IL code into C++ code, then it relies on the C++ compiler to optimize it.
Click to expand...

If the point just above is solved, then I completely agree: the C++ code can be complex, and as long as the C++ compiler handles it, that's OK. But in the other case, do we agree that it is possible to automatically translate the same IL instructions with fewer C++ instructions? Like suggested here

Thanks again for your time and efforts

Aka_ToolBuddy, Jun 10, 2020

#43

Noisecrime likes this.
Peter77

QA Jesus

Joined:

Jun 12, 2013

Posts:

6,609

Aka_ToolBuddy said: ↑

Maybe you will encounter this once you test within the actual context of Unity, maybe not, I don't know.
Click to expand...

Isolated tests are often misleading from an optimizations point of view. Josh gave the example with method calls that are inlined in his test. It would be more meaningful if it's tested with an entire game with non trivial complexity.

Peter77, Jun 10, 2020

#44

Aka_ToolBuddy likes this.
JoshPeterson

Unity Technologies

Joined:

Jul 21, 2014

Posts:

6,931

Aka_ToolBuddy said: ↑

For me, the issue I am trying to solve is neither point 1 or 2, but a point 3 which is "can IL2CPP translate the custom optimized Vector3 C# implementation in a more efficient way? like suggested here"
Click to expand...

No, IL2CPP cannot translate this code more efficiently. The IL code specifically requires that the address of each field be used.

JoshPeterson, Jun 10, 2020

#45
JoshPeterson

Unity Technologies

Joined:

Jul 21, 2014

Posts:

6,931

Aka_ToolBuddy said: ↑

The IL code you posted is a bit different than mine. Please when exploring this subject, take a look also at what Frame Rate Booster produces, and what has been posted here
Click to expand...

Interesting - I missed the IL code post earlier in this thread. Thanks for pointing it out.

After looking at it a bit, it is slightly different, but still pretty close in both cases. Anyway, we will make the changes and profile to see what happens.

JoshPeterson, Jun 10, 2020

#46

PraetorBlue and Peter77 like this.
JoshPeterson

Unity Technologies

Joined:

Jul 21, 2014

Posts:

6,931

Peter77 said: ↑

Isolated tests are often misleading from an optimization point of view. Josh gave the example of method calls that are inlined in his test. It would be more meaningful to test with an entire game of non-trivial complexity.

This is kind of what I was getting at. Prior to Unity 2020.2 (I don't recall the exact version), many of these Vector3 math operations were not inlined, but now they are.
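As a rough illustration of what an inlining hint looks like at the C# level, here is a minimal sketch. It uses a hypothetical `Vec3` struct as a stand-in and is not Unity's actual source. `MethodImplOptions.AggressiveInlining` is the standard .NET hint; whether a given scripting backend honors it for a particular call site is an assumption here, so profiling is still needed.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical stand-in for UnityEngine.Vector3, for illustration only.
public struct Vec3
{
    public float x, y, z;

    public Vec3(float x, float y, float z)
    {
        this.x = x;
        this.y = y;
        this.z = z;
    }

    // Asks the compiler to inline calls to this operator where possible,
    // which removes the call overhead around the constructor-based form.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vec3 operator +(Vec3 a, Vec3 b)
    {
        return new Vec3(a.x + b.x, a.y + b.y, a.z + b.z);
    }
}
```

When the operator is inlined into its caller, the difference between the constructor-based and field-by-field forms can shrink or disappear, which is one reason isolated micro-benchmarks and in-game measurements may disagree.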

JoshPeterson, Jun 10, 2020

#47

Peter77 likes this.
JoshPeterson

Unity Technologies

Joined:

Jul 21, 2014

Posts:

6,931

Aka_ToolBuddy said: ↑

The limitation of my expertise in c++ stops me from contradicting you, but in my tests the two versions did not run at a similar frame rate, one being 50% slower than the other.

Yes, I completely believe this. Performance is complex, and just looking at generated assembly code is not usually enough to understand it. I just wanted to point out that the place for improvement here is in the C+ code, not in the way that IL2CPP translates it.

That said, the performance improvement is the most important part, so that is where we need to focus.

JoshPeterson, Jun 10, 2020

#48

Noisecrime likes this.
Aka_ToolBuddy

Joined:

Feb 25, 2014

Posts:

543

JoshPeterson said: ↑

No, IL2CPP cannot translate this code more efficiently. The IL code specifically requires that the address of each field be used.

You convinced me. I took more time to dig deeper in this, and I am convinced. Thanks for your patience with me.

Aka_ToolBuddy, Jun 10, 2020

#49

atomicjoe and Noisecrime like this.
Aka_ToolBuddy

Joined:

Feb 25, 2014

Posts:

543
JoshPeterson said: ↑

Interesting - I missed the IL code post earlier in this thread. Thanks for pointing it out.

After looking at it a bit, it is slightly different, but still pretty close in both cases. Anyway, we will make the changes and profile to see what happens.

To make it easier for everyone, here are the IL instructions, based on Unity 2019.3.6f1, for:

The default addition implementation

Code (CSharp):

.method public hidebysig specialname static

valuetype UnityEngine.Vector3 op_Addition (

valuetype UnityEngine.Vector3 a,

valuetype UnityEngine.Vector3 b

) cil managed

{

// Method begins at RVA 0x5270

// Code size 50 (0x32)

.maxstack 4

.locals init (

[0] valuetype UnityEngine.Vector3

)

IL_0000: nop

IL_0001: ldarg.0

IL_0002: ldfld float32 UnityEngine.Vector3::x

IL_0007: ldarg.1

IL_0008: ldfld float32 UnityEngine.Vector3::x

IL_000d: add

IL_000e: ldarg.0

IL_000f: ldfld float32 UnityEngine.Vector3::y

IL_0014: ldarg.1

IL_0015: ldfld float32 UnityEngine.Vector3::y

IL_001a: add

IL_001b: ldarg.0

IL_001c: ldfld float32 UnityEngine.Vector3::z

IL_0021: ldarg.1

IL_0022: ldfld float32 UnityEngine.Vector3::z

IL_0027: add

IL_0028: newobj instance void UnityEngine.Vector3::.ctor(float32, float32, float32)

IL_002d: stloc.0

IL_002e: br.s IL_0030

IL_0030: ldloc.0

IL_0031: ret

} // end of method Vector3::op_Addition

Frame Rate Booster's one

Code (CSharp):

.method public hidebysig specialname static

valuetype UnityEngine.Vector3 op_Addition (

valuetype UnityEngine.Vector3 a,

valuetype UnityEngine.Vector3 b

) cil managed

{

// Method begins at RVA 0x5270

// Code size 53 (0x35)

.maxstack 3

IL_0000: ldarga.s a

IL_0002: ldflda float32 UnityEngine.Vector3::x

IL_0007: dup

IL_0008: ldind.r4

IL_0009: ldarg.1

IL_000a: ldfld float32 UnityEngine.Vector3::x

IL_000f: add

IL_0010: stind.r4

IL_0011: ldarga.s a

IL_0013: ldflda float32 UnityEngine.Vector3::y

IL_0018: dup

IL_0019: ldind.r4

IL_001a: ldarg.1

IL_001b: ldfld float32 UnityEngine.Vector3::y

IL_0020: add

IL_0021: stind.r4

IL_0022: ldarga.s a

IL_0024: ldflda float32 UnityEngine.Vector3::z

IL_0029: dup

IL_002a: ldind.r4

IL_002b: ldarg.1

IL_002c: ldfld float32 UnityEngine.Vector3::z

IL_0031: add

IL_0032: stind.r4

IL_0033: ldarg.0

IL_0034: ret

} // end of method Vector3::op_Addition

Both are very similar to your result
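For readers who prefer C# to IL, the two listings above correspond roughly to the following source forms. This is a sketch on a hypothetical `Vec3` struct, and the method names `AddWithConstructor` and `AddInPlace` are made up for illustration; the exact IL a compiler emits can differ with compiler version and build settings.

```csharp
using System;

// Hypothetical stand-in for UnityEngine.Vector3, for illustration only.
public struct Vec3
{
    public float x, y, z;

    public Vec3(float x, float y, float z)
    {
        this.x = x;
        this.y = y;
        this.z = z;
    }

    // Constructor-based form: same shape as the first listing
    // (three adds, then newobj on the three-float constructor).
    public static Vec3 AddWithConstructor(Vec3 a, Vec3 b)
    {
        return new Vec3(a.x + b.x, a.y + b.y, a.z + b.z);
    }

    // In-place form: same shape as the second listing
    // (ldflda on each field of argument a, add, stind.r4, then return a).
    public static Vec3 AddInPlace(Vec3 a, Vec3 b)
    {
        // a is passed by value, so these writes only touch the local copy.
        a.x += b.x;
        a.y += b.y;
        a.z += b.z;
        return a;
    }
}
```

The in-place form takes the address of each field of `a` in order to write to it, which is the `ldflda` / `stind.r4` pattern visible in the second listing.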
Aka_ToolBuddy, Jun 10, 2020

#50
