No More "Cool" Features Please Until Crippling Stuff Is Addressed

Discussion in 'General Discussion' started by Games-Foundry, Nov 15, 2012.

  1. ronan-thibaudau

    Joined:
    Jun 29, 2012
    Posts:
    1,722
    Don't worry, I had got that. I was replying to the person you quoted, but quoting you seemed to make more sense with the way I was wording it, hehe :)
     
  2. Ocid

    Joined:
    Feb 9, 2011
    Posts:
    476
    hehe. Gotcha. No worries.
     
  3. alexzzzz

    Joined:
    Nov 20, 2010
    Posts:
    1,447
    You can do it right now. I have tested Mono.Simd.dll that hides deep inside Unity's folder. It works.
     
  4. ronan-thibaudau

    Joined:
    Jun 29, 2012
    Posts:
    1,722
    How did you test it? (Just so you know, it "works" on everything, but unless there is runtime support it doesn't do SIMD; it just emulates the instructions.)
    Did you test it by comparing it to the raw non-SIMD code, or did you just test that "it works"? If the latter, it probably isn't doing SIMD at all, and that is the expected behavior.
     
  5. alexzzzz

    Joined:
    Nov 20, 2010
    Posts:
    1,447
    This is a test script:
    Code (csharp):
    using System.Threading;
    using Mono.Simd;
    using UnityEngine;

    public class Test : MonoBehaviour
    {
        private void Start()
        {
            // Pause so there is time to attach a debugger before the SIMD code runs.
            Thread.Sleep(10000);

            Vector4f a = new Vector4f(1, 2, 3, 0);
            Vector4f b = new Vector4f(10, 0, 0, 0);
            Vector4f c = new Vector4f(4, 5, 6, 0);

            Vector4f cross;
            Cross(ref a, ref b, out cross);
            float dot = Dot(ref cross, ref c);

            Debug.Log(string.Format("Cross: {0}, Dot: {1}", cross, dot));
            Verify();
        }

        // Same computation with UnityEngine.Vector3, for comparison.
        private static void Verify()
        {
            Vector3 a = new Vector3(1, 2, 3);
            Vector3 b = new Vector3(10, 0, 0);
            Vector3 c = new Vector3(4, 5, 6);

            var cross = Vector3.Cross(a, b);
            var dot = Vector3.Dot(cross, c);

            Debug.Log(string.Format("Cross: {0}, Dot: {1}", cross, dot));
        }

        private static float Dot(ref Vector4f vector1, ref Vector4f vector2)
        {
            // Multiply lane-wise; two horizontal adds then leave the sum of all four lanes in X.
            Vector4f t = vector1 * vector2;
            t = t.HorizontalAdd(t);
            t = t.HorizontalAdd(t);
            return t.X;
        }

        private static void Cross(ref Vector4f a, ref Vector4f b, out Vector4f result)
        {
            // 0xc9 selects lanes (Y, Z, X, W).
            result = (a * b.Shuffle((ShuffleSel)0xc9) - b * a.Shuffle((ShuffleSel)0xc9)).Shuffle((ShuffleSel)0xc9);
        }
    }
    And this is what Start() actually compiles into: http://pastebin.com/RZ4LEsZH

    I don't remember exactly how much faster these Dot and Cross products are compared to UnityEngine.Vector3, but they are several times faster.

    The Thread.Sleep call gives me some time to attach a debugger to the running process and easily find exactly the place I want to examine.

    PS
    Here is how Dot() looks after I moved Thread.Sleep into it: http://pastebin.com/iDu2kwQ1
    and then Cross(): http://pastebin.com/bre2WKVG
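
The 0xc9 selector in Cross() is easier to follow when the shuffle is spelled out lane by lane. What follows is only a minimal sketch in portable C++ rather than Mono.Simd, assuming the usual x86 encoding where each 2-bit field of the selector picks the source lane for one result lane. The names Vec4, shuffle, mul, sub, cross and dot are made up for this illustration. It emulates the data movement only and does not issue SIMD instructions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a 4-lane float vector: lane 0 is X, lane 3 is W.
using Vec4 = std::array<float, 4>;

// Emulates a one-operand 4-lane shuffle. Each 2-bit field of `sel`
// (lowest bits first) names the source lane for the matching result lane.
Vec4 shuffle(const Vec4& v, std::uint8_t sel) {
    Vec4 r{};
    for (std::size_t lane = 0; lane < 4; ++lane) {
        r[lane] = v[(sel >> (2 * lane)) & 0x3u];
    }
    return r;
}

// Lane-wise multiply.
Vec4 mul(const Vec4& a, const Vec4& b) {
    return {a[0] * b[0], a[1] * b[1], a[2] * b[2], a[3] * b[3]};
}

// Lane-wise subtract.
Vec4 sub(const Vec4& a, const Vec4& b) {
    return {a[0] - b[0], a[1] - b[1], a[2] - b[2], a[3] - b[3]};
}

// Cross product via three shuffles, mirroring Cross() in the post.
// 0xC9 = 0b11'00'10'01 selects lanes (1, 2, 0, 3), i.e. (Y, Z, X, W).
Vec4 cross(const Vec4& a, const Vec4& b) {
    constexpr std::uint8_t kYZXW = 0xC9;
    return shuffle(sub(mul(a, shuffle(b, kYZXW)), mul(b, shuffle(a, kYZXW))), kYZXW);
}

// Dot product: lane-wise multiply, then sum all four lanes
// (the role the two HorizontalAdd calls play in the post).
float dot(const Vec4& a, const Vec4& b) {
    const Vec4 t = mul(a, b);
    return t[0] + t[1] + t[2] + t[3];
}
```

With the vectors from the test script, a = (1, 2, 3, 0) and b = (10, 0, 0, 0), cross(a, b) gives (0, 30, -20, 0), and its dot product with c = (4, 5, 6, 0) is 30.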
     
    Last edited: Nov 13, 2013
  6. alexzzzz

    Joined:
    Nov 20, 2010
    Posts:
    1,447
    Another test

    There are three arrays of floats (let's say, height maps or some other maps for a procedurally generated terrain). Let array3 = array1 + array2.

    Source code: http://pastebin.com/XeKtVJ5g

    Results:
    Code (csharp):
    1024x1024 maps

    No SSE: 2902 microseconds
    SSE: 1146 microseconds
    Sums: no SSE = 1048972, SSE = 1048972

    ----------------------------------------------
    2048x2048 maps

    No SSE: 7928 microseconds
    SSE: 4113 microseconds
    Sums: no SSE = 4193830, SSE = 4193830

    ----------------------------------------------
    4096x4096 maps

    No SSE: 31883 microseconds
    SSE: 16630 microseconds
    Sums: no SSE = 1.677484E+07, SSE = 1.677484E+07
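
The source for this test is only linked, not shown, so the following is not that code. It is a rough sketch, in C++ with SSE intrinsics instead of C# and Mono.Simd, of what an element-wise array3 = array1 + array2 might look like when done four floats at a time. The function names add_scalar and add_simd and the SKETCH_HAS_SSE macro are assumptions for illustration. On targets without SSE the sketch falls back to the scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Use SSE intrinsics only where the compiler reports support for them.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE 1
#else
#define SKETCH_HAS_SSE 0
#endif

// Scalar baseline: one addition per loop iteration.
void add_scalar(const std::vector<float>& a, const std::vector<float>& b,
                std::vector<float>& out) {
    assert(a.size() == b.size() && a.size() == out.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = a[i] + b[i];
    }
}

// Four additions per loop iteration where SSE is available.
void add_simd(const std::vector<float>& a, const std::vector<float>& b,
              std::vector<float>& out) {
    assert(a.size() == b.size() && a.size() == out.size());
    const std::size_t n = a.size();
    std::size_t i = 0;
#if SKETCH_HAS_SSE
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(&a[i]);  // unaligned load of 4 floats
        const __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&out[i], _mm_add_ps(va, vb));
    }
#endif
    // Scalar tail for leftover elements (or the whole array without SSE).
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Comparing the output of both functions, as the "Sums" lines in the results do, is a cheap way to confirm that the vectorized path computes the same thing as the baseline.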
     
    Last edited: Nov 13, 2013
  7. RvBGames

    Joined:
    Oct 22, 2013
    Posts:
    141
    The following is based on Cort Stratton's article on SSE optimization, which was written in 2002.

    The test program performs a basic transform using the matrix * vector function.

    The test results show that using Intel’s SIMD instructions on a 1.7 GHz Pentium processor, 1,000,000 transformations can be done in 36 milliseconds consistently.

    Using straight C code without any optimizations takes 2.46 seconds for the same operations.

    Code (csharp):
    Vector4f MatrixMultiply1( Matrix4f m, Vector4f vin )
    {
        float v0 = m.elts[0][0] * vin[0]
                 + m.elts[0][1] * vin[1]
                 + m.elts[0][2] * vin[2]
                 + m.elts[0][3] * vin[3];

        float v1 = m.elts[1][0] * vin[0]
                 + m.elts[1][1] * vin[1]
                 + m.elts[1][2] * vin[2]
                 + m.elts[1][3] * vin[3];

        float v2 = m.elts[2][0] * vin[0]
                 + m.elts[2][1] * vin[1]
                 + m.elts[2][2] * vin[2]
                 + m.elts[2][3] * vin[3];

        float v3 = m.elts[3][0] * vin[0]
                 + m.elts[3][1] * vin[1]
                 + m.elts[3][2] * vin[2]
                 + m.elts[3][3] * vin[3];

        return Vector4f( v0, v1, v2, v3 );
    }

    A revised implementation turns out to be about 5 times faster. The major change is that the function reads and writes the vector components through raw pointers instead of going through the indirect access offered by the class.

    Code (csharp):
    void MatrixMultiply2( Matrix4f m, Vector4f* vin, Vector4f* vout )
    {
        float* in  = vin->Ref();
        float* out = vout->Ref();

        out[0] = m.elts[0][0] * in[0]
               + m.elts[0][1] * in[1]
               + m.elts[0][2] * in[2]
               + m.elts[0][3] * in[3];

        out[1] = m.elts[1][0] * in[0]
               + m.elts[1][1] * in[1]
               + m.elts[1][2] * in[2]
               + m.elts[1][3] * in[3];

        out[2] = m.elts[2][0] * in[0]
               + m.elts[2][1] * in[1]
               + m.elts[2][2] * in[2]
               + m.elts[2][3] * in[3];

        out[3] = m.elts[3][0] * in[0]
               + m.elts[3][1] * in[1]
               + m.elts[3][2] * in[2]
               + m.elts[3][3] * in[3];
    }

    The next performance gain is achieved by utilizing Intel’s Single Instruction, Multiple Data extensions to perform the transformation. The 1,000,000 vertices were transformed in about 290 milliseconds.

    Code (csharp):
    void MatrixMultiply3( Matrix4f m, Vector4f* vin, Vector4f* vout )
    {
        // Get a pointer to the elements of m
        float* row0 = m.Ref();

        __asm
        {
            mov     esi,  vin
            mov     edi,  vout

            // load columns of matrix into xmm4-7
            mov     edx,  row0
            movups  xmm4, [edx]
            movups  xmm5, [edx+0x10]
            movups  xmm6, [edx+0x20]
            movups  xmm7, [edx+0x30]

            // load vertex into xmm0.
            movups  xmm0, [esi]

            // initialize output (xmm2)
            xorps   xmm2, xmm2

            // broadcast v.X into xmm1
            // multiply xmm1 by column 1 of the matrix (xmm4)
            // add xmm1 to the total
            movups  xmm1, xmm0
            shufps  xmm1, xmm1, 0x00
            mulps   xmm1, xmm4
            addps   xmm2, xmm1

            // repeat the process for v.Y
            movups  xmm1, xmm0
            shufps  xmm1, xmm1, 0x55
            mulps   xmm1, xmm5
            addps   xmm2, xmm1

            // repeat the process for v.Z
            movups  xmm1, xmm0
            shufps  xmm1, xmm1, 0xAA
            mulps   xmm1, xmm6
            addps   xmm2, xmm1

            // repeat the process for v.W
            movups  xmm1, xmm0
            shufps  xmm1, xmm1, 0xFF
            mulps   xmm1, xmm7
            addps   xmm2, xmm1

            // write the results to vout
            movups  [edi], xmm2
        }
    }

    However, the biggest gain is achieved when the transformations are batched, that is, when all the vertices are transformed in a single call so the matrix is loaded into registers only once. Doing so takes about 36 milliseconds for the 1,000,000 vertices.

    Code (csharp):
    void BatchMultiply1(Matrix4f &m, Vector4f *vin, Vector4f *vout, int len)
    {
        static const int vecSize = sizeof(Vector4f);

        // transpose the matrix into the xmm4-7
        m.TransposeIntoXMM();

        __asm
        {
            mov     esi, vin
            mov     edi, vout
            mov     ecx, len

    BM1_START:

            // load the next vertex into xmm0, and advance the input pointer
            movups  xmm0, [esi]
            add     esi,  vecSize

            // initialize output (xmm2)
            xorps   xmm2, xmm2

            //
            // Multiply as above
            //

            // write the results to vout, and advance the output pointer
            movups  [edi], xmm2

            // advance the output pointer
            add     edi, vecSize
            dec     ecx
            jnz     BM1_START
        }
    }

    This SIMD approach multiplies each broadcast vector component by a whole matrix column, so it expects the matrix to be in column-major order. Since the test program is using row-major, a function was added to transpose it when needed.
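
To make the column-major point concrete, here is a hedged sketch in portable C++ (no intrinsics or inline assembly) of the same data flow: transpose the row-major matrix once, then for every vertex accumulate column times broadcast component. The names Vec4, Mat4, mul_rows, transpose and batch_mul_columns are invented for this illustration and are not from the article. The inner four-lane loop stands in for one mulps/addps pair.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Vec4 = std::array<float, 4>;
using Mat4 = std::array<Vec4, 4>;  // m[r][c], stored row-major

// Reference: each output component is the dot product of one row with v.
Vec4 mul_rows(const Mat4& m, const Vec4& v) {
    Vec4 out{};
    for (std::size_t r = 0; r < 4; ++r) {
        for (std::size_t c = 0; c < 4; ++c) {
            out[r] += m[r][c] * v[c];
        }
    }
    return out;
}

// After transposing, element [c] of the result is column c of m.
Mat4 transpose(const Mat4& m) {
    Mat4 t{};
    for (std::size_t r = 0; r < 4; ++r) {
        for (std::size_t c = 0; c < 4; ++c) {
            t[c][r] = m[r][c];
        }
    }
    return t;
}

// Batched transform: the transpose happens once, outside the vertex loop.
void batch_mul_columns(const Mat4& m, const std::vector<Vec4>& vin,
                       std::vector<Vec4>& vout) {
    assert(vin.size() == vout.size());
    const Mat4 cols = transpose(m);
    for (std::size_t i = 0; i < vin.size(); ++i) {
        Vec4 acc{};  // plays the role of the zeroed xmm2 accumulator
        for (std::size_t c = 0; c < 4; ++c) {
            // Broadcast component c of the vertex and scale column c by it.
            for (std::size_t lane = 0; lane < 4; ++lane) {
                acc[lane] += cols[c][lane] * vin[i][c];
            }
        }
        vout[i] = acc;
    }
}
```

Both functions compute the same product. The column form simply arranges the work so that every step is a four-wide multiply and add, which is the shape SSE handles in one instruction each.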

    Btw, I did convert this (what I could) to ARM assembly and had decent results.
     
    Last edited: Nov 15, 2013
  8. RvBGames

    Joined:
    Oct 22, 2013
    Posts:
    141
    And optimizers don't do the best job. Maybe better than most, but way off from optimal.
     
  9. superpig

    Drink more water! Unity Technologies

    Joined:
    Jan 16, 2011
    Posts:
    4,649
    Could we move the SSE discussion to another thread, please? Beyond "Unity should support SIMD, which it possibly is doing already" the rest of this doesn't seem relevant to the topic of the thread.
     
  10. im

    Joined:
    Jan 17, 2013
    Posts:
    1,408
  11. Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Last edited: Nov 16, 2013
  12. Murgilod

    Joined:
    Nov 12, 2013
    Posts:
    10,083
    That's from nearly half a year ago, and a 64-bit editor has been high priority for over a year now. These sorts of posts would be nice if Unity had a faster turnaround time, but as it stands, we've been waiting on a 64-bit editor since long after 64-bit operating systems became mainstream, and on a new GUI since it was announced way back in 3.x.
     
  13. iivo_k

    Joined:
    Jan 28, 2013
    Posts:
    314
    Actually, using foreach on a generic list in a C# script compiled by Unity does allocate memory, so you shouldn't use foreach there. With .NET or newer Mono compilers it works as you described, but not with Unity's Mono compiler.

    The following C# script allocates 24 bytes of memory per update:

    Code (csharp):
    using UnityEngine;

    public class ListForEachTest : MonoBehaviour {

        private System.Collections.Generic.List<int> list;

        // Use this for initialization
        void Start () {
            list = new System.Collections.Generic.List<int>(2);
            list.Add(1);
            list.Add(2);
        }

        // Update is called once per frame
        void Update () {
            foreach (int val in list)
            {

            }
        }
    }
     
  14. meta87

    Joined:
    Dec 31, 2012
    Posts:
    254
  15. Smooth-P

    Joined:
    Sep 15, 2012
    Posts:
    214
    In non-ancient .NET versions there is Tuple, which works well for this purpose (passing around, well, tuples of data). Of course those Tuples are classes and are designed for a competent, generational GC, but there is also KeyValuePair<K, V>, which is a struct, or nasty ref / out parameters. In my codebase I've rolled my own Option<T>, Either<L, R>, and Tuple<T1, ..., Tn> types that are immutable structs with readonly field access.

    But I think the general opinion is that a List<T> is the best choice since it can potentially grow to fit the results, thus simplifying the API and end user code. And anyone who is doing anything semi-demanding in Unity and understands the basics of GC is probably already using pre-allocated List<T>s in many places anyway.
     
    Last edited: Dec 4, 2013
  16. Ark_kun

    Joined:
    Dec 19, 2013
    Posts:
    1
    Unity continues using a ~4-year-old core while tightening the restrictions and flexing the muscles?
    Well, I have some bad news for you, developers. The ship is surely going forward. But the important thing is that it's going down.
     
  17. Dantus

    Joined:
    Oct 21, 2009
    Posts:
    5,667
    Congratulations for your first post!
     