Where does it all go?

Discussion in 'General Discussion' started by Arowx, Feb 11, 2017.

  1. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    In theory we have CPUs that can do millions of things every frame.

    e.g:

    A CPU running at 3.5 GHz has time for about 58,333,333 operations a frame (3.5 billion cycles divided by 60 frames, assuming one operation per cycle).

    And when you run a modern multi-core CPU you can multiply that by the number of cores or threads.

    Who's stealing/squandering all that gaming processing power?

    You could say, hey, it's a 3D game so you're cutting your processing power to a third, but chips have SIMD instruction sets to handle vector/matrix operations, even batches of them (MMX, SSE, SSE2, etc.).

    And even at a third you still have about 19,444,444 3D operations a frame?

    Example test code:

    Code (CSharp):
    using System.Collections;
    using System.Collections.Generic;
    using UnityEngine;
    using System.Diagnostics;

    public class OneFrameTest : MonoBehaviour {

        List<int> counters;
        Stopwatch timer;

        void Start () {
            counters = new List<int>(frames);
            timer = new Stopwatch();
        }

        int frames = 100;
        int counter;

        void Update() {

            if (frames > 0)
            {
                timer.Reset();
                timer.Start();

                counter = 0;

                do
                {
                    counter++;
                } while (timer.ElapsedMilliseconds < 16L);

                counters.Add(counter);
            }

            if (frames == 0)
            {
                for (int i = 0; i < counters.Count; i++)
                {
                    print(i + " : " + counters[i]);
                }
            }

            if (frames >= 0) frames--;
        }
    }
    Results on my machine in Unity:

    Code (CSharp):
    1.  
    2. 23 : 191496
    3. 24 : 191706
    4. 25 : 191835
    5. 26 : 191953
    6. 27 : 191625
    7. 28 : 191891
    8. 29 : 191696
    9. 30 : 187944
    10. 31 : 191885
    11. 32 : 190361
    12. 33 : 190641
    13. 34 : 191200
    14. 35 : 191646
    15. 36 : 191955
    16. 37 : 191502
    17. 38 : 190452
    18. 39 : 189946
    19. 40 : 203891
    20. 41 : 191700
    21. 42 : 191511
    22. 43 : 192046
    23. 44 : 191733
    24. 45 : 188930
    25. 46 : 197201
    26. 47 : 189692
    27. 48 : 190824
    28. 49 : 195638
    29. 50 : 187880
    30. 51 : 187643
    31. 52 : 190388
    32. 53 : 190285
    33. 54 : 189610
    34. 55 : 189504
    35. 56 : 198415
    36. 57 : 190303
    37. 58 : 191159
    38. 59 : 188409
    39. 60 : 189778
    40. 61 : 196251
    41. 62 : 191237
    42. 63 : 198426
    43. 64 : 188969
    44. 65 : 190379
    45. 66 : 188860
    46. 67 : 191096
    47. 68 : 191153
    48. 69 : 188934
    49. 70 : 191743
    50. 71 : 190792
    51. 72 : 190204
    52. 73 : 188921
    53. 74 : 191005
    54. 75 : 188810
    55. 76 : 188735
    56. 77 : 189483
    57. 78 : 190817
    58. 79 : 189095
    59. 80 : 187751
    60. 81 : 189009
    61. 82 : 198746
    62. 83 : 188778
    63. 84 : 188894
    64. 85 : 190369
    65. 86 : 191731
    66. 87 : 190610
    67. 88 : 191025
    68. 89 : 189469
    69. 90 : 188812
    70. 91 : 189051
    71. 92 : 190186
    72. 93 : 191331
    73. 94 : 189507
    74. 95 : 189101
    75. 96 : 194340
    76. 97 : 190848
    77. 98 : 190904
    78. 99 : 190922
    Around 190,922 additions and comparisons per 16 ms frame, or about 381,844 ops.

    Which is a lot less than the 58,333,333 ops a 3.5 GHz CPU should be able to do in 16 ms.

    OK, there are the OS, Unity and .NET/Mono sitting between my code and the CPU, but it looks like I'm only seeing about 1/152nd of the raw gaming power I should be getting.

    So, where does it all go?

    Or is there a good reason Microsoft is bringing a Game Mode to the OS?
     
    TechDeveloper and JasonBricco like this.
  2. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Or: what are we doing with the spare 141,173,635 ZX Spectrums? [Silly Thought]

    If you take a modern CPU running at 3.5 GHz with about 1.2 billion transistors, you have about 141,176,000 ZX Spectrums [ZX] (Z80: 8,500 transistors @ 3.5 MHz).

    Now the ZX had a 256 x 192 x 3-bit display.

    To run a screen at 1920 x 1080 @ 32 bits would only take about 473 ZXs.

    The ZX display ran at 24 Hz. To run at 120 Hz we need 5x that.

    So that's 2,365 ZX's to run a modern game display.

    That leaves us with 141,173,635 ZX Spectrums.

    So what are we doing with all those spare ZX Spectrums?
     
  3. Ryiah

    Ryiah

    Joined:
    Oct 11, 2012
    Posts:
    21,175
  4. Ryiah

    Ryiah

    Joined:
    Oct 11, 2012
    Posts:
    21,175
    We're making games that are not restricted to character graphics (the original ZX had a resolution of 256 by 192 but it was restricted to a color grid of 32 by 24 meaning you were drawing modified characters for your game), sound that is not restricted to ten octaves of beep, games that are no longer restricted to less memory than a smart card, etc.

    https://en.wikipedia.org/wiki/ZX_Spectrum
    https://en.wikipedia.org/wiki/Graphic_character

    Memory card with 256KB. Just for comparison the Commodore 1541 gave 170KB of storage per side.

    http://www.acs.com.hk/en/products/307/acos3x-express-microprocessor-card-contact/
     
    Last edited: Feb 11, 2017
  5. neginfinity

    neginfinity

    Joined:
    Jan 27, 2013
    Posts:
    13,569
    The number of CPU operations is not directly correlated with clock rate. An operation may take more than one clock cycle; on the other hand, the CPU may perform multiple operations ahead of time, and then there are delays when dealing with IO, fetching data from memory, etc.

    You can't just say "the CPU can do 50 million operations per second!".

    Also, you don't really know what kind of code your program is turned into after compilation. Measuring just an addition and a comparison (which is what your code does) is rather pointless.

    Modern computers are not ZX Spectrums.


    The ZX Spectrum (64K version) didn't really have a beeper, if I remember correctly. There was a speaker, and you could set its level to high (1) or low (0); changing the state produced a click. If you wanted a beep, you'd have to do it manually, meaning you'd need to busy-wait the required number of milliseconds before toggling the high/low state of the speaker. Which means you couldn't really do anything useful while your program tried to play any music. The Spectrum also didn't have a system clock, meaning you couldn't just measure how much time had passed, and needed to first prepare a table for notes (how many loops should I busy-wait to produce a note of the correct frequency). Also, obviously, no polyphony of any kind. You could, of course, try to put some sort of short "wave" into memory, but with 64K of memory total, 16K of it reserved for the ROM and another chunk of RAM reserved for the screen, this wasn't a good idea.
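    Roughly, that busy-wait approach has this shape (a C#-flavoured sketch of the idea only, not actual Spectrum code; toggleSpeaker stands in for flipping the speaker port, and halfPeriodLoops would come from the precomputed note table):

    Code (CSharp):
    using System;

    static class BeeperSketch
    {
        // Play one note by toggling the speaker line and busy-waiting between toggles.
        // halfPeriodLoops: busy-wait iterations per half wave (looked up in the note table).
        // waves: how many full square waves to produce (sets the note's duration).
        public static void PlayNote(int halfPeriodLoops, int waves, Action toggleSpeaker)
        {
            for (int t = 0; t < waves * 2; t++)          // two toggles per full wave
            {
                for (int i = 0; i < halfPeriodLoops; i++)
                {
                    // busy wait: the CPU can do nothing else while the note plays
                }
                toggleSpeaker();                         // click: output flips between 0 and 1
            }
        }
    }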
     
  6. GarBenjamin

    GarBenjamin

    Joined:
    Dec 26, 2013
    Posts:
    7,441
    Not sure about the 58 million operations per frame estimate but surely yes these modern machines should be able to do far more than we are getting from them. That's just common sense really.

    I think @Ryiah hit it when she mentioned the bloat factor. Over the years as computing power has increased the software side has just slapped on layer upon layer.

    First, we have all of the layers in the OS itself. Basically something like Windows is just a bunch of layers upon layers from the front end people work with all the way down ultimately to the old DOS and device drivers in simplified terms. And there is just so much of it not just vertically but also horizontally which leads to...

    Second, we have all of the crap going on in the OS itself. This is why gamers often use things like Razer Cortex: Game Booster to squeeze out a little more performance by just temporarily disabling some of the many unnecessary services that run on the typical computer. Things just sitting there taking resources (RAM, CPU time and perhaps even disk IO).

    Third, using something like Unity you are again dealing with layers of abstraction. There is overhead not only from the layers themselves (who knows whether you are working 10 layers of abstraction away from the actual end result, or 50 layers or more) but also from the C++ -> C# calls (such as Update, which you can see the more GameObjects you have, each with their own MonoBehaviour getting an Update call once per frame... basically think back to the Space Invaders games we each made last summer to compare)... and then physics and other things that are more complex than is needed in many cases. All of this stuff takes time. And this is not just Unity but most modern game engines.

    Finally, in addition to all of that you have the layers upon layers of abstraction that most developers swear by. Meaning anything that is happening is abstracted away to some degree.

    It's basically the difference of... you want to take a load of limbs out of your yard and haul it off a 1/2 mile away. Instead of you simply loading the truck and driving it over and being done with it you drive the truck a block down the road and pull into a neighbors driveway. You then wait as they hop in the truck and head down the road another couple of blocks where they pull into a driveway. Get out. Another person gets into the truck and drives off. And this continues until the destination is reached and finally the truck is unloaded. Then everything is reversed until finally the truck is back by you, you jump in and return to your home.

    Of course, all of these things are happening incredibly fast unless there is some kind of resource waiting or other delay going on... and there may well be real benefits gained from using such an indirect process if only to make the dev feel like they engineered a beautiful system... but the point is it certainly isn't efficient.

    Basically, the best you can do is to focus on the last two. Choose as streamlined of an engine / framework as possible and design as streamlined of an architecture for your games (or other software) as possible. And just doing those things can give you a hell of a performance boost.

    If Microsoft etc. did it at the OS level we would see a pretty nice performance boost, I think. If they had some kind of true Game Mode that killed off, say, 95% of the crap while a game is running... I'd love to see that boost. Of course, you would still have all of the layers of abstraction to pass through. To really see the difference you would need to get a modern machine and code up some very minimal OS, or basically do something like how consoles used to be. These days even the consoles seem to have huge bloated operating systems (comparatively speaking).


    @Arowx why not try using Razer Cortex: Game Booster on your standalone exe and see how much performance is boosted. It should be usable to boost even the Unity Editor's performance, I would think. Don't expect some massive boost like tripling the speed, but there should definitely be a gain just from killing all of that unnecessary crap while your perf test or Unity is running.
     
    Last edited: Feb 11, 2017
    RavenOfCode, Ryiah and JasonBricco like this.
  7. imaginaryhuman

    imaginaryhuman

    Joined:
    Mar 21, 2010
    Posts:
    5,834
    There have been a lot of higher-level advancements which seem like nothing major but actually take a lot of processing power and levels of abstraction to accomplish. I remember when I first saw that Unity was doing all of its collision detection using full-blown physics calculations, I thought to myself, oh my God, that is such huge overkill, so extremely wasteful... especially, let's say, for 2D games, even with the 2D physics now. Some years ago, if you told someone you'd be simulating an entire environment with realistic physics calculations just for collision detection, they'd think you were nuts. It's a lot of processing to pull that off. And yet here we are now, taking it for granted that you can 'easily' achieve collision detection by adding a simple physics collider to something. And this is now used in games where physics isn't even really needed, like for collisions on a tile map where everything is rectangular. So we've gotten really used to having these high-level, developer-friendly features and ways of doing things that save time, but to pull off that saving in time there are more realtime calculations needed. There's a whole lot of convenience going on there, and 'ease of use' and 'taking the pain out of development', etc.

    That said, I am still puzzled why certain operations in any software take so damn long when the CPU should be cranking out billions of calculations per second. As more CPU resources become available, they seem to be mostly gobbled up by lazy programming, a complete ignorance of optimization ("because why bother, CPUs are fast"), and very high-level functioning, feature bloat, generalization, abstraction, etc., which all add up to huge amounts of 'unseen stuff' that has to be done to pull that off.
     
    Socrates and GarBenjamin like this.
  8. imaginaryhuman

    imaginaryhuman

    Joined:
    Mar 21, 2010
    Posts:
    5,834
    You could always ditch Unity and go program a large 3D game in raw assembly language, and enjoy the 10-100x performance boost, plus take many many many years to finish it.
     
  9. Kronnect

    Kronnect

    Joined:
    Nov 16, 2014
    Posts:
    2,905
    Exactly.
     
  10. GarBenjamin

    GarBenjamin

    Joined:
    Dec 26, 2013
    Posts:
    7,441
    That's an extreme... it'd be fun to do just from a technical perspective, but there is a large area in between the two extremes.

    First it'd be interesting for @Arowx to disable as much stuff in the Unity project as possible... if possible. Then test again.

    Next use C++ and AGK library or C and Allegro library etc to perform the same test and see what the difference is.

    Then maybe just create a straightforward C/C++ console app and perform the test.

    Actually I'll convert @Arowx 's code to a couple of different things and see how it does.

    But basically it is just the layers. More specifically, what I mean by layers is unnecessary code outside of our control being involved (even if it is only running to some degree, say performing checks just to decide that nothing needs to be done) and abstracted (over-generalized) code being involved.

    I have Blitz 3D installed on my new laptop... and GLBasic, which is very fast. And Monkey X and AGK. These are all just straightforward programming languages with game-oriented APIs. Although one or two have built-in physics support, I don't believe it is ingrained into the core and just there all the time. Might be wrong about that. But let's find out what differences, if any, there are between a lighter-weight API (that is likely using minimal, if any, abstraction layers) and a full-blown game engine when running code that would seem to have nothing to do with using a game engine's features at all.

    EDIT: Actually the better thing to do, @Arowx, is just to create your test program purely in C# outside of Unity and see what difference, if any, that makes. You're dealing with generic stuff and overhead layers from both Unity and C# at the moment. See if just using C# on its own has any impact on performance. If performance is basically the same then the issue is likely C# or just all of the other crap mentioned before in the OS.
     
    Last edited: Feb 11, 2017
  11. GarBenjamin

    GarBenjamin

    Joined:
    Dec 26, 2013
    Posts:
    7,441
    Not sure exactly why you're doing things as you're doing them.

    I just wrote a little CPU test doing the same variable counter looping operation in Blitz3D.



    It just times for 1 second to see how many loops are performed and stores that value in an array.
    It only does this 5 times.
    Then the average of those is displayed at the end.

    I then ran this test 5 different times and took the worst and best cases.

    WORST / SLOWEST RUN 37.6 million


    BEST / FASTEST RUN 37.9 million


    Which gives us an average (based on those two runs alone) of 37.75 million loops per second incrementing a single variable each loop.

    This means if targeting 60 fps we can do only about 629,000 operations (equivalent to the overhead of 1 loop iteration plus incrementing 1 variable) per frame.

    I'll knock out a C# test doing the same thing next.
     
    Last edited: Feb 12, 2017
    cyberpunk likes this.
  12. GarBenjamin

    GarBenjamin

    Joined:
    Dec 26, 2013
    Posts:
    7,441
    For C#, I just created a simple Windows form application with a button to click to start the timing test and a label to display the average of the 5 runs.

    I actually should have declared the variables at the form level to stay consistent with the B3D test but it is close enough I am not worrying about it.

    Code (CSharp):
    using System;
    using System.Collections.Generic;
    using System.ComponentModel;
    using System.Data;
    using System.Drawing;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;
    using System.Windows.Forms;
    using System.Diagnostics;

    namespace CPUperfTest
    {
        public partial class frmCPUTest : Form
        {
            public frmCPUTest()
            {
                InitializeComponent();
            }

            private void btnStartTest_Click(object sender, EventArgs e)
            {
                Stopwatch oTimer = new Stopwatch();
                long[] aCounters = new long[5];
                long counter = 0;
                long index = 0;
                long total = 0;
                long average = 0;

                lblResult.Text = "calculating...";
                lblResult.Refresh();
                Application.DoEvents();

                for (index = 0; index < 5; index++)
                {
                    counter = 0;   // reset the count for each one-second run
                    oTimer.Reset();
                    oTimer.Start();

                    while (oTimer.ElapsedMilliseconds < 1000)
                    {
                        counter++;
                    }

                    aCounters[index] = counter;
                    oTimer.Stop();
                }

                for (index = 0; index < 5; index++)
                    total += aCounters[index];

                average = total / 5;

                lblResult.Text = "AVERAGE: " + average.ToString() + " per second";
            }
        }
    }
    I wasn't surprised to see that C# actually performed better than the old Blitz3D. I was impressed however that Blitz3D actually wasn't that far behind the C# performance!

    WORST / SLOWEST RUN ~44.3 million


    BEST / FASTEST RUN 44.4 million


    Which gives us an average (based on those two runs alone) of 44.35 million loops per second incrementing a single variable each loop.

    This means if targeting 60 fps we can do only about 739,000 operations (equivalent to the overhead of 1 loop iteration plus incrementing 1 variable) per frame.

    -------------------------------------------------------------------------------------------------------------------
    I should mention both the B3D and C# tests were built under the Release profile, not the Debug profile.

    Tested on my laptop that has an Intel Core i7 CPU @ 2.50 GHz

    -------------------------------------------------------------------------------------------------------------------

    I'd say we can already see that we are obviously losing a hell of a lot of performance someplace. At the same time, both B3D and C# seem capable of performing enough CPU processing per frame to handle games.

    Yet we can also see we might not have nearly as much room on the CPU side as some people think we do... and this is why it is important to reduce the overhead to the bare minimum, and that includes the callbacks, the abstraction/generalization overhead, etc.

    Especially since we know the graphics uploads to the card, rendering, etc. need some time, and for many games these days most likely the bulk of the time.

    2.50 GHz... and yet the original Doom was doing it all, actually implementing its own 3D rendering engine plus the gameplay, and running on what... ah here we go... a 486 processor operating at a minimum of 66 MHz or any Pentium/Athlon processor.

    66 MHz... so let's say it needed 90 MHz to run largest-window / full-screen and smooth. 90 MHz, and it was doing it all including the rendering (building the display slice by slice). And these days we have 2.50 GHz (2,500 MHz) CPUs (multi-core, which you'd think would at least cover the OS overhead if nothing else) and dedicated graphics cards to handle all of the rendering as well. And this is why @Arowx is rightly questioning... where in hell is it all going. lol

    I will continue on and profile a couple of more languages.

    It would be great if someone converted the C# test to run inside Unity; then it would be clear once and for all how much overhead, if any, is added simply by using Unity.
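    If anyone wants to try it, a minimal MonoBehaviour along these lines should be a rough drop-in equivalent of the Forms test (an untested sketch; note it will block the Editor for about five seconds in Start while the runs complete):

    Code (CSharp):
    using UnityEngine;
    using System.Diagnostics;

    public class CpuPerfTestUnity : MonoBehaviour
    {
        void Start()
        {
            Stopwatch timer = new Stopwatch();
            long[] counts = new long[5];

            for (int run = 0; run < 5; run++)
            {
                long counter = 0;
                timer.Reset();
                timer.Start();

                // same busy loop as the Forms test: count until one second has elapsed
                while (timer.ElapsedMilliseconds < 1000)
                {
                    counter++;
                }

                timer.Stop();
                counts[run] = counter;
            }

            long total = 0;
            for (int i = 0; i < counts.Length; i++)
                total += counts[i];

            print("AVERAGE: " + (total / 5) + " loops per second");
        }
    }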
     
    Last edited: Feb 12, 2017
    cyberpunk likes this.
  13. Ryiah

    Ryiah

    Joined:
    Oct 11, 2012
    Posts:
    21,175
    Someone who is extremely familiar with assembly (e.g. Chris Sawyer) might be able to pull it off in a reasonable time frame, but for just about everyone else a good compromise would be C++ (modern C++ compilers are great at optimizing) with custom assembly for the worst-performing portions of the program.

    No Man's Sky is one of the more recent examples I know of that used a custom C++ engine with custom assembly for the procedural generation (we found out because the Phenom II lacks SSE4 and that's what they were trying to use for all CPUs, causing it to hard crash at the splash screen).
     
  14. neginfinity

    neginfinity

    Joined:
    Jan 27, 2013
    Posts:
    13,569
    I think this benchmark provides no useful information, mostly because it measures one very specific operation out of the infinite number of instruction combinations a compiler could produce.


    A good compromise would be C++, maybe with C-style code for the portions that need high performance. NO assembly. Assembly is not portable, and because of that, relying on it is a bad idea.
     
    RavenOfCode and GarBenjamin like this.
  15. Ryiah

    Ryiah

    Joined:
    Oct 11, 2012
    Posts:
    21,175
    You could always do both but I agree that assembly is best avoided unless you know exactly what you're doing and are only going to support a very limited selection of platforms.
     
  16. GarBenjamin

    GarBenjamin

    Joined:
    Dec 26, 2013
    Posts:
    7,441
    I sort of agree. I look at it like this: the test @Arowx came up with represents about the minimum CPU-intensive task. Such a loop should be converted into highly optimized code that is (or was at one time, anyway) cached inside the CPU, allowing it to run at maximum speed. So looking at it from this angle... this represents the absolute best-case scenario possible.

    So there is value in that as it definitely illustrates the whole where does it all go point.

    However, there is always the consideration of just how much time is being used up simply getting the time value from the system clock. I have no idea if that is part of the CPU these days or truly its own little chip with a line to the CPU. Perhaps that is a bottleneck that is larger than we'd suspect. I am beginning to suspect this may be the case. I just don't know how much overhead it adds.
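    One quick way to get a feel for that would be to time the timer itself, something like this (untested sketch; it just measures a bunch of back-to-back ElapsedMilliseconds reads and will stall for a moment while it runs):

    Code (CSharp):
    using UnityEngine;
    using System.Diagnostics;

    public class TimerCostTest : MonoBehaviour
    {
        void Start()
        {
            Stopwatch sw = new Stopwatch();
            long sink = 0;                  // keep the reads from being optimized away
            const int reads = 10000000;

            sw.Start();
            for (int i = 0; i < reads; i++)
            {
                sink += sw.ElapsedMilliseconds;   // the call under suspicion
            }
            sw.Stop();

            print(reads + " ElapsedMilliseconds reads took " + sw.ElapsedMilliseconds
                  + " ms (sink = " + sink + ")");
        }
    }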
     
    Last edited: Feb 12, 2017
  17. neginfinity

    neginfinity

    Joined:
    Jan 27, 2013
    Posts:
    13,569
    The minimum CPU-intensive task is busy-waiting through the message loop every modern program has.
    I believe the last time I checked the "framerate" of this on a simple Windows application... that was a LONG time ago, on a single-core machine... the result was something around 40,000 (forty thousand) "frames" per second. That is for a program doing absolutely nothing aside from calculating fps.

    The problem with the benchmarks (both yours and @Arowx's) is that they are written pretty much with "real mode" thinking in mind. The program does not exist in a vacuum; it is not the only thing running on the computer, so the whole "X million instructions per second" metric is not very useful.

    Instead there's a message loop, and when the program is not processing any OS message, it has a chance to do something useful.
     
    GarBenjamin likes this.
  18. neginfinity

    neginfinity

    Joined:
    Jan 27, 2013
    Posts:
    13,569
    IIRC on the Windows platform the proper way to get precise timing is the QueryPerformanceCounter routine. There was a popular CPU instruction for measuring time, but it went bust when CPUs became multi-core and true parallelism became possible.

    Querying the time may actually be an expensive and imprecise operation on its own (using GetTickCount on a fast machine could actually tell you that "zero milliseconds passed since the last frame" on some occasions). Rather than checking "elapsed time", what you'd need to do is run the loop through, say, one hundred million iterations and see how long it took, without measuring elapsed time within each iteration.
     
    GarBenjamin likes this.
  19. GarBenjamin

    GarBenjamin

    Joined:
    Dec 26, 2013
    Posts:
    7,441
    Oh yeah, I get that. That is just a big part of the overhead, what I call the horizontal stuff. Basically the OS is always doing something. If our program is waiting then other stuff is happening... and when ours wants to run it may actually need to wait for the other crap to finish up first. And even when we are not waiting, Windows is still going to take some slices... a timer goes off for this thing or that thing... oh, better check for Windows updates or maybe defrag the hard drive, etc.

    It's the reason I wish we were still using a DOS or C64 style approach. lol


    Yeah, this is what I am thinking... previously when I timed stuff I always approached it from the other direction... I get the start time... I then do about a million (or in this case probably a billion) iterations and then get the end time.

    But I wanted to follow the pattern @Arowx was using. I think we need to change it though, because I really think it is the constant checking of the time that is the real bottleneck here.
     
  20. goat

    goat

    Joined:
    Aug 24, 2009
    Posts:
    5,182
    Those high-level instruction sets are just protocols, and the operations they describe, which those higher-level protocols get turned into, are still binary operations. More transistors means more binary operations in parallel, but it is still nowhere near as fast as you think, and displays keep getting bigger and bigger too, using up much of the expanded transistor capacity that gets added.
     
  21. GarBenjamin

    GarBenjamin

    Joined:
    Dec 26, 2013
    Posts:
    7,441
    Okay... using the normal approach... not checking the time passed every dang loop iteration.

    Now we can better see what we have going on...

    Code (CSharp):
    using System;
    using System.Windows.Forms;
    using System.Diagnostics;

    namespace CPUperfTest
    {
        public partial class frmCPUTest : Form
        {
            long[] aCounters = new long[5];
            int counter = 0;
            int index = 0;
            long total = 0;
            long average = 0;
            int temp1 = 0;
            int temp2 = 0;
            int IterationCount;

            const int MILLIONS = 73;

            public frmCPUTest()
            {
                InitializeComponent();
            }

            private void btnStartTest_Click(object sender, EventArgs e)
            {
                Stopwatch oTimer = new Stopwatch();

                lblResult.Text = "calculating...";
                lblResult.Refresh();
                Application.DoEvents();

                IterationCount = MILLIONS * 1000000;

                for (index = 0; index < 5; index++)
                {
                    oTimer.Start();

                    temp1 = 0;
                    for (counter = 0; counter < IterationCount; counter++)
                    {
                        // just some filler to give the CPU something to churn on
                        temp1++;
                        temp1 = temp1 * 2;
                        temp1 = temp1 + 1;
                        temp1 = temp1 - 1;
                        temp1 = temp1 / 2;

                        if (temp1 > (IterationCount / 2))
                            temp2 = temp1 / 2;
                        else
                            temp2 = temp1 * 2;
                    }

                    oTimer.Stop();
                    aCounters[index] = oTimer.ElapsedMilliseconds;
                    oTimer.Reset();
                }

                total = 0;
                for (index = 0; index < 5; index++)
                    total += aCounters[index];

                average = total / 5;

                lblResult.Text = "AVERAGE: " + average.ToString() + " milliseconds";
            }
        }
    }
    You can adjust the MILLIONS constant as needed... I put it at 73 because that puts the time on my computer right about 1 second. Just a little over.

    I also added a bit of actual code inside the loop so the CPU has to actually do something.

    WORST / SLOWEST RUN:


    BEST / FASTEST RUN:


    So the average of these two is 1008 milliseconds... basically 1 second which means if we targeted 60 fps we could get about 1.21 million loops performing some basic processing every frame.

    This is just to get a baseline. The real point here (I think) is to compare how much impact running this inside Unity has, if any.

    I am now going to convert this to B3D and AGK. I am kind of wondering if possibly B3D may do better this time. It is so hard to predict this stuff really. B3D compiles to native machine code. C# is bytecode. Yeah I know compile on demand, etc. etc. just keeping it simple for the sake of it. But B3D is also 32-bit and the C# is probably running as 64-bit.

    Anyway, enough of that craziness... time to convert. Hopefully @Arowx or someone runs the above to get a baseline (all you need to do is create a new Windows Forms project, then add a start button named btnStartTest and a label, font size 20, named lblResult... takes maybe 15 seconds) AND THEN converts the above to run inside Unity in the meantime.
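    For anyone converting it, an untested MonoBehaviour sketch along these lines should be equivalent (same MILLIONS constant and filler arithmetic; the result goes to the console, and it will stall the Editor for roughly five seconds while it runs):

    Code (CSharp):
    using UnityEngine;
    using System.Diagnostics;

    public class FixedIterationTest : MonoBehaviour
    {
        const int MILLIONS = 73;

        void Start()
        {
            Stopwatch timer = new Stopwatch();
            long[] elapsed = new long[5];
            int iterationCount = MILLIONS * 1000000;
            int temp1 = 0, temp2 = 0;

            for (int run = 0; run < 5; run++)
            {
                timer.Reset();
                timer.Start();

                temp1 = 0;
                for (int counter = 0; counter < iterationCount; counter++)
                {
                    // same filler arithmetic as the Forms version
                    temp1++;
                    temp1 = temp1 * 2;
                    temp1 = temp1 + 1;
                    temp1 = temp1 - 1;
                    temp1 = temp1 / 2;

                    if (temp1 > (iterationCount / 2))
                        temp2 = temp1 / 2;
                    else
                        temp2 = temp1 * 2;
                }

                timer.Stop();
                elapsed[run] = timer.ElapsedMilliseconds;
            }

            long total = 0;
            for (int i = 0; i < elapsed.Length; i++)
                total += elapsed[i];

            print("AVERAGE: " + (total / 5) + " milliseconds (temp2 = " + temp2 + ")");
        }
    }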
     
  22. neginfinity

    neginfinity

    Joined:
    Jan 27, 2013
    Posts:
    13,569
    Speaking of display sizes, I was quite surprised when I realized that rendering a game in 4K resolution (3840*2160) may easily require 200 megabytes of memory just for the display buffers (32-bit ARGB final output, 4-channel floating-point buffers for HDR, 4 bytes per pixel of depth).
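    (Roughly: 3840 x 2160 is about 8.3 million pixels, and 4 bytes of ARGB + 16 bytes for a 4-channel float HDR buffer + 4 bytes of depth is 24 bytes per pixel, which works out to about 199 MB.)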
     
  23. GarBenjamin

    GarBenjamin

    Joined:
    Dec 26, 2013
    Posts:
    7,441
    Blitz 3D version


    WORST / SLOWEST RUN:


    BEST / FASTEST RUN:


    So the average of these two is 1023 milliseconds... basically 1 second which means if we targeted 60 fps we could get about 1.19 million loops performing some basic processing every frame. C# is still a tiny bit faster.

    For the heck of it I also converted it to AGK BASIC, which is what I am currently developing all of my games in. I had to reduce the loop down to around 6 million... about 6.5 million loops can be done in 1 second. Which isn't surprising, since AGK BASIC is truly bytecode that is interpreted as it executes. So I have about 108,000 loops' worth of this kind of processing I can do each frame targeting 60 FPS. I should be fine; that is plenty of time to do what I need to get done. :)

    I also converted it to GLBasic and it performed slightly worse than B3D. I should convert it to C++. I think that would really zip along at high speed but at this point honestly I am burnt out on this.

    A good chunk of dev time spent, and in the end we didn't check the important things such as layers of abstraction, and more importantly haven't checked the impact of putting the code inside Unity.

    Most important of all I have used up my dev time for today and done absolutely nothing on my game project. Tomorrow I will for sure because it has been several days now since I've done anything.

    One thing I think we got out of all of this is that yes... we have powerful machines and yes there is too much other crap sucking up a good chunk of that power. Basically we upgrade to better gear so MS and others can use a good portion of it.

    Another thing we got out of it is confirming that C# is quite darn efficient. At least as far as standard programming goes and compared against some options other than C/C++ and Assembly. Again obviously the more work someone is doing such as jumping through layers, using more generic abstracted stuff and so forth may impact that but those are choices a person makes. C# on its own performs very well. I really am tempted to do a C++ version just to compare.
     
    Last edited: Feb 12, 2017
  24. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Asked the same question on the C# forum and got a rather informative response.
    So I used ILSpy to check out the IL code instructions and this is the inner loop:

    Code (CSharp):
    1. // loop start (head: IL_002b)
    2.             IL_002b: ldarg.0
    3.             IL_002c: dup
    4.             IL_002d: ldfld int32 OneFrameTest::counter
    5.             IL_0032: ldc.i4.1
    6.             IL_0033: add
    7.             IL_0034: stfld int32 OneFrameTest::counter
    8.             IL_0039: ldarg.0
    9.             IL_003a: ldfld int32 OneFrameTest::counter
    10.             IL_003f: ldarg.0
    11.             IL_0040: ldfld int32 OneFrameTest::checkCycle
    12.             IL_0045: rem
    13.             IL_0046: brtrue IL_005c
    14.  
    15.             IL_004b: ldarg.0
    16.             IL_004c: ldfld class [System]System.Diagnostics.Stopwatch OneFrameTest::timer
    17.             IL_0051: callvirt instance int64 [System]System.Diagnostics.Stopwatch::get_ElapsedMilliseconds()
    18.             IL_0056: ldc.i4.s 16
    19.             IL_0058: conv.i8
    20.             IL_0059: cgt
    21.             IL_005b: stloc.0
    22.  
    23.             IL_005c: ldloc.0
    24.             IL_005d: brfalse IL_002b
    25.         // end loop
    Code (CSharp):
    do
    {
        this.counter++;
        if (this.counter % this.checkCycle == 0)
        {
            flag = (this.timer.ElapsedMilliseconds > 16L);
        }
    }
    while (!flag);
    So you can see that my simple loop generates about 20 IL ops, two of which are external function calls to Stopwatch.

    Also, you may note that my initial approach naively checked the elapsed milliseconds every loop; this new version only checks every so many iterations, but it uses a modulus, which is slow.

    So after removing the modulo and replacing it with a moving bounds check you get this:

    Code (CSharp):
    1. // loop start (head: IL_0032)
    2.             IL_0032: ldarg.0
    3.             IL_0033: dup
    4.             IL_0034: ldfld int32 OneFrameTest::counter
    5.             IL_0039: ldc.i4.1
    6.             IL_003a: add
    7.             IL_003b: stfld int32 OneFrameTest::counter
    8.             IL_0040: ldarg.0
    9.             IL_0041: ldfld int32 OneFrameTest::counter
    10.             IL_0046: ldloc.0
    11.             IL_0047: bne.un IL_0069
    12.  
    13.             IL_004c: ldloc.0
    14.             IL_004d: ldarg.0
    15.             IL_004e: ldfld int32 OneFrameTest::checkCycle
    16.             IL_0053: add
    17.             IL_0054: stloc.0
    18.             IL_0055: ldarg.0
    19.             IL_0056: ldfld class [System]System.Diagnostics.Stopwatch OneFrameTest::timer
    20.             IL_005b: callvirt instance int64 [System]System.Diagnostics.Stopwatch::get_ElapsedMilliseconds()
    21.             IL_0060: ldarg.0
    22.             IL_0061: ldfld int64 OneFrameTest::millisecondsLimit
    23.             IL_0066: cgt
    24.             IL_0068: stloc.1
    25.  
    26.             IL_0069: ldloc.1
    27.             IL_006a: brfalse IL_0032
    28.         // end loop
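    For reference, the C# that produces that IL is roughly this (reconstructed from the listing above; nextCheck is the local the IL keeps in ldloc.0, and counter, checkCycle, timer and millisecondsLimit are the same fields as before):

    Code (CSharp):
    int nextCheck = checkCycle;
    bool done = false;
    do
    {
        this.counter++;
        if (this.counter == nextCheck)          // moving bound instead of a modulo
        {
            nextCheck += this.checkCycle;       // advance the bound by one check cycle
            done = this.timer.ElapsedMilliseconds > this.millisecondsLimit;
        }
    } while (!done);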
    And a build gets 285,816,000 loops per second, but considering there are about 12 IL ops in the main loop (IL_0032 to IL_0047, IL_0068-9) that's about 3,429,792,000 ops a second!

    Yay, learnt something and found my 3.5 GHz!

    What are you doing with your 16 ms or 58 million (GHz x 16) ops a frame?
     
    Last edited: Feb 12, 2017
    GarBenjamin likes this.
  25. neginfinity

    neginfinity

    Joined:
    Jan 27, 2013
    Posts:
    13,569
    As far as I'm aware .NET/Mono is supposed to convert IL code into CPU code. So in the end your CPU won't be executing IL code, but something else.
     
    GarBenjamin likes this.
  26. GarBenjamin

    GarBenjamin

    Joined:
    Dec 26, 2013
    Posts:
    7,441
    LOL! Good deal. I'd still like to see it in C or C++ as well having it run on a super lean OS if there is such a thing these days.
     
  27. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    So has anyone figured out how to run 141,173,635 Manic Miner games on a modern CPU (3.5 GHz)?



    Or how many Manic Miner games could you run at once?
     
  28. neginfinity

    neginfinity

    Joined:
    Jan 27, 2013
    Posts:
    13,569
    Definitely not several hundred million of them.

    One instance of a ZX Spectrum program would require 64K of memory plus several megabytes for the emulator, and there may be hidden overhead for OS bookkeeping. You'll run out of RAM long before you get a chance to approach the number you suggested.

    I'd say you'll run a few hundred instances tops, but even that is not guaranteed. Realistically I'd expect fewer than a hundred programs running simultaneously.
     
  29. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    In theory you are only running one program and emulating one ZX Spectrum, but with many instances of the active memory of the emulated Z80, e.g. the registers/stack pointers.

    In a similar way to class instances, each instance only needs to store the dynamic memory items, not the code or static memory items.

    And the ZX Spectrum ran at 25 fps, so you have 40 ms x 3.5 GHz = 140 million ops a frame.

    Assuming we want to run the Spectrums in real time.
     
    Last edited: Feb 12, 2017
  30. neginfinity

    neginfinity

    Joined:
    Jan 27, 2013
    Posts:
    13,569
    There was no distinction between static and dynamic memory on the machine. You need to store the whole block of memory. At most you can optimize away a quarter of it (used by the ROM), but that's it. It is pretty much like real-mode DOS, only with a fixed architecture.

    IIRC the ZX Spectrum did not have a framerate. It had a 7 MHz CPU, but no framerate. There's a slight chance that I've forgotten about a ROM subroutine that could wait for a vblank, but I doubt it existed in the first place. The circuit responsible for TV output was polling RAM at a fixed interval (whether the picture was ready or not), and its output was half-frames at 50 or 60 Hz depending on the video standard.

    ----

    Why are we even discussing it, anyway? Go to a legal ZX Spectrum archive, get a ROM or two, and try to run one hundred million instances of it on your computer. See how it goes.
     
    Ryiah likes this.
  31. GarBenjamin

    GarBenjamin

    Joined:
    Dec 26, 2013
    Posts:
    7,441
    7 MHz?! Are you sure about that? That seems pretty extreme for 8-bit computers of the day. Most were around 1 MHz like the C64. I know some z80 CPU based machines of that time hit up to 4 MHz so it is possible it was that fast but just very surprising. 7 MHz is like incredibly fast for that time and was seen later on the 16-bit computers Amiga and Atari ST.
     
  32. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    GarBenjamin likes this.
  33. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    So if we could convert Z80 opcodes to x86 opcodes, a 3.5 MHz Z80 CPU could do 3,500,000 ops a second, or 140,000 ops a frame.

    Therefore in theory we could only run 1,000 ZX Spectrums (a Z80: an 8,500-transistor CPU) on our modern 3.5 GHz (1.2 billion transistor) CPU.

    Well, 1,000 ZX Spectrums per core, so on 8 cores = 8,000 ZX Spectrums.
     
  34. neginfinity

    neginfinity

    Joined:
    Jan 27, 2013
    Posts:
    13,569
    I believe that my ZX Spectrum clone was running at 7 MHz. Back when I had it I was damn sure about it. However, it was long ago and the machine is long dead, so I cannot confirm it.

    PAL TV monitors ran at 50 Hz, but were sending interlaced half-frames. Odd lines, then even lines. Not the same thing as 25 fps.
    In the Spectrum itself there was no concept of a "framerate". Traditionally, games draw into a backbuffer, wait for vblank, then swap the front and back buffers. On the Spectrum there is no vblank and no backbuffer.

    It doesn't work that way.
    There will be state-switching/scheduling overhead, you'll still need to emulate the hardware, and instruction conversion might not be possible. Due to the limited memory the computer could use such amazing stuff as self-modifying programs, or rely on fixed, predetermined addresses. All of this will obviously break the moment you attempt to convert opcodes.
     
    GarBenjamin likes this.
  35. GarBenjamin

    GarBenjamin

    Joined:
    Dec 26, 2013
    Posts:
    7,441
    One thing for sure is that where a bunch of that performance has gone is simple (I know here I go again... but just saying... lol)... we're not dealing with 320 x 240 16-color video displays these days. Nor synthesized sounds and so forth. I wish we were but we're not. And although our dedicated video cards should be easily handling the graphics this simple change in video display is representative of the change to everything in general. Just more of this. More of that. All around. And like someone also mentioned back then writing highly optimized code was the norm. These days writing very sloppy & otherwise wasteful code is the norm. All of this adds up.
     
  36. GarBenjamin

    GarBenjamin

    Joined:
    Dec 26, 2013
    Posts:
    7,441
    Well, what you could do instead of all of this theorizing is set up a 1920 x 1080 display and split it into a 7x5 grid of 256x192 displays, each running a tiny retro remake you create of an old Spectrum game like the one you showed.

    So you'd have 35 Spectrum displays visible on screen at the same time. Not quite sure how you'd play them all unless you just had each virtual display running the exact same game. And still it would not be close to how much the raw computing power has increased since that time.

    It might be more beneficial to just realize you have a lot more power now than they had... maybe not as much as you should have based on the increase in raw hardware specs but still... enough to make anything they made and then jack it up by a significant factor.

    Basically you could just be making these games and getting them done.

    Like this guy who remade Manic Miner in AGK2 BASIC


    And this guy who remade Manic Miner in Gamemaker Studio


    And this person who remade Manic Miner in 3D using... something... perhaps Unity or maybe Blitz3D or maybe DarkBasic?


    I think that is an important thing to keep in mind btw when I look at retro games I often wonder what it would be like to remake them as completely 3D games. I think many would translate very well.

    Anyway... these folks are making games! :)

    We are not because we are spending our time talking about stuff on the forums. lol

    And speaking of that... I need to actually get some work done on my own game very soon. Need to have another enemy. Maybe a rolling enemy or a flying enemy. Or maybe it flies around swoops down at player and eventually lands and then rolls. Hmm... decisions.
     
    Last edited: Feb 12, 2017
  37. neginfinity

    neginfinity

    Joined:
    Jan 27, 2013
    Posts:
    13,569
    Good point. Last page made me wonder why the hell I'm explaining details of PAL standard and Spectrum arch on unity forums.
     
    GarBenjamin likes this.
  38. Ryiah

    Ryiah

    Joined:
    Oct 11, 2012
    Posts:
    21,175
    Most computers of the time period had specialized chips for video and sound generation, but the ZX series was built around the idea of having the processor handle these tasks. I don't know if there was a name for the concept back then, but these days it's called bit banging.

    http://electronics.stackexchange.com/questions/44670/what-is-bit-banging
     
    GarBenjamin likes this.
  39. GarBenjamin

    GarBenjamin

    Joined:
    Dec 26, 2013
    Posts:
    7,441
    That's wild.... I was thinking about that earlier wondering if it might be the case because I don't recall any games on those systems really standing out technically as using so much more processing power.

    The SMS console was a cool one because it had a 4 MHz Z80 plus a graphics chip handling sprites, etc. I may have to get one again sometime. Add to the collection. I have an NES and Genesis and still have my Amiga, but gave away the SMS long ago.
     
  40. hippocoder

    hippocoder

    Digital Ape

    Joined:
    Apr 11, 2010
    Posts:
    29,723
  41. MV10

    MV10

    Joined:
    Nov 6, 2015
    Posts:
    1,889
    01000001 01110011 01110011 01100101 01101101 01100010 01101100 01111001 00100000 01101001 01110011 00100000 01100110 01101111 01110010 00100000 01101110 00110000 00110000 01100010 01110011 00101110 00100000 01001101 01100001 01100011 01101000 01101001 01101110 01100101 00101101 01101100 01100001 01101110 01100111 01110101 01100001 01100111 01100101 00100000 01000110 01010100 01010111 00100001
     
  42. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Translation: Assembly is for n00bs. Machine-language FTW! -> http://www.convertbinary.com/
     
    Last edited: Feb 13, 2017
  43. Ryiah

    Ryiah

    Joined:
    Oct 11, 2012
    Posts:
    21,175
    01001001 00100000 01110100 01101000 01101001 01101110 01101011 00100000 01110111 01100101 00100000 01100001 01101100 01101100 00100000 01101011 01101110 01101111 01110111 00100000 01110111 01101000 01100101 01110010 01100101 00100000 01101001 01110100 00100000 01100111 01101111 01100101 01110011 00101110 00100000 01011010 01100101 01110010 01101111 01110011 00100000 01100001 01101110 01100100 00100000 01001111 01101110 01100101 01110011 00101110

    I think we all know where it goes. Zeros and Ones.
     
    MV10 likes this.
  44. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    The thing is, at the moment it's not easy with Unity to access all your CPU cores/threads, or to manage your usage of them within your target frame rate.

    Let's say I write a single-screen platformer, then try to get as many instances of the platformer running in Unity within 16 ms as possible.

    In theory I should have about 50 million ops per frame per core, give or take the ops Unity will use for 2D physics, sound, sprites/rendering.

    Ideally Unity would be able to run multiple scenes at the same time, across all available cores; then a multi-region game would still be updated as NPCs do their thing.
     
  45. GarBenjamin

    GarBenjamin

    Joined:
    Dec 26, 2013
    Posts:
    7,441
    What is the purpose of it? Not saying I don't find it interesting but where are you going with this?
     
  46. neginfinity

    neginfinity

    Joined:
    Jan 27, 2013
    Posts:
    13,569
    Utilizing the whole CPU won't give you a thing if you're GPU bottlenecked, and if you aren't even doing anything that requires full CPU utilization.
     
    Kiwasi and Ryiah like this.
  47. Tautvydas-Zilys

    Tautvydas-Zilys

    Unity Technologies

    Joined:
    Jul 25, 2013
    Posts:
    10,678
    I know this may sound old, but use a profiler. Really. It will tell you exactly where your time goes.
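    For your own code, the quickest way to make it show up under a useful name in the CPU profiler is to wrap it in a custom sample, e.g. (illustrative snippet; the loop is just placeholder work):

    Code (CSharp):
    using UnityEngine;
    using UnityEngine.Profiling;

    public class ProfiledBehaviour : MonoBehaviour
    {
        void Update()
        {
            Profiler.BeginSample("MyHeavyWork");        // appears under this name in the Profiler window
            long sum = 0;
            for (int i = 0; i < 100000; i++) sum += i;  // placeholder work
            Profiler.EndSample();
        }
    }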
     
  48. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Unity is in the process of updating the Mono/.NET compiler and framework, so soon we will have access to Parallel foreach loops and can more easily multithread our code.
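    For pure number crunching that never touches the Unity API (which is main-thread only), the kind of thing that becomes straightforward looks like this sketch (illustrative only, assuming the newer runtime; the work in the loop body is a placeholder):

    Code (CSharp):
    using System;
    using System.Threading.Tasks;
    using UnityEngine;

    public class ParallelWork : MonoBehaviour
    {
        float[] results = new float[100000];

        void Update()
        {
            // Spread a pure-computation loop across the available cores.
            // Only plain C# data inside the body - no Unity API calls.
            Parallel.For(0, results.Length, i =>
            {
                results[i] = (float)Math.Sqrt(i) * 0.5f;
            });
        }
    }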

    Is monitoring the elapsed milliseconds the only way my code would know when it needs to stop processing and allow Unity to render the frame?

    Or how can I take advantage of all the power of a modern CPU without stepping on Unity's toes in the tango that is my game's 16 ms frame?
     
  49. neginfinity

    neginfinity

    Joined:
    Jan 27, 2013
    Posts:
    13,569
    Normal programmer's workflow:

    1. Know what you want to do and how you want your game to behave.
    2. Implement it
    3. Find a portion that takes too long using profiler, and fix the bottleneck.
    4. Repeat #3 forever or until satisfied with performance.

    Monitoring elapsed milliseconds will only slow down your program.
    Trying to "utilize modern CPU" without even needing it will only waste development time and is a premature optimization.

    Basically, you're being distracted by technical features, technologies and buzzwords from actually working on your project.
     
    Kiwasi, Ryiah and MV10 like this.
  50. MV10

    MV10

    Joined:
    Nov 6, 2015
    Posts:
    1,889
    You forgot to add "as usual."
     
    HolBol, Ryiah and neginfinity like this.