C# Lesson: Why is string an object?

Discussion in 'Community Learning & Teaching' started by PhobicGunner, Aug 15, 2014.

  1. PhobicGunner

    PhobicGunner

    Joined:
    Jun 28, 2011
    Posts:
    1,813
    This is something newcomers to Unity often run into when learning C#: strings in C# are actually objects (reference types), unlike ints, floats, and other value types. That means creating strings will generate garbage.

    But why? Why didn't Microsoft make string a value type, like int, float, and the other primitive types in .NET? The answer lies in the difference between the heap and the stack, and in how computers pass data around.

    Everything I've learned here was learned over the weekend while designing a retro console-style virtual machine which runs inside Unity, which means I've learned all of this information from firsthand experience.

    So, first, how the CPU works. The CPU has a few registers available to it, and these can store numeric values temporarily. Obviously that isn't enough room for all your variables, so there's also RAM available to the CPU.
    On 32-bit x86, two registers are set aside for managing the stack, generally referred to as the Stack Pointer (ESP) and the Base Pointer (EBP). We'll just look at ESP for now.

    Now, the way the stack works is that the Stack Pointer points at an address in memory, and the stack actually grows "downwards", toward lower addresses. So this is what our stack might look like initially (we'll assume each block is a 32-bit slot; I apologize for my poor diagram skills):
    EmptyStack.png
    Now let's say I want to push the integer value 50 onto the stack. This involves moving the Stack Pointer back 4 bytes, and then pasting that value into the new location in memory pointed to by SP:
    Stack_01.png
    And maybe then I push the value 15:
    Stack_02.png
    Now, let's say I want to pop the topmost value off of the stack. This involves first copying the value at the location pointed to by SP (15) into a CPU register (EAX, say), and then moving SP forward by 4 bytes (instead of backwards).
    After I pop the topmost value (15) off of the stack, I'm back to having SP point at the location which contains the value 50. If I then pop again, I'm back to where I was before, with an empty stack and SP pointing at the end.
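    The push and pop steps above can be sketched in C# as a toy model. This is only an illustration: the ToyStack class, its 16-byte size, and its member names are made up for this example and are not how the CPU or the .NET runtime is actually implemented.

```csharp
using System;

// Toy model of a downward-growing stack. All names here are illustrative.
public sealed class ToyStack
{
    private readonly byte[] memory;

    // Address of the most recently pushed value.
    // On an empty stack it points just past the end of memory.
    public int StackPointer { get; private set; }

    public ToyStack(int sizeInBytes)
    {
        memory = new byte[sizeInBytes];
        StackPointer = sizeInBytes;
    }

    public void Push(int value)
    {
        if (StackPointer < 4) throw new InvalidOperationException("Stack overflow");
        StackPointer -= 4;                                         // move SP back 4 bytes
        BitConverter.GetBytes(value).CopyTo(memory, StackPointer); // paste the value at SP
    }

    public int Pop()
    {
        if (StackPointer > memory.Length - 4) throw new InvalidOperationException("Stack is empty");
        int value = BitConverter.ToInt32(memory, StackPointer);    // copy the value out first
        StackPointer += 4;                                         // then move SP forward 4 bytes
        return value;
    }
}
```

    With a 16-byte ToyStack, SP starts at 16. Pushing 50 and then 15 moves it to 12 and then 8, and two pops hand back 15 and then 50, leaving SP at 16 again.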

    This is how local variables of value types are typically stored. If you do this inside a method:

    Code (csharp):
    int i = 50;
    bool b = false;
    Vector3 v = Vector3.one;

    These are all translated into numeric values, or sets of numeric values (in the case of the Vector3, it's three floats), which conceptually are pushed onto the stack when they are assigned and popped off of the stack when they go out of scope. The above might translate to pseudo-assembly which looks like this:

    Code (csharp):
    push 50          ; i
    push 0           ; b
    push 0x3f800000  ; v.x, hex translation of 1.0f
    push 0x3f800000  ; v.y, hex translation of 1.0f
    push 0x3f800000  ; v.z, hex translation of 1.0f

    The compiler actually keeps track of where these are going to be on the stack (more precisely, it computes an offset from the Base Pointer, the register often called EBP), so for example if you were to do this:

    Code (csharp):
    i = 10;

    It might produce something like this code, which pastes the value 10 into the location on the stack that 'i' hypothetically occupies:

    Code (csharp):
    mov dword [ebp - 4], 10  ; overwrite the 4 bytes that hold i

    As value types are fixed in size, they are quite easy to modify "in place".
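    To see the fixed size and the in-place behavior from the C# side, here is a small sketch. Vec3 is a hypothetical stand-in for Unity's Vector3 (so the example compiles outside Unity), and ValueTypeDemo is just a name I picked for the example.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical stand-in for UnityEngine.Vector3: three floats, nothing else.
public struct Vec3
{
    public float x, y, z;
    public Vec3(float x, float y, float z) { this.x = x; this.y = y; this.z = z; }
}

public static class ValueTypeDemo
{
    // Size in bytes of one Vec3: three 4-byte floats.
    public static int Vec3Size()
    {
        return Marshal.SizeOf(typeof(Vec3));
    }

    // Assigning a struct copies its values; changing the copy leaves the original alone.
    public static float CopyThenModify()
    {
        Vec3 a = new Vec3(1f, 1f, 1f);
        Vec3 b = a;   // copies all three floats
        b.x = 10f;    // overwrites b.x in place, a is untouched
        return a.x;
    }
}
```

    Vec3Size() reports 12 bytes no matter what values the struct holds, and CopyThenModify() returns 1 because b is a separate copy of a.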

    Now, back to strings. Remember, everything boils down to numbers as far as the computer is concerned, and for strings it's a bunch of numeric values which translate to characters.
    So, let's say we want to push a string value onto the stack. Say, "Hello world" (I'll assume ASCII, one push per character).
    That might look like this:

    Code (csharp):
    push 0x48 ;H
    push 0x65 ;e
    push 0x6c ;l
    push 0x6c ;l
    push 0x6f ;o
    push 0x20 ;(space)
    push 0x77 ;w
    push 0x6f ;o
    push 0x72 ;r
    push 0x6c ;l
    push 0x64 ;d

    Yuck. Lots of code, not very efficient, takes up lots of stack space to boot. Now, remember when I talked about modifying local variables?
    It works in that case because those variables are fixed in size. That is, whatever value you assign to an int is always going to be 32 bits, so you can just paste it into the same location in memory with no risk of stomping on anything else.

    However, strings are not fixed in size, so the same thing can't be done: a longer replacement string would run over whatever sits next to it on the stack.

    Now, what about strings as objects? If we represent a string as an object, instead of a value type, then the string is never allocated on the stack. Instead, it's allocated on the heap. The heap is a giant pile of memory that a program can request blocks of bytes from and return blocks of bytes to. Think of it like a pool: you can request memory locations from the pool, and release memory locations back to the pool.

    Now, if we want to allocate the above string "Hello world", we can request 12 bytes from the heap (11 ASCII characters plus, say, a terminating byte), and then paste the string value into the memory address we get back. Then, guess what? We've got a memory address that points to the beginning of the string in memory. That value is fixed in size, and in fact can be pushed onto and popped off of the stack!
    Now, to address the previous problem. We want to change the string. How exactly does the string being on the heap solve this?
    The answer is it doesn't! You still can't change the string (well, actually, in C++ you can, so long as the new string's size is the same as or less than the old string's). You can, however, create a new string, and by nature of how the heap works you can release the old string back to the heap.
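    To put rough numbers on the heap allocation described above, here is a hedged sketch. It only measures the character data and the size of a reference; a real .NET string object also carries a header and a length field, which this ignores. StringSizeDemo is a made-up name for the example.

```csharp
using System;
using System.Text;

public static class StringSizeDemo
{
    // Bytes needed for the characters alone at one byte per character (ASCII).
    public static int AsciiBytes(string s)
    {
        return Encoding.ASCII.GetByteCount(s);
    }

    // Bytes needed for the characters alone in UTF-16, the encoding .NET strings use.
    public static int Utf16Bytes(string s)
    {
        return Encoding.Unicode.GetByteCount(s);
    }

    // Size of a reference on this platform: 4 or 8 bytes, whatever the string's length.
    public static int ReferenceSize()
    {
        return IntPtr.Size;
    }
}
```

    For "Hello world", AsciiBytes gives 11 and Utf16Bytes gives 22, and both grow with the text. ReferenceSize() stays at 4 or 8 depending on the platform, which is what makes a reference cheap to keep on the stack.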

    And THAT is why strings cause garbage collection! Since you can't modify the value of the string, in C# what happens instead is that an entirely new string is created, and the Garbage Collector is tasked with cleaning up the old one.
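    You can watch this happen from C#. The sketch below keeps a second reference to a string, "changes" the first one, and checks that the two no longer refer to the same object. StringGarbageDemo is a hypothetical name used only for this example.

```csharp
using System;

public static class StringGarbageDemo
{
    // Returns true if "changing" a string produced a different object
    // and left the old one exactly as it was.
    public static bool ConcatCreatesNewObject()
    {
        string original = "Hello";
        string before = original;   // second reference to the same object
        original += " world";       // allocates a new string; "Hello" itself is untouched

        bool differentObject = !object.ReferenceEquals(before, original);
        bool oldValueIntact = before == "Hello";
        return differentObject && oldValueIntact;
    }
}
```

    Every += in a loop does the same thing, so each iteration leaves one more abandoned string behind for the Garbage Collector.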
     
  2. PhobicGunner

    PhobicGunner

    Joined:
    Jun 28, 2011
    Posts:
    1,813
    Another point, raised by Tamschi on the Facepunch forums:

    If strings were passed by value, you would end up with a stack push per character (or, perhaps, per 2 or 4 characters) every single time. That's a lot of memory operations, and is not at all efficient. Passing the string around by reference, however, is only a single stack push (a pointer to the string).
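    A quick way to check this, as a sketch: pass a large string to a method and test whether the method received the very same object. PassingDemo and its members are made-up names, and the 100,000-character length is arbitrary.

```csharp
using System;

public static class PassingDemo
{
    private static string stored;

    // True if the argument is the exact object held in 'stored', not a copy of it.
    private static bool ReceivedSameObject(string str)
    {
        return object.ReferenceEquals(str, stored);
    }

    public static bool PassingDoesNotCopy()
    {
        stored = new string('x', 100000); // 100,000 characters on the heap
        return ReceivedSameObject(stored); // only the reference travels to the method
    }
}
```

    PassingDoesNotCopy() returns true: the callee sees the caller's object, so none of the 100,000 characters were copied to make the call.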
     
    Last edited: Aug 15, 2014
  3. Eric5h5

    Eric5h5

    Volunteer Moderator Moderator

    Joined:
    Jul 19, 2006
    Posts:
    32,401
    But strings are passed by value? Unlike a char array, for example, which is passed by reference. Strings are kind of weird hybrid thingies, where technically they're char arrays but in many ways they don't behave like it.

    --Eric
     
  4. PhobicGunner

    PhobicGunner

    Joined:
    Jun 28, 2011
    Posts:
    1,813
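    They are NOT passed by value, at least not in the sense of copying the characters. What gets passed is a reference to the string (strictly speaking, the reference itself is copied, but the string it points to is not).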

    So when you do this:

    Code (csharp):
    string myString = "Hello, world!";
    Foo( myString );

    void Foo( string str )
    {
        // str refers to the same string object as myString right now!
        // Rather than the characters of myString being pushed onto the stack (which would be inefficient),
        // a pointer to the string is pushed instead.
    }

    In Foo, str actually refers to the same string object as myString.
    BUT, if you do this:

    Code (csharp):
    void Foo( string str )
    {
        str = "Awesome";
    }

    Then you're effectively pointing str at a different string, so now str references the new string instead of the string it was passed. The caller's myString still refers to the original. Same as if you did this:

    Code (csharp):
    SomeClass myclass = new SomeClass();
    bar( myclass );

    void bar( SomeClass aClass )
    {
        aClass = new SomeClass();
    }

    You didn't overwrite myclass, you simply pointed aClass at a new instance of SomeClass.
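    Here is the same idea as something you can run, with the ref keyword added for contrast. This is a sketch: ReassignDemo and its method names are invented for the example, and ref is standard C# but wasn't part of the snippets above.

```csharp
using System;

public static class ReassignDemo
{
    // Plain parameter: str is the method's own copy of the reference.
    private static void Reassign(string str)
    {
        str = "Awesome"; // only the local copy now points at "Awesome"
    }

    // ref parameter: str is an alias for the caller's variable.
    private static void ReassignByRef(ref string str)
    {
        str = "Awesome"; // the caller's variable now points at "Awesome"
    }

    public static string AfterPlainCall()
    {
        string myString = "Hello, world!";
        Reassign(myString);
        return myString;
    }

    public static string AfterRefCall()
    {
        string myString = "Hello, world!";
        ReassignByRef(ref myString);
        return myString;
    }
}
```

    AfterPlainCall() returns "Hello, world!" and AfterRefCall() returns "Awesome". Only ref lets a method repoint the caller's variable.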

    EDIT: If strings were passed by value, that would result in a lot of memory copies (so would be fairly inefficient), not to mention they would totally chew up your available stack space (which is limited by nature)

    EDIT 2: So yeah, strings really are just sequences of chars sitting on the heap, much like a char array. It's just that .NET doesn't let you modify those chars (as a design decision); it only ever lets you create new strings (thus the term immutable).
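    As a sketch of what that immutability looks like in practice: the string indexer is read-only, so to change a character you copy the chars out, edit the copy, and build a new string. ImmutableDemo and WithFirstChar are hypothetical names for this example.

```csharp
using System;

public static class ImmutableDemo
{
    // Returns a new string with the first character replaced.
    // Assumes s is non-empty. Writing s[0] = c directly would not compile.
    public static string WithFirstChar(string s, char c)
    {
        char[] chars = s.ToCharArray(); // a copy of the characters
        chars[0] = c;                   // arrays are mutable, so this is allowed
        return new string(chars);       // a brand new string object
    }
}
```

    Calling WithFirstChar("Hello", 'J') gives back "Jello", while the "Hello" you passed in is exactly as it was.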
     
    Last edited: Aug 15, 2014
  5. Eric5h5

    Eric5h5

    Volunteer Moderator Moderator

    Joined:
    Jul 19, 2006
    Posts:
    32,401
    OK, that makes sense. Although since strings are immutable, it seems like for most purposes they may as well be passed by value, in terms of how the code works. But it's nice to know that just passing a string into a function doesn't actually make a copy.

    --Eric