Sunday 4 January 2009

Embedding values in C++ pointers

Short version:

Read, compile and run the following piece of C++ source code:

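#include <cstdint>
#include <iostream>

class Smi {
public:
  static Smi* fromInt(int value) {
    // Widen to a pointer-sized integer first so the cast is also valid on 64-bit targets.
    return reinterpret_cast<Smi*>(static_cast<std::intptr_t>(value));
  }
  int value() {
    // The "pointer" is the value: read it back as an integer.
    return static_cast<int>(reinterpret_cast<std::intptr_t>(this));
  }
  void sayHello() {
    std::cout << "Hello, my value is "
              << this->value() << "."
              << std::endl;
  }
};

int main(void) {
  Smi* five = Smi::fromInt(5);
  Smi* seven = Smi::fromInt(7);
  five->sayHello();
  seven->sayHello();
  return 0;
}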


Notice that the only state class Smi has is the integer value "embedded" in the "this" pointer.

Lil' longer version:

C++ allows you to do pure magic. From time to time I see a piece of C++ code that makes me think: "Does this even compile?" A few days ago I discovered one of those "gems" within the source code of V8 (Google Chrome's JavaScript engine).

Let's begin with a quiz: what do you think the following code could be for?

reinterpret_cast<int>(this)


Now with some more context...

int Smi::value() {
  return reinterpret_cast<int>(this) >> kSmiTagSize;
}


Ummmm..... :|
Well, let's unveil the mystery...

V8 models every representable entity available in JavaScript (ECMAScript) as a class, all of them deriving from class Object, as the comments in the file objects.h suggest:

//
// All object types in the V8 JavaScript are described in this file.
//
// Inheritance hierarchy:
//   - Object
//     - Smi          (immediate small integer)
//     - Failure      (immediate for marking failed operation)
//     - HeapObject   (superclass for everything allocated in the heap)
//       - JSObject
//         - JSArray
//         - JSRegExp
//         - JSFunction
...


Every instance of these entities is allocated and managed by class Heap, V8's runtime memory manager. When Heap is asked to allocate an Object, it returns an Object*, but that pointer carries a hidden surprise, as the comments in objects.h show:


// Formats of Object*:
// Smi: [31 bit signed int] 0
// HeapObject: [32 bit direct pointer] (4 byte aligned) | 01
// Failure: [30 bit signed int] 11


These comments tell us three things (apart from the obvious one: what Heap returns as pointers to Object are not really pointers...):
  • The least significant bits of the "pointer" carry a "tag" that indicates the kind of Object.

  • In the case of Smi* and Failure*, the remaining bits do not store any kind of pointer, but a numeric value (31 and 30 bits long respectively). This is how an Smi* is created...

    Smi* Smi::FromInt(int value) {
      ASSERT(Smi::IsValid(value));
      return reinterpret_cast<Smi*>((value << kSmiTagSize) | kSmiTag);
    }

    ...and this is how its value is retrieved...

    int Smi::value() {
      return reinterpret_cast<int>(this) >> kSmiTagSize;
    }

    Thus V8 avoids the "overhead" of storing both a pointer and a pointee: the pointer is the value.

  • When the "pointer" really does point to a HeapObject instance, its 30 most significant bits carry the actual address of the HeapObject. Since that address is aligned to 4 bytes, its two least significant bits are always zero, and that space is used for the tag. To illustrate this, here is the code that builds the tagged pointer from a true object address:

    HeapObject* HeapObject::FromAddress(Address address) {
      ASSERT_TAG_ALIGNED(address);
      return reinterpret_cast<HeapObject*>(address + kHeapObjectTag);
    }



The trick works as long as you don't try to dereference one of those Object*...

Like some stuff in V8's native code generators that I blogged about some time ago, this "tagged pointer" trick is not new: it can also be found in Strongtalk and the Self VM (Smalltalk and Self virtual machines, respectively, which share creators with V8 :p)

Hope you enjoyed this curious trick!
