Objectifying the Smalltalk Virtual Machine

Tony Hannan
May 2004

Smalltalk is touted as a pure object-oriented system, yet the virtual machine is a huge piece of C code. Why can't there be objects like Garbage Collector and Message Dispatcher, and even hardware objects like Memory, Disk, and Processor? Of course, some of the core semantics have to be implemented at a lower level, but much of the VM functionality can be implemented as objects. For example, a copying garbage collector can be implemented using objects that are allocated in To space after the initial flip. This is acceptable as long as the number of objects the collector generates is insignificant compared to the number of objects it is collecting. If not, critical sections can be rewritten to use lower-level semantics like explicit deallocation.
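
To make the copying-collector idea concrete, here is a minimal sketch in Python of a Cheney-style semispace copy. The names Obj and collect, and the model of an object as a list of fields, are assumptions made for illustration and are not part of the proposal. Note that the collector's own bookkeeping (the to_space list) lives on the surviving side, in the spirit of allocating collector objects in To space after the flip.

```python
class Obj:
    """A heap object: a list of fields that are immediates or Obj references."""

    def __init__(self, *fields):
        self.fields = list(fields)
        self.forward = None  # set once this object has been copied to To space


def collect(roots):
    """Copy everything reachable from roots into a fresh To space.

    Returns (new_roots, to_space). Anything left behind in From space is garbage.
    """
    to_space = []

    def copy(value):
        if not isinstance(value, Obj):
            return value                      # immediate value, nothing to copy
        if value.forward is None:             # first visit: evacuate the object
            value.forward = Obj(*value.fields)
            to_space.append(value.forward)
        return value.forward                  # later visits reuse the same copy

    new_roots = [copy(root) for root in roots]

    scan = 0
    while scan < len(to_space):               # To space doubles as the work queue
        obj = to_space[scan]
        obj.fields = [copy(field) for field in obj.fields]
        scan += 1
    return new_roots, to_space
```

Shared references and cycles survive the copy because each From-space object records a forwarding pointer the first time it is evacuated.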

Native Code Compilation

Converting the VM to objects requires that objects (message passing and methods) be fast. We can achieve this by getting rid of the interpreter and compiling methods down to optimized machine code. Critical code can be written directly in the lower-level intermediate language.

Compiling to machine code does not mean we are giving up late binding and dynamism (such as creating methods at run time). Machine-code objects will be dynamic objects just like bytecode objects.

Low-Level Typed Intermediate Language

The intermediate language will be low-level, like assembly code. This intermediate form should be common across all platforms to maintain portability of low-level code. It should also be conducive to optimization. Typed CPS (Continuation Passing Style) is a good intermediate language. Below is the factorial example written in this intermediate language. It is verbose compared to higher-level bytecodes, so it will not be kept persistently unless there is no higher-level representation.

n : Int
returnLabel : (Int Any)
returnEnv : Any
Factorial (n returnLabel returnEnv)
	t := n <= 0
	t -> returnLabel (1 returnEnv)
	Allocate (FactEnv Recurse n returnLabel returnEnv _)
env : FactEnv
Recurse (env n returnLabel returnEnv _)
	env.n := n
	env.returnLabel := returnLabel
	env.returnEnv := returnEnv
	n := n - 1
	Factorial (n Return env)
fact : Int
Return (fact env)
	n := env.n
	fact := n * fact
	Overflow -> LargeFact (n fact env)
	returnLabel := env.returnLabel
	returnEnv := env.returnEnv
	returnLabel (fact returnEnv)
type : Type
returnLabel : (type Any Any Any Any)
a b c d : Any
size : Int
top : Addr
Allocate (type returnLabel a b c d)
	newObject := HeapTop : type
	size := type.size
	top := HeapTop + size
	t := top >= MaxHeap
	t -> GC (type returnLabel a b c d)
	HeapTop := top
	returnLabel (newObject a b c d)

The last line of a basic block is always a goto. The arrow -> means if-goto. Lowercase variables are local to their basic block. Variable types are declared outside basic blocks, and each declaration applies to all variables of the same name below it.
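
For readers more comfortable with a mainstream language, the control flow of the listing above can be approximated in Python. This is only a sketch under stated assumptions: every basic block returns the next label and its arguments instead of jumping, and a small driver loop (run, a hypothetical name) performs the goto. The names halt, fact_return, and fact are likewise illustrative. Python integers never overflow, so the Overflow -> LargeFact branch has no counterpart here, and Allocate is replaced by ordinary object construction.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class FactEnv:
    """Explicit activation record, mirroring the FactEnv type in the listing."""
    n: int
    return_label: Callable
    return_env: Any


def factorial(n, return_label, return_env):
    if n <= 0:
        return return_label, (1, return_env)        # t -> returnLabel (1 returnEnv)
    env = FactEnv(n, return_label, return_env)      # stands in for Allocate + Recurse
    return factorial, (n - 1, fact_return, env)     # Factorial (n Return env)


def fact_return(fact, env):
    # Corresponds to the Return block: multiply, then go to the saved label.
    return env.return_label, (env.n * fact, env.return_env)


def halt(result, out):
    out.append(result)   # final continuation: record the answer
    return None          # no next block, so the driver stops


def run(label, args):
    """Driver loop: each block ends by naming the next block, like a goto."""
    step = (label, args)
    while step is not None:
        label, args = step
        step = label(*args)


def fact(n):
    out = []
    run(factorial, (n, halt, out))
    return out[0]
```

Because every call is a jump through the driver loop, the Python call stack stays flat and the only growing structure is the chain of FactEnv records, which is exactly what the typed CPS form makes visible.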

Types

Objects are instances of named record types. Types use type-name equivalence, and a type can be declared a subtype of another type. Types correspond to Smalltalk classes. The default type for all record fields (instance variables) in Smalltalk is Any (Object), but a field can be given any specific type, thus providing optional static typing. Implicit type casts, like from Any to Env, are dynamically checked. Explicit type casts are unsafe type conversions, but only a few are allowed, like from memory address to object. There are three tag types: 00 = SmallInteger (Int), 11 = Object pointer, and 10 = Memory address.
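
The three tags can be illustrated with a small Python sketch. The text above does not say where the tag bits live or how wide a word is, so this sketch assumes the tag occupies the two low-order bits of a 32-bit word and that addresses are 4-byte aligned. All function names are hypothetical.

```python
TAG_MASK = 0b11
TAG_INT = 0b00    # SmallInteger
TAG_ADDR = 0b10   # raw memory address
TAG_OBJ = 0b11    # object pointer
WORD_MASK = 0xFFFFFFFF  # assumed 32-bit machine word


def tag_int(value):
    """Encode a 30-bit signed integer; the low two bits end up as 00."""
    return (value << 2) & WORD_MASK


def untag_int(word):
    assert word & TAG_MASK == TAG_INT
    value = word >> 2
    if value >= 1 << 29:       # sign-extend from 30 bits
        value -= 1 << 30
    return value


def tag_addr(address):
    assert address % 4 == 0    # assumed alignment leaves the low bits free
    return address | TAG_ADDR


def addr_to_obj(word):
    """Unsafe explicit cast from memory address to object: only the tag changes."""
    assert word & TAG_MASK == TAG_ADDR
    return word | TAG_OBJ


def untag_pointer(word):
    return word & ~TAG_MASK


def kind(word):
    tag = word & TAG_MASK
    if tag == TAG_INT:
        return "Int"
    if tag == TAG_ADDR:
        return "Addr"
    if tag == TAG_OBJ:
        return "Object"
    raise ValueError("tag 01 is unused in this scheme")
```

One consequence of giving SmallInteger the 00 tag, under these assumptions, is that two tagged integers can be added without untagging: tag_int(3) + tag_int(4) equals tag_int(7).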

Security

Should low-level operations like explicit type conversion be available to the Smalltalk programmer? I say yes, but they should reside in a different namespace that is not available to high-level Smalltalk methods. Only when you enter Core modules should you be able to use the Intermediate Language directly along with its low-level operators like type conversion.

Portability

Since the VM has some platform-specific code, the new Smalltalk image will not be portable in the same way. But this is not a problem, since we can achieve portability by defining a common format for sharing. This format is the image segment format. Objects, including bytecode methods and CPS methods, can be transported across platforms in a platform-independent format. Method objects can encode changes to be executed on the other end to regenerate platform-specific objects. Still, if this is not enough, the System Tracer can be used to generate an image file for a different platform.

Garbage Collection

Activation records (continuations) are explicitly created at the CPS (intermediate) Language level, and since they are typed, they are safe for the garbage collector to trace. Local variables in the CPS Language are also typed, so they are safe to trace as well. However, each CPS instruction expands to one or more machine instructions, which may introduce new intermediate values. These are untyped and potentially untagged, so they cannot be traced. Furthermore, they become invalid if their target objects move during garbage collection.

One solution to this problem is to never allow a thread to get suspended in an intermediate machine state. This involves delaying interrupt handling until the current thread yields itself when it is in a stable state. When an interrupt is received, the system simply sets a flag, stores the event, then resumes the current thread. The current thread checks this flag whenever it is in a stable state, or less often if desired. If the flag is set, the thread suspends itself and then jumps to the event handler. The current Smalltalk implementation works similarly.
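
The polling scheme can be sketched in Python as follows. This is a single-threaded simulation under stated assumptions: Scheduler, on_interrupt, safe_point, and run_thread are hypothetical names, and an interrupt "arriving" in the middle of a step is modeled by calling on_interrupt between the begin and end of that step.

```python
from collections import deque


class Scheduler:
    def __init__(self):
        self.interrupt_pending = False
        self.events = deque()
        self.handled = []

    def on_interrupt(self, event):
        """Interrupt entry point: store the event, set the flag, and return."""
        self.events.append(event)
        self.interrupt_pending = True

    def safe_point(self):
        """Called by the running thread only when it is in a stable state."""
        if not self.interrupt_pending:
            return
        self.interrupt_pending = False
        while self.events:                       # now it is safe to run handlers
            self.handled.append(self.events.popleft())


def run_thread(sched, n_steps, arrivals):
    """Run n_steps instructions; arrivals maps a step number to an event."""
    trace = []
    for step in range(n_steps):
        trace.append(("begin", step))
        if step in arrivals:                     # interrupt lands mid-instruction
            sched.on_interrupt(arrivals[step])
        trace.append(("end", step))              # instruction completes untouched
        before = len(sched.handled)
        sched.safe_point()                       # poll only at a stable state
        for event in sched.handled[before:]:
            trace.append(("handled", event))
    return trace
```

In the trace, an event that arrives during step 1 is handled only after step 1 has ended, so no handler (and hence no garbage collection) ever observes a half-finished instruction.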

Another solution is to allow a thread to get suspended at any time, but have the garbage collector roll back the unfinished CPS instruction. When an interrupt is received, it can be processed right away. If it triggers a garbage collection, suspended threads are rolled back appropriately before being traced. In order for this to work, a few restrictions need to be in place. First of all, the critical machine instruction that affects the CPS state must always be the last machine instruction for that CPS instruction. In other words, once it occurs we are in a stable CPS state; otherwise rolling back would try to undo a state change that could have been read by another thread. Second, intermediate machine values can't be live across CPS instructions, because their values will become invalid. Finally, a map from machine code to CPS code has to be generated or saved so we know how far to roll back and which registers are garbage intermediate values.
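
A rough Python sketch of the map from machine code to CPS code might look like this. The table layout, the CpsEntry record, and the roll_back function are assumptions for illustration; a real implementation would operate on saved thread state rather than a dictionary of registers.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class CpsEntry:
    start_pc: int        # first machine instruction of this CPS instruction
    scratch: frozenset   # registers that hold untyped intermediate values


def roll_back(table, pc, registers):
    """Return (restart_pc, traceable_registers) for a suspended thread.

    table must be sorted by start_pc; pc is the next machine instruction the
    thread would have executed. Because the state-changing machine instruction
    is always the last one of its CPS instruction, restarting at start_pc
    never undoes a visible state change.
    """
    starts = [entry.start_pc for entry in table]
    index = bisect_right(starts, pc) - 1
    if index < 0:
        raise ValueError("pc precedes the first mapped instruction")
    entry = table[index]
    # Intermediate values are garbage after rollback, so hide them from the GC.
    traceable = {reg: val for reg, val in registers.items()
                 if reg not in entry.scratch}
    return entry.start_pc, traceable
```

A thread suspended at a CPS boundary is already in a stable state, so for it the function returns the same pc and leaves the registers unchanged, since no intermediate value may be live across CPS instructions.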

The first solution requires polling, but the second solution restricts liveness, which may have adverse effects on optimization. I am undecided which one I like better, but neither one seems too bad.

Conclusion

The many advantages of this proposal are: (1) objectifying the VM will modularize and simplify the VM, (2) compiling all methods to optimized machine code should speed up the system even if some of the VM code is not as fast as its original optimized C, (3) the low-level intermediate language allows alternative languages and implementations to coexist in the same image, (4) optional types will document interfaces and help find programmer errors, and (5) having the intermediate language and the machine language in the image gives the Smalltalk programmer more control over the machine.

Related Work

Sophisticated implementations of Lisp, Scheme, and ML also compile down to machine code, forgoing a VM. However, they do not implement would-be VM code, such as the garbage collector, in the same high-level language. This is an advantage of the Smalltalk philosophy, where everything (as much as possible) is available to the user for inspection and modification.