Documentation of the Internals of Mikelisp ========================================== Structures ---------- The basic structure to understand is struct obj in lisp.h. This is a 4-word structure (except when DEBUG is defined) that has a pointer to an object type struct, a reference count, and a 2-word union whose interpretation depends on the objtype (for instance, a string object has a string length and a pointer to the string data). Objtypes are static structures created at compile time. They have a name (just a static C string), a name object (to be returned from a call to the object-type function), and a number of "methods": tostring return a string object representing this object predropobj called when the object's refcount drops to 0 dropobj called when the object is about to be freed init called when an objtype is registered exit called when an objtype is about to be unregistered The difference between predropobj and dropobj is that predropobj is called as soon as the reference count drops to 0 (so it should handle things like closing the FILE* of a stream object as soon as the object is inaccessible) whereas dropobj is called when the object is about to be reallocated (so it handles things like DECREFing the car and cdr of a cons). The reason to have two is that I developed mikelisp to be able to run real-time code; having to go through and DECREF all of the objects in a big cons tree as soon as its root is dropped prevents that from happening. This way, there's an upper bound to how much mikelisp can be doing behind the scenes between each instruction; take the worst-case runtime on that background work, assume mikelisp does that for every instruction, and you can calculate the maximum time it will take mikelisp to run a particular piece of code. This real-time requirement is also why I don't use a garbage collector; it's just too hard to determine worst-case runtime. Reference counts are incremented and decremented "frequently"; that is to say, on every change in the number of references to the object. The builtin C functions are considered to have a reference to the objects as well, so if you pass an object to a "regular" C function you'll need to INCREF() the object before passing it in (if you're just passing in a copy) and you'll need to DECREF() the result when you're done with it. There are some C functions that do not behave like that; they're generally the ones that take something other than obj*s as arguments or return something other than an obj*. Functions that return an obj* return 0 on failure (signifying that they raised an error). The VM handles passing that error up the call chain until a catcher is found. 0 is never a valid obj*; both #f and the cdr of a 1-element list are &nilobj. Files ----- The .c files are broken out mostly by object type. Some .c files (compiler.c, functions.c, and macroexpand.c) are generated from the corresponding .lisp file by the compile-file function. The other files that don't correspond to object types are: main.c the location of the main() function opcodes.c generated from vm.h by Perl preload.c #includes the .c files generated from .lisp code throw.c throw and catch functions, and exception/exceptionvalue obj*s util.c miscellaneous utility functions for internal use vm.c implements the VM; may eventually subsume bytefunc.c Flow Control ------------ The program starts out in main.c by registering all the objtypes, calling the *_init functions and aborting if anything returns true (i.e., failure). main_init and env_init create and define the #argv and #environment objects, respectively. With that, we're ready to call run_vm() (in vm.c), which tries to lookup #init and then call it as a function. #init is defined in functions.lisp, which gets compiled into functions.c, where it is included into preload.c. Assuming (#init) returns an object (as opposed to 0 to signify "returning" an exception), run_vm will DECREF() it and return a boolean (false if it was &nilobj, true otherwise). Then main() calls the *_exit functions. util_exit() handles dropping all objects on the free list (so that files are closed, etc.). If DEBUG is defined, clean_up() will end by checking whether there are any objects still allocated. If there are, the list of remaining objects are printed on exit so that debugging can occur. Note that any objects (or scopes) remaining after execution are indicative of a bug in mikelisp itself; lisp code run in mikelisp should never be able to cause such a condition, even in the presence of invalid bytecode. The Virtual Machine ------------------- In vm.c, run_vm() looks up the value of #init and calls run_it() with that value. The main execution occurs in run_it(), which sets up a context (defined in vm.c) and starts executing it. When the initial function (the one passed in from run_vm()) executes a RET instruction, run_it() returns the result to run_vm(), which DECREF()s it and returns true if the value isn't the #f object. The VM is purely stack-based, with a singly-linked list of contexts (i.e., contexts of execution, or functions being evaluated). Each context has a single stack for arguments (implemented as a flat array of obj*s per context for now, but this could be a VM-wide single big cons'd list) and a possibly-empty singly-linked list of exception catchers. When an exception is thrown, contexts are popped until one with a catcher is found. Execution continues with the program counter and stack pointer given in the catcher, and that catcher is discarded. A catcher is type-agnostic and will catch any exception; if the catcher only wants to handle a particular type of exception, it needs to re-throw the exceptions it doesn't care about. Note that it can re-throw unconditionally to implement unwind-protect functionality. A bytefunc is composed of a read-only string and a list obj*. The string is composed of instructions (which are bytes with MSB=0), each of which is preceeded by an argument (0 or more bytes with MSB=1). If the argument is zero bytes long it is taken to be 0; otherwise, the 7 LSB from each argument byte are concatenated (big endian) and sign-extended to form the argument. The list is used for the LOCALOBJ instruction, which takes the list obj*, does CDRs and one CAR, and pushes the result on the stack. Opcodes ------- Opcodes are defined in vm.h and are implemented in vm.c. To add new instructions, add the opcode to vm.h, add a tracing line to vm.c in the DEBUG section near the top of the main loop, add the code to handle the instruction in the big switch in vm.c, and possibly add it to the disassemble and disassemble-print functions in compiler.lisp. Also, if it's a jump-type instruction, add it to the flow-control-opcode? function and to two places in the byte-compile-flatten function in compiler.lisp. The Makefile automatically generates opcodes.c from vm.h using Perl; if you don't have Perl, you need to update it by hand. Below is a list of opcodes, with their bytecode representation, opcode name, and a brief description of what they do. Consult vm.c for their definitions. <#> in the description refers to the argument to that opcode: ! not Replaces the top item with #t or #f # needargs Throws an exception if there aren't exactly <#> values on the stack % bitsel Replaces top three items with their bitselection (see integer.c) & and Replaces the top two items on the stack with #f or the second item ' quote Replaces the top item with a quoted version of itself * mul Replaces the top two items with their product + add Replaces the top two items with their sum , unquote Unquotes a quoted item (throws exception if item isn't quoted) - sub Replaces the top two items with their difference / div Replaces the top two items with their quotient 0 nil Pushes #f on the stack 1 extype Pushes the type of the last exception onto the stack 2 exvalue Pushes the value of the last exception onto the stack : throw Generates an exception of type and value <2nd-item> = eq Replaces top two items with #t or #f > asr Arithmetic shift right <2nd-item> bits and pop (This is only here because it's used in arg-to-raw-bytecode) ? cmp Replaces top two items with their comparison (based on <#>) A define Define the word as the value <2nd-item> (popping them) C call Call the function with the next <#> items as args D pop Drop the top <#>+1 items F bf Pop the top item and branch <#> bytes ahead if it's #f G goto Branch <#> bytes ahead unconditionally I index Push a copy of the item <#> items below onto the stack L localobj Do <#> cdrs on the function's localobj and push its car on the stack O objtype Replace the top item with its objtype P put Pop then replace item <#> below new with it (Basically the complement of the index opcode) R ret Pop as our return value, pop <#> items, then return (The stack should be empty afterwards, or an exception is thrown) S set Replace the value of word with <2nd-item> (popping them) T bt Pop the top item and branch <#> bytes ahead if it's not #f U undefine Pop the word and remove its value X exch Reverse the positions of the top <#>+2 items on the stack a car Replace the top item with its car c cons Replace the top two items with (cons <2nd-item> ) d cdr Replace the top item with its cdr i pushi Push <#> on the stack l lookup Replace word with its value p pair Replace with #t if it's a cons or #f otherwise r retcall Like a (call . <#>) with an immediate (ret . 0) after it (Doesn't save the caller's stack, so it allows tail recursion) { catch Set a branch location for where the next exception will go (On exception, stack height will be set to what it was at the catch) (You can't ret/retcall with a "live" catch; use toss to clear it) | or Replaces the top two items on the stack with the "first" non-#f one } toss Clear the most-recent catch in this function (or error if none) Quoting ------- The quote function takes any obj* and returns a quote obj* that points to that obj*. Evaluating a quote obj* simply unwraps the quoted obj* and returns the original obj*. Since mikelisp doesn't have special forms (all list items are evaluated before a function is called), function arguments that aren't to be evaluated need to be quoted. For example, you need to use: (#define 'foo 42) instead of: (#define foo 42) There are convenience macros for define and other functions so you can do: (define foo 42) Preloaded .lisp Files --------------------- The preloaded .lisp files define a bunch of macros and functions. The initial function called (#init) checks if #argv starts with "-l"; if so, it will try to (load-file) the rest of the arguments and then nuke #argv. Otherwise, if #argv has at least one element, it will try to (load-file) that element, leaving the rest of the elements in #argv. Then a read-eval-print loop is called. Macroexpansions --------------- Macro expansions (defined with macro-define) are given the entire list as an argument (i.e., with the operator element at the start). To indicate that macro expansion should not be performed on the list, the macro should return its argument. (This is safe because a cons cannot be changed after being created, so returning that cons means the rest of the list must be unchanged.) Compiler -------- Nearly all functions are byte-compiled (meaning they run in the VM). There are a few "rawfuncs" (like open-file) that are a different objtype and are handled through a call to a C function. There are no C functions that call back into the VM, so there are no concerns of overflow of the C argument stack or call stack. Functions like throw are implemented by way of a "byte compile helper" function. A function definition that's just been read in as a Lisp expression is first macroexpanded (in the read-eval-print loop), which turns it into a call to the byte-compile function with the arguments and body of the function as args. The byte-compile function passes those arguments to the byte-assemble function, which generates a numargs opcode (which ensures the right number of objects are on the stack at the start of the function) and appends the results of the call to byte-compile-helper and a ret opcode (which returns from the function). Opcodes are of the form (opcodename . argument), e.g., (pushi . 42) or (goto . -2) or (localobj . hello). The byte-compile-helper function takes a Lisp expression, a list of "named values" on the stack (the top, or active, part of the stack first) and the stack height of the innermost loop, and recursively generates a list of opcodes to evaluate the expression. For example, (+ 1 x) as the expression and (#f x) as the list of named values would generate ((pushi . 1) (index . 2) (add . 0)), which pushes a literal 1 on the stack, copies the third-highest value on the stack (the value of the local variable x), and adds the two together, leaving the value on the stack. When byte-compile-helper is called upon to compile a cons expression, it either calls a helper function for that known function (e.g., byte-compile-add for that (+ 1 x) expression) or else the byte-compile-helper-unknown-function function. After generating the list of opcodes, the localobj opcodes are parsed to generate a list of unique localobjs needed and assign their positions in the localobj list. Then the list of opcodes is passed to byte-compile-flatten which makes multiple passes through the list of opcodes and translates the branching opcodes' arguments from "number of instructions to skip" to "number of bytes to skip", and returns the bytecode string representing those opcodes. Optimizer --------- When initially compiling a function into a list of opcodes, labels are not used; rather, the arguments for branch opcodes are simple integers that tell how many opcodes to branch forwards (positive) or backwards (negative) from the next opcode. However, the disassemble function converts branch arguments into words and inserts those words as labels into the opcode list. This allows the peephole optimizer to know which runs of opcodes will always be executed in series. The reassemble function takes out the labels and translates their uses back into "number of instructions to skip" before reassembling the function. Optimization starts with the optimize-assembly function, which runs all the optimizer stages sequentially over and over until no further progress is made, then expands all (ret . x) opcodes where x>0 and loops over the optimizer stages again. The optimizer stages are given the whole instruction list and either return the original list (i.e., something eq? to its argument) or else an optimized list. It is important that the optimizer stages always return a "better" list if they don't return the original; if there exist optimizations that might keep changing the list forever, the optimizer will run forever on some inputs. The bulk of the optimization occurs in the optimize-assembly-peephole function, which looks at runs of one to four opcodes and tries to replace them with better alternatives. It has the ability to treat a label as an opcode, so it can do things like: LABEL1 nil<0> LABEL2 pop<0> -> nil<0> LABEL2 pop<0> LABEL1 The other optimizer stages do things like: * A goto pointing to a ret (or retcall or goto) can be changed into a ret (or retcall or goto) * Unreachable instructions are removed * Gotos going to the next instruction can be removed * Conditional branches going to the next instruction are changed to pops * Unused labels are removed, as are labels equivalent to previous labels There is also support for "binding" a function to itself to allow for tail recursion. Normally, a function like: (define (f a n) (ifelse (< a 2) n (f (- a 1) (* n a)))) will do a lookup of the value of f every time, just in case it changed. For example, if you did (define g f) and then (set (f a n) 42), (g 3 1) would return 42. Binding takes a name and a function, and changes retcalls in the function that call a function with the given name into gotos to the top of the function. After optimizing and binding f, the call to the function is translated into a goto to the top of the function. Now if you did (define g f) and then (set (f a n) 42), (g 3 1) would return 6. Optimization and binding are not done automatically. The compile-file function will optimize bytefuncs it compiles into C code, but expressions parsed at run time are not optimized or bound unless you explicitly do so with the optimize or bind functions. Initial Objects --------------- All the bytefuncs available when mikelisp starts are defined in .lisp files that are compiled into C code by the compile-file function and its helpers. compile-file takes two filenames and tries to translate the Lisp objects in the first file into a fragment of C code in the second file. It cannot handle arbitrary Lisp objects; for example, (+ 2 3) as a top-level object will result in an error. But it does handle everything needed to get a fully-functioning runtime environment compiled into mikelisp. The objects it can handle are: * Function definitions * Macro definitions * Defining a variable as #f * Defining a variable as a cons of two words (used to make "special" values) * Calls to the define-character-macro and function-compiler helper functions Any bytefuncs to be compiled must have "simple" localobj lists. "Simple" lists in this case are lists whose cdr is #f and whose cars are all strings or words with arbitrary amounts of quoting. Note that this means you can't use quoted objects like '(1 2 3) in a function; use `(1 2 3) instead. Also note that strings that may be eq? to other strings in the Lisp code will not be eq? in the resulting C code, but since all Lisp objects are read from file this isn't a problem. For each object read, compile-file outputs one-line comments with the object, the macroexpansion of the object, and the evaluated value to be assigned to the word. For bytefunc definitions, a one-line comment with the optimized bytefunc is also output. Then the C code snippet that will generate that object is output. The C code snippets thus output expect to have obj* variables c, o, and s available for temporary use. The preload.c file just defines those variables and #include's the compiled C code files. The Makefile compiles *.lisp into obj/*.c2 and then moves obj/*.c2 to obj/*.c so that a failed compilation won't result in a broken mikelisp if you run make again.