Have you ever wondered how programming languages like Python, Java or Ruby work? Do you have an interest in building your own language? If so, this series of posts is for you. We dive into building a novel programing language from first principles. Get ready to have your hands dirty. It’s going to be a fun ride. Access the final product here : https://github.com/daviddexter/karis-programming-language
We are finally here. That is what we have been aiming for. The architecture we choose to implement the compiler is simple so that we can focus on learning. The compiler will accept the AST from the parser. With the AST at hand, we iteratively convert each node and convert to an array of bytes. Yes, bytes. You might think working with bytes is hard. Actually, that could not be further from the truth. The choice to work with bytes is so that we compress the program into numbers that represent the intentions of the program. Let’s take a look at an example. Consider the following program:
let name @string = "Karis";
println(name);
Here, the program does two independent things:
- It assigns the variable name to the value “Karis”
- It prints the value associated with the variable
name
From this, we can tell that the compiler should generate two different instructions each telling it what to do. We may choose the number 100
to mean assign while another number 200
to mean print. The choice is arbitrary. These numbers that tell the compiler what to do are called Opcodes
. We will define a specific set of instructions
that only the compiler and its accompanying virtual machine know about. The composition of all instructions bytes is called a Bytecode
which the virtual machine will consume to execute the program. Let’s get started.
Compiler setup
We will create a new package called compiler
.
cargo new --lib compiler
Next, we will define a CompilerWorker
struct. This object will be the entry point to the compiler via the APIs we’ll expose. Let’s define the struct.
pub struct CompileWorker {
program_ast: Objects,
// these are the instructions that the VM will execute
pub instructions: Rc<RefCell<Vec<Vec<u8>>>>,
// these are values that don't change. E.g literals
pub constants: Rc<RefCell<Vec<CompileObject>>>,
pub scope_id_counter: Rc<RefCell<usize>>,
}
In Line 2 the field program_ast
will hold the AST passed from the parser. In Line 5 the field instructions
is a vector of all instructions generated from the AST. Notice that we wrap the vector in an Rc
to allow multiple owners and RefCell
to allow internal mutability. In Line 8 we have the field constants
which will hold constant values from primitive types. In Line 10, the scope_id_counter
field holds information about the scope.
impl CompileWorker {
pub fn new(ast: Objects) -> CompileWorker {
CompileWorker {
program_ast: ast,
instructions: Rc::new(RefCell::new(Vec::new())),
constants: Rc::new(RefCell::new(Vec::new())),
scope_id_counter: Rc::new(RefCell::new(DEFAULT_SCOPE_ID)),
}
}
pub fn compile(&self) -> ByteCode {
todo!("")
}
pub fn global_scope_id(&self) -> [u8; 2] {
let mut global_scope_id = [0; 2];
LittleEndian::write_u16(&mut global_scope_id, DEFAULT_SCOPE_ID.try_into().unwrap());
global_scope_id
}
}
From Line 2 to Line 9 we instantiate a new CompileWorker
object. The object exposes a number of functions. The most important is compile
which calls implementation to carry out the actual compilation. Next, we need to hook in the CompileWorker
to the console so that it can be triggered from the command line.
.subcommand(
Command::new("compile")
.about("Produces an executable program from Karis source code")
.arg_required_else_help(true)
.arg(arg!(-p --filepath <PATH>).required(false))
.arg(arg!(-o --output <OUTPUTNAME>).required(false)),
)
Here we introduce a new command, compile
. The way the command should work is that it expects the user to pass the path to a file containing the source code the program was compiled and the name for the binary file.
Some(("compile", sub_matches)) => {
let file_path = sub_matches.get_one::<String>("filepath");
let output_name = sub_matches.get_one::<String>("output");
if let Some(file_path) = file_path {
if let Some(output) = output_name {
return compile_program(file_path, output);
}
}
Ok(())
}
Here, we intercept the all to the compile
command which subsequently call the compile_program
function. Let’s define it.
fn compile_program(file: &str, output_name: &str) -> Result<(), KarisError> {
let path = Path::new(file);
let path_str = path.to_str().expect("failed to get file path");
let file = std::fs::read_to_string(path_str)?;
if file.is_empty() {
println!("Nothing to compile \n");
} else {
let lx = lex::Lexer::new(file);
let mut parser = Parser::new(lx);
match parser.parse(Some("repl_evaluate_program.json")) {
Ok(program) => {
let compiler = CompileWorker::new(program);
let byte_code = compiler.compile();
}
Err(_) => todo!(),
}
}
Ok(())
}
From Line 2 to Line 4, we read the file passed from command line arguments. If the file is empty, we do nothing since there is nothing to compile. If the file has some content, we pass it to the lexer to generate appropriate tokens, and then to the parser to generate an AST. Notice we don’t involve the evaluator in this process. Once we have the AST, we instantiate the CompileWorker
with the AST and call compile. From Line 13, the appropriate bytecode has been generated but it will be the virtual machine to execute it. We’ll come back to it later. For the basic setup of the compiler is done.
Virtual machine implementation
The structure of the virtual machine is very simple. it only has one field which points to the bytecode generated by the compiler.
pub struct VM {
pub(crate) byte_code: ByteCode,
}
impl VM {
pub fn from_executable_file(file_path: &str) -> Result<VM, KarisError> {
let mut file = File::open(file_path).expect("unable to open file with error");
let mut buffer = Vec::<u8>::new();
file.read_to_end(&mut buffer)?;
let byte_code = ByteCode::try_from_slice(&buffer).unwrap();
Ok(Self { byte_code })
}
}
From Line 1 to Line 3, we define the entire structure of the virtual machine. In Line 6 we introduce the function from_executable_file
. As the name indicates, the function will called given the path of the binary file. It is expected that the binary will have an extension of krbin
. Since we are reading from a file, it means that we need to think about serialization and deserialization. Once option that we have is to use json as a serialization format. However, we want to a binary format. json is text. For our case, we will leverage a community package to borch
. Let’s now define the ByteCode
structure that will have support of binary serialization and deserialization.
#[derive(Debug, Clone, BorshSerialize, BorshDeserialize)]
pub struct ByteCode {
pub instructions: Vec<Vec<u8>>,
pub constants: Vec<CompileObject>,
pub global_scope_id: [u8; 2],
}
In Line 1 the derived headers BorshSerialize
and BorshDeserialize
give the ByteCode
struct the capability to transform to binary format and back. Once we have the binary representation, we use it write to a file. Let’s extend the CompileWorker
by adding a function that writes our program to a binary file.
pub fn write_as_executable(
&self,
output_name: &str,
byte_code: ByteCode,
) -> std::io::Result<()> {
let encoded = byte_code.try_to_vec().unwrap();
let file_name = format!("{}.krbin", output_name);
let mut file = File::create(file_name)?;
let mut perms = file.metadata()?.permissions();
perms.set_mode(0o700);
file.set_permissions(perms)?;
file.write_all(&encoded)?;
Ok(())
}
Now the virtual machine is guaranteed to ready only binary formatted files. Back to the VM
structure, we will introduce a function that will carry out the actual execution.
pub fn execute(&self) -> bool {
let mut result = true;
let instructions = &self.byte_code.instructions;
for instruction in instructions.iter() {
let instruction = instruction.clone();
let instruction = vec![instruction];
match self.executor(&instruction, None) {
Ok(_) => continue,
Err(err) => {
eprintln!("{}", err);
result = false;
break;
}
}
}
result
}
In Line 4 we iterate over the instruction and call another function called executor
to execute the instruction in context. Let’s see what this function looks like.
impl VM {
pub(crate) fn executor(
&self,
instruction: &Vec<Vec<u8>>,
params: Option<Vec<(CompileObject, CompileObject)>>,
) -> Result<CompileObject, errors::errors::KarisError> {
todo!("")
}
}
It takes a vector of instructions and returns a CompileObject
. This CompileObject
will be the result of executing every instruction. Depending on the instruction,
the CompileObject
can have an actual value or a null. Let’s now introduce the CompileObject
.
#[derive(Debug, Clone, Serialize, Deserialize, EnumAsInner, BorshDeserialize, BorshSerialize)]
pub enum CompileObject {
integer(isize),
String(String),
Boolean(bool),
Variable(Vec<u8>),
Array(ArrayObject),
Null,
}
impl CompileObject {
pub fn object_type(&self) -> &'static str {
match self {
CompileObject::Integer(_) => INTEGER_OBJECT_TYPE,
CompileObject::String(_) => STRING_OBJECT_TYPE,
CompileObject::Boolean(_) => BOOLEAN_OBJECT_TYPE,
CompileObject::Variable(_) => VARIABLE_OBJECT_TYPE,
CompileObject::Null => NULL_OBJECT_TYPE,
CompileObject::Array(_) => ARRAY_OBJECT_TYPE,
}
}
}
#[derive(Clone, Serialize, Deserialize, EnumAsInner, BorshDeserialize, BorshSerialize)]
pub enum ArrayObject {
integer(Vec<isize>),
String(Vec<String>),
Boolean(Vec<bool>),
}
impl fmt::Debug for ArrayObject {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
match self {
ArrayObject::Integer(integer) => write!(f, "{:?}", integer),
ArrayObject::String(string) => write!(f, "{:?}", string),
ArrayObject::Boolean(bools) => write!(f, "{:?}", bools),
}
}
}
The object is an enum which each field representing a certain value. Consider a program that sums integers. The result of the operation will be an integer. Hence the VM will represent that result as Compilebject::Integer(sum-value)
. Arrays are somewhat interesting. At a high level, an array should be a collection of Compilebjects
.
With this in mind, we can represent arrays as Array(CompileObject)
. However, this approach requires us to handle reference indirection and serialization/deserialization, both of which are intuitive. We will therefore introduce another object as seen in Line 22. This object will represent the array with its internal fields handling variations of array types. For Karis, arrays can only have primitive types. This makes our implementation easier. It also takes care of the need for serialization/deserialization. Next, we will update the console
to make use of the VM’s functionality.
First, we add the ability to write to a file by adding this line:
compiler.write_as_executable(output_name, byte_code)?;
Finally, we introduce the virtual machine. We accomplish this by adding a new subcommand called run
. The name is intuitive enough to indicate the intent to execute a program.
Some(("run", sub_matches)) => {
let file_path = sub_matches.get_one::<String>("filepath");
if let Some(file_path) = file_path {
let str = file_path.trim();
return run_program(str);
}
Ok(())
}
The command expects the path to the executable to be passed as an argument. It calls the run_program
function which will handle the logic of executing the program.
fn run_program(file: &str) -> Result<(), KarisError> {
let vm = VM::from_executable_file(file)?;
vm.execute();
Ok(())
}
As you can see the implementation is very simple. All it does is call the execute
function from the VM structure. We will handle that later. For now, we have a basic virtual machine structure that is now ready to receive and execute instructions.
At its simplest OPCODE
is a command that the compiler will generate for the virtual machine to execute. Since we are in the realm of using bytes, we need to way to map out different commands without explicitly using bytes. Enums to the rescue. Let’s create a few opcodes.
#[derive(Debug, Clone, PartialEq, PartialOrd)]
pub enum OpCode {
OpTerminal = -0x08,
OpNull = -0x09,
OpMain = 0x09,
OpConstant,
OpAdd,
OpMinus,
OpMultiply,
OpDivide,
OpModulus,
OpFunctionDef,
OpGetFunctionParameter,
OpReturn,
OpFunctionCaller,
OpGetCallerParameter,
OpGetBinding,
OpAddBuiltin,
OpPrint,
OpAddIfCondition,
OpGreaterThan,
OpGreaterThanOrEqual,
OpLessThan,
OpLessThanOrEqual,
OpEqualTo,
OpNotEqualTo,
OpAND,
OpOR,
OpLAND,
OpLOR,
OpBang,
OpJumpTo,
OpJumpToAlternate,
OpArray,
}
Notice each opcode has a value assigned to it. The value ranges from -8 to 255. At present, the maximum value is 38, giving us more room in the event we need to support more opcodes. Negative values have a special meaning as they mean a special intent to the virtual machine.
Here is a breakdown of what each OpCode represents:
OpTerminal
acts as an indicator to separate instructions.OpNull
is similar toOpTerminal
except that is only used when instructions overflow to more than one vector. For instance when dealing with functionsOpMain
indicates the main entry point of the programOpConstant
indicates an intent to retrieve a constant valueOpAdd
indicates an add operationOpMinus
indicates a subtraction operationOpMultiply
indicates a multiplication operationOpDivide
indicates a division operationOpModulus
indicates an operation to get the remainderOpFunctionDef
marks a function definitionOpGetFunctionParameter
marks an intent to retrieve a function parameterOpReturn
indicates an intent to return the results of an operationOpFunctionCaller
marks the function callOpGetCallerParameter
indicates an intent to retrieve parameters for a function callOpGetBinding
indicates an intent to retrieve the value for a variable bindingOpAddBuiltin
indicates the addition of a builtin functionOpPrint
indicates an intent to call the builtin print functionOpAddIfCondition
indicates the addition of a condition statementOpGreaterThan
indicate a logical greater-than checkOpGreaterThanOrEqual
indicates a logical greater-than-or-equal checkOpLessThan
indicates a logical less-than checkOpLessThanOrEqual
indicates a logical less-than-or-equal checkOpEqualTo
indicates a logical equal-to checkOpNotEqualTo
indicates a logical not-equal checkOpJumpTo
indicates an intent to jump into another set of instructions. Mostly used in conditional statementsOpJumpToAlternate
indicates jumping to an alternate set of instructions
Next, we need a way to convert the opcodes to and from a u8
. So, Rust provides the From
trait we will implement.
impl From<u8> for OpCode {
fn from(value: u8) -> Self {
if value == OpCode::OpTerminal as u8 {
OpCode::OpTerminal
} else if value == OpCode::OpMain as u8 {
OpCode::OpMain
}
/*Trimmed for readability*/
else {
OpCode::OpNull
}
}
}
Fantastic!!!. OpCode set. Now let do some compiling.
Compiling primitives should give us a taste of how instructions look like. But first, we need to add more infrastructure. Recall, the unit object of the AST is a node. For us to claim that we have compiled the program with the need to compile all nodes in the AST. We want to be able to use recursion to make our work easier. As such, we will define a trait all nodes in the AST must implement.
pub trait Compiler {
fn compile(&self,
worker: Rc<RefCell<&CompileWorker>>,
scope: SymbolScope,
scope_id: [u8; 2],
) -> Option<Vec<Vec<u8>>>;
}
The compile
method in the trait takes a worker, which is the CompileWorker
instance, a scope
, and a scope_id
. The scope tells the compiler and the virtual machine where a variable belongs to. A scope can either the Global
, Local
or Main
. While we may know which scope a variable belongs to, it is also important to know the exact scope the variable belongs to. Hence each scope has a unique id. Recall that a node is a part of the Object
enum from the parser. we will therefore implement the Compiler
trait from both Program
and Node
.
impl Compiler for Objects {
fn compile(
&self,
worker: Rc<RefCell<&CompileWorker>>,
scope: SymbolScope,
scope_id: [u8; 2],
) -> Option<Vec<Vec<u8>>> {
match self {
Self::TyProgram(program) => program.compile(worker.clone(), scope, scope_id),
Self::TyNode(node) => node.compile(worker.clone(), scope, scope_id),
_ => unreachable!("Cannot compile UNKNOWN or CONSUMABLE"),
}
}
}
impl Compiler for Program {
fn compile(
&self,
worker: Rc<RefCell<&CompileWorker>>,
scope: SymbolScope,
scope_id: [u8; 2],
) -> Option<Vec<Vec<u8>>> {
match self.non_root {
true => {
let mut res = None;
for item in self.body.iter() {
res = item.compile(worker.clone(), scope.clone(), scope_id);
}
res
}
false => {
for item in self.body.iter() {
item.compile(worker.clone(), scope.clone(), scope_id);
}
None
}
}
}
}
In Line 9, Line 10, Line 29, and Line 37 we call the compile method from the trait to pass the intent to the respective object. This is the idea we were going for with recursion. Let’s now move on to implement the same trait for Node
. The node implementation is where the bulk of compiler implementation will be located.
impl Compiler for Node {
fn compile(
&self,
worker: Rc<RefCell<&CompileWorker>>,
scope: SymbolScope,
scope_id: [u8; 2],
) -> Option<Vec<Vec<u8>>> {
let kind = self.identifier_kind.unwrap();
match kind {
IdentifierKind::INTLITERAL => {
12 let lit = self.left_child.as_ref().unwrap().as_ref().left().unwrap();
let int_value = lit.as_obj_integer_value().unwrap();
let int_lit = int_value.value.unwrap();
15 let obj = CompileObject::Integer(int_lit);
let wrk = worker.borrow();
18 let instructions = wrk.add_constant(scope, scope_id, obj);
Some(vec![instructions])
}
IdentifierKind::STRINGLITERAL => {
let lit = self.left_child.as_ref().unwrap().as_ref().left().unwrap();
24 let string_value = lit.as_obj_string_value().unwrap();
let string_lit = string_value.value.as_ref().unwrap();
let obj = CompileObject::String(string_lit.to_string());
let wrk = worker.borrow();
let instructions = wrk.add_constant(scope, scope_id, obj);
Some(vec![instructions])
}
IdentifierKind::BOOLEANLITERAL => {
let lit = self.left_child.as_ref().unwrap().as_ref().left().unwrap();
36 let bool_value = lit.as_obj_boolean_value().unwrap();
let bool_lit = bool_value.value.unwrap();
let obj = CompileObject::Boolean(bool_lit);
let wrk = worker.borrow();
let instructions = wrk.add_constant(scope, scope_id, obj);
Some(vec![instructions])
}
_ => None,
}
}
}
In Line 8 we extract the kind for node. This information will be used to guide subsequent logic flow. We then check the kind against all known identifier kinds specified in the parser as seen in Line 10. The logic for compiling primitive types is the same. Let’s examine the INTLITERAL
kind. In Line 12 we extract the literal value from the left child of the node. We do this because we are certain literal values are always on the left of a node. Since we are dealing with an instance of an integer, in Line 13 to Line 14 we retrieve the actual literal value. If the type would be different all that would change is the function call as seen in Line 24 and Line 36. In Line 15 we wrap the value in a compiler object and then pass it to the worker’s add_constant
function. This function will produce an array of bytes which will tell the virtual machine how to access the literal value. Let’s add it.
// these are public methods add instructions and symbols to the worker
impl CompileWorker {
pub fn add_constant(
&self,
scope: SymbolScope,
scope_id: [u8; 2],
obj: CompileObject,
) -> Vec<u8> {
let operand = self.append_to_constant_pool(obj);
self.emit_opcode_with_parameter(OpCode::OpConstant, scope, scope_id, operand)
}
pub fn append_to_constant_pool(&self, obj: CompileObject) -> u8 {
let mut consts = self.constants.borrow_mut();
consts.push(obj);
let idx = consts.len() - 1;
idx as u8
}
fn emit_opcode_with_parameter(
&self,
code: OpCode,
scope: SymbolScope,
scope_id: [u8; 2],
operand: u8,
) -> Vec<u8> {
// [opcode scope scopeid terminal operand terminal ]
let mut instructions = vec![code as u8, scope as u8];
// add scope id
for id in scope_id.iter() {
instructions.push(*id);
}
// add terminal
instructions.push(OpCode::OpTerminal as u8);
// add operand
instructions.push(operand);
// add tail terminal
instructions.push(OpCode::OpTerminal as u8);
instructions
}
}
A literal value is treated as a constant since it is definitive. It won’t change unless it is mutated intentionally. We therefore add it to a constant pool which is a fancy way of saying a vector of constant values. Line 12 to Line 16 demonstrates that. We get back the index which is the location of the literal value. We use this location to now generate an instruction. The helper function emit_opcode_with_parameter
is responsible for producing the actual instruction. The instruction will look this:
[ 10, 1, 233,1, 248, 0, 248 ]
10 - the byte for the `OpConstant` operation
1 - the byte for global scope
[233,1] - two bytes representing the scope id
248 - the byte representing a terminal
0 - the byte representing the location of the literal value
248 - the byte representing a tail terminal
Once we have the instructions, we will not add them to the instruction pool. This is so because literals are usually assigned to a variable. Therefore we will return the generated instruction and delegate it to the assignee variable to know how to access it. That is all that is needed to compile literal value.
Overview of name resolution strategy
Up until now, we have only considered getting our feet wet with how to compile. That changes now. Recall that values don’t exist on their own. Commonly, they are assigned to a variable. Our job will be to keep track of these variables while taking into consideration the following:
- Variables can belong to different scopes
- Variables can have the same name even in different scopes
This makes our problem space very interesting. Luckily, our initial implementation addresses the first concern. Recall instructions for primitives begin with an opcode
followed by a scope byte and two bytes representing scope id. That means we only need to work out how to match scope ids. For the second concern, if a variable name exists in scope, all that we need to do is update the value of the variable. The choice of which data structure to use to store variables is paramount. In our case, we will use a HashMap
that will serve as our symbols table.
SymbolTable implementation
We introduce a new type SymbolsTableTyping
. This type is what we will add the the CompilerWorker
as a new field representing the symbol table.
pub type SymbolsTableTyping = Rc<RefCell<SymbolStore<Vec<u8>, SymbolStoreValue<u8>>>>;
The type is a Rc-RerCell
wrapping a yet to be created struct SymbolStore
. Like we stated earlier, we will hash map, therefore the SymbolsStore
will be contain one value which will be a hash map. Since we want to have a generic SymbolStore
, the definition of the SymbolStore
will take a generic for the key and value.
#[derive(Debug, Clone)]
pub struct SymbolStore<K, V>(pub HashMap<K, V>);
Notice the HashMap field of SymbolsStore
comes from the popular hashbrown
crate. This creates a small problem. We can’t implement BorshSerialize
and BorshDeserialize
using derive tags. Such will violate Rust’s rule not implementing foreign traits on foreign types. However, our implementation uses the new type pattern so that we can implement the trait manually.
We’ll also introduce a new struct to represent the value of the symbol table.
#[derive(Debug, Clone, BorshSerialize, BorshDeserialize)]
pub struct SymbolStoreValue<T: borsh::BorshSerialize + borsh::BorshDeserialize>(pub Vec<Vec<T>>);
We are now ready to add the new symbol table as part of the CompilerWorker
.
pub struct CompileWorker {
program_ast: Objects,
pub instructions: Rc<RefCell<Vec<Vec<u8>>>>,
pub constants: Rc<RefCell<Vec<CompileObject>>>,
pub scope_id_counter: Rc<RefCell<usize>>,
pub symbols_table: SymbolsTableTyping, // <--- Newly added symbols table
}
Now that the CompileWorker
is updated appropriately we need an API that will be used to add and retrieve values from the symbol table. This means introducing two
methods in the CompileWorker
implementation block.
pub fn add_symbol(&self, binding_key: Vec<u8>, symbol: Vec<Vec<u8>>) {
let mut symbols_table = self.symbols_table.borrow_mut();
symbols_table.0.insert(binding_key, SymbolStoreValue(symbol));
}
pub fn get_symbol(&self, binding_key: Vec<u8>) -> Option<SymbolStoreValue<u8>> {
let symbols_table = self.symbols_table.borrow();
8 if let Some(sy) = symbols_table.0.get(&binding_key) {
let symbol = sy.clone();
Some(symbol)
} else {
None
}
}
In Line 3, after retrieving the instance of the symbol table, we use the passed binding key and symbol to add an entry to the symbol table. In Line 8, we use the binding key to retrieve the value from the symbol table. If a value exists wrap the value as a Some
otherwise we return a None
. We chose an Option
type so that the downstream process can correctly adjust its implementation depending on the presence of a value.
Compiling assignments begins with intercepting the ASSIGN
token which represents intent to assign to a variable. The process follows these steps:
- Extract the binding names and convert them to bytes
- Call the appropriate compiler function for the right-hand side of the assignment
- Insert into the symbols table an entry using the binding names as bytes and the bytes from the second step
IdentifierKind::ASSIGN => {
let binding = self.left_child.as_ref().unwrap().as_ref().right().unwrap().as_ty_node().unwrap();
let binding_key = binding.variable_name.as_ref().unwrap();
let binding_key_as_bytes = binding_key.as_bytes().to_vec();
}
In Line 2 we extract the binding node from the left child of the ASSIGN
node. Consider the assign statement let num @int = 1;
, it’s node looks like this:
NODE(=)
NODE(LET) NODE(1)
The left child of the ASSIGN
node is where we get the variable name that is later converted into bytes.
let rhs = self.right_child.as_ref().unwrap();
let wrk = worker.borrow();
wrk.add_symbol(binding_key_as_bytes.clone(),vec![vec![OpCode::OpNull as u8]],);
In Line 1 we retrieve the right child of the ASSIGN
node. This is what will be compiled in subsequent steps. In Line 2 we get a copy of the CompileWorker
that instantiates an entry of the binding key with a null value. We do this so that future calls, especially recursive calls can access a value instead of throwing an error.
if let Some(instructions) = left_or_right(rhs, worker.clone(), scope, scope_id) {
// update the instructions in the binding table
wrk.add_symbol(binding_key_as_bytes, instructions);
None
} else {
None
}
In Line 1 we use the left_or_right
function and pass the right-hand side of the ASSIGN
node, a copy of the worker, scope and scope id. This function will match the passed object and call its compile function. Let’s see how it is defined.
fn left_or_right(
object: &Either<LiteralObjects, Box<Objects>>,
worker: Rc<RefCell<&CompileWorker>>,
scope: SymbolScope,
scope_id: [u8; 2],
) -> Option<Vec<Vec<u8>>> {
7 match object {
8Left(left) => match left {
LiteralObjects::ObjIntegerValue(int) => {
let int_lit = int.value.unwrap();
let obj = CompileObject::Integer(int_lit);
let wrk = worker.borrow();
let instructions = wrk.add_constant(scope, scope_id, obj);
Some(vec![instructions])
}
LiteralObjects::ObjBooleanValue(bool) => {
let bool_lit = bool.value.unwrap();
let obj = CompileObject::Boolean(bool_lit);
let wrk = worker.borrow();
let instructions = wrk.add_constant(scope, scope_id, obj);
Some(vec![instructions])
}
LiteralObjects::ObjStringValue(string) => {
let string_lit = string.value.as_ref().unwrap();
let obj = CompileObject::String(string_lit.to_string());
let wrk = worker.borrow();
let instructions = wrk.add_constant(scope, scope_id, obj);
Some(vec![instructions])
}
},
Right(right) => right.compile(worker, scope, scope_id),
}
}
We expect the passed object to be either a literal or another object. Hence we assert whether it’s a literal or an object as seen in Line 8 and Line 31. If the object is a literal value, we add it as a constant and then return instructions on how to access it. For objects, we call their compile function. This implementation is called recursively until instructions are generated. That’s all that is needed.
Compiling arithmetic operations is one of the simplest tasks we will encounter in this endeavour. It all starts by intercepting the operation
node, that is either PLUS
, MINUS
, ASTERISK
, SLASH
or MODULUS
. Let’s take PLUS
as an example:
IdentifierKind::PLUS => {
todo!("")
}
Once we have intercepted the presence of the PLUS
node, we extract the left-hand side of the operation and the right-hand side of the operation.
At this point, we are certain that both sides of the operation exist. Otherwise, it would have been caught by the parser.
let left = self.left_child.as_ref().unwrap();
let left = left_or_right(left, worker.clone(), scope.clone(), scope_id).unwrap();
let left = &left[0];
In Line 1 we extract the left-hand side of the operation from the left child of the node. In Line 2, we pass the extracted node to the left_or_right
function
which will do the hard work of compiling and returning appropriate instructions. We do the same process for the right-hand side of the operation.
let right = self.right_child.as_ref().unwrap();
let right = left_or_right(right, worker.clone(), scope, scope_id).unwrap();
let right = &right[0];
Once we have the instructions, we need to generate another vector of instructions that represents the intent of adding two values. The instructions will look like this:
[opcode terminal left terminal right terminal ]
operator separator operand separator operand tail separator
opcode
Notice that these instructions do not have a scope or scope id. Why? This is because the left and right operands already have the scope in their respective instructions.
let wrk = worker.borrow();
let instructions = wrk.add_infix(OpCode::OpAdd, left.to_vec(), right.to_vec());
Some(vec![instructions])
In Line 1 we get a copy of the work and then call the add_infix
function from the worker instance as seen in Line 2. Notice for the examples, the appropriate opcode is OpCode::OpAdd
. Other operators have their respective opcodes that point to their operation intention. Let’s see what the add_infix
function looks like:
pub fn add_infix(&self, code: OpCode, left: Vec<u8>, right: Vec<u8>) -> Vec<u8> {
let mut instructions = self.emit_opcode(code);
for l in left.iter() {
instructions.push(*l)
}
instructions.push(OpCode::OpNull as u8);
for r in right.iter() {
instructions.push(*r)
}
// we don't add the instructions into the CompileWorker instructions vector.
// Typically these instructions are RETURNED or PRINTED, hence they will be consumed by
// those OpCodes
instructions
}
The logic is quite simple. In Line 2 we initialize the instructions vector with the given opcode. After that we iterate over the operands appending each individual byte into the instruction vector. In the last line, we don’t add the generate instructions to the instruction vector of the CompileWorker
. While that would be ideal, it does not suit for purpose for the use case. The generated instructions are passed to the assignment or return statement. Consider this:
let sum @int = 10 + 30;
return sum;
In the Line 1, the instructions generated from the arithemetic operations, belong to the variable sum
. Hence we add sum into the symbols table. In Line 2 the return intent references the sum
instructions which already exist in the symbols tables. This demonstrate that adding instructions from an arithemetic operation have no place in the CompileWorker
’s instructions vector.
Compiling conditional operators involve a similar approach to compiling arithmetic operations. Conditional operators have a left and right-hand side.
That means we compile each side the generate the final instruction that will be sent back to the assignment operator. First, we intercept the appropriate
nodes, which can be LT
, GT
, EQ
, NOTEQ
, GTOREQ
, LTOREQ
,AND
,OR
, LAND
, LOR
or BANG
.
Let’s take LT
as our working example.
IdentifierKind::LT => {
todo!("")
}
Next, we extract the left and right operands of the node. We do this by accessing the left_child
and right_child
attributes of the node.
let left = self.left_child.as_ref().unwrap();
let left = left_or_right(left, worker.clone(), scope.clone(), scope_id).unwrap();
let left = &left[0];
let right = self.right_child.as_ref().unwrap();
let right = left_or_right(right, worker.clone(), scope, scope_id).unwrap();
let right = &right[0];
In Line 1 and Line 4, we retrieve the left and right operands from the respective attributes. In Line 2 and Line 6 we pass the left and right operands
to the left_or_right
function that compiles and returns the appropriate instructions for each operand.
let wrk = worker.borrow();
let instructions = wrk.add_infix(OpCode::OpLessThan, left.to_vec(), right.to_vec());
Some(vec![instructions])
In Line 1 we get a copy of the CompileWorker
and call it’s add_infix
function in Line 2. Finally we return the generated instructions so that they can be part of an assignment or a return statement. With that, the compiler is capable to handle conditional operators.
Function definitions provide an interesting pedigree for problem set. This is so because functions come in many different levels of complexity. Scope variables, arithemtic operations, condition operations, return statements and even recursion. The trick is that each problem segment is handled in isolation that composed into one big solution set. Let’s see how this is done.
IdentifierKind::FUNCTION => {
let empty_params = Vec::new();
let func_params = match self.func_params.as_ref() {
Some(p) => p,
None => &empty_params,
};
let empty_function_err: Result<(), KarisError> = Err(KarisError {
error_type: KarisErrorType::MissingFunctionBody,
message: "Function body is empty. Nothing to run".to_string(),
});
if self.block_children.is_none() {
eprintln!("{:?}", empty_function_err);
process::exit(0x0200)
}
if self.block_children.as_ref().unwrap().is_empty() {
eprintln!("{:?}", empty_function_err);
process::exit(0x0200)
}
}
In Line 2 we instantiate an empty vector. This vector will stand-in if params are not found in the function definition as seen from Line 3 to Line 5. This is so because not all functions have parameters in their definition. The next step is to validate that the function definition containes no errors in it’s syntax. From Line 8 to Line 11 we define a generic error that will be reused if an error is encountered. The first check is to assert whether the function has any children defined in it’s block. In Line 12 we do that. If the function has no children, if means the definition is incorrect, therefore we retain an error. Additionally, we check if the vector containing the function’s children is empty as well. While the possibility of this happening is slim, we employ defensive programming techniques just to guard our paranoia as seen in Line 17. Let’s add more logic.
let mut function_instructions = Vec::new();
let wrk = worker.borrow();
let func_scope_id = wrk.generate_scope_id_from_scope(SymbolScope::Local);
In Line 1 we create a new vector. This vector will hold instructions that represent the entire function. In Line 3 we generate the scope id of the function. This scope id will be passed to any variables that may be defined within the function body.
for param in func_params {
// check that literals are not in the function definition. It's an invalid syntax
if param.is_left() {
panic!("Invalid parameter type: {:?}", param)
}
// param will always be a `Objects` (RIGHT of Either)
7let obj = param.clone().right().as_ref().unwrap().clone();
if let Some(instructions) = function_or_caller_object_param_access_instructions(
&obj, worker.clone(), SymbolScope::Local, scope_id, OpCode::OpGetFunctionParameter, ) {
function_instructions.push(instructions);
}
}
In the next step, we iterate over any parameters in the function definition. We do this so that the compiler can figure out how to access the parameters.
In Line 3, if by any chance the parameter is a literal, we crash the program because it’s any error we can not recover from. From Line 7 to Line 11 we retrieve the actual parameter the generate instructions to access them from the compiler. We then add this instruction to the function_instructions
vector defined earlier.
function_instructions.push(vec![OpCode::OpNull as u8]);
let block_children = self.block_children.as_ref().unwrap();
for body_item in block_children.iter() {
let inst = body_item.compile(worker.clone(), SymbolScope::Local, func_scope_id);
if let Some(all_insts) = inst {
for i in all_insts.iter() {
function_instructions.push(i.clone())
}
}
}
In Line 1 we add a separator instruction to the function_instructions
vector. We do this so that the virtual machine can be aware of the boundary separating
parameters with actual functional instructions. From Line 3 to Line 10 we iterate over the function body items, compile them individually then add their respective instructions to the function_instructions
vector. With that, we have introduced the capability of compiling functions in the compiler.
Function calls are another interesting problem set. There are a few things we need to take into account.
- Is the function present in the global scope?
- If the call requires arguments, where are the arguments defined?
- What is the calling scope?
Let’s put these concerns into context. Consider the following example snippets:
function sum1() {
return 10 + 20;
}
function sum2(x, y) {
return x + y;
}
function sum3() {
const fn = function (x, y) {
return x + y;
};
fn(10, 20);
}
In the function sum1
, no arguments are passed to the function rather it returns the sum of values not associated with a variable. The function sum2
on the other hand takes a set of arguments. The question is, how do we access the values behind the variables? In the function sum3
, an inner function is defined that takes arguments and sums them. In this instance how should the inner be compiled? How will it be added to the symbols table? Fortunately for us, scenario three is not supported because in Karis closures are not supported.
Compilation implementation
Let’s see how we can make it work in Karis.
IdentifierKind::CALLER => {
if scope.is_global() {
eprintln!("Incorrect function call. A function should only be called inside a function block or main block");
process::exit(0x0100)
}
}
After we initialize the inceptor, we first check if the scope is global or its function scope. In Karis, a function can only be called inside the main block or inside a function block as in the case of recursive functions.
let func_name = self.variable_name.as_ref().unwrap();
let func_name_as_bytes = func_name.as_bytes().to_vec();
In Line 1 we extract the variable name from the current node. This variable name points to the name of the function that will be called. We then convert this variable name which is in a string format to a vector of bytes. Recall keys in the symbol table are a form of a vector of bytes. This vector of bytes will be used to retrieve the actual function definition.
if let Some(_function_symbol) = wrk.get_symbol(func_name_as_bytes.clone()) {
let mut caller_instructions = Vec::new();
let wrk = worker.borrow();
4 let caller_scope_id = wrk.generate_scope_id_from_scope(SymbolScope::Local);
let caller_def = wrk.instructions_for_caller(scope.clone(),caller_scope_id,func_name_as_bytes);
// the first set of instructions will point to the actual function that will be called
7 caller_instructions.push(caller_def);
// subsequent instructions will point to the call parameters if the function requires any.
9 if let Some(caller_params) = &self.call_params {
for param in caller_params.iter() {
call_parameter_map(param,worker.clone(),&mut caller_instructions,
scope.clone(),caller_scope_id,OpCode::OpGetCallerParameter,);
}
}
Some(caller_instructions)
}
In Line 1 we use the vector of bytes to see if it exists in the symbols table. In Line 4 we generate the scope id of the function caller. This will later be utilized to identify function callers. We then generate the first caller instruction from the caller scope id which we push into the caller_instructions
vector. In Line 9 we check if the caller node has parameters. If so, for each parameter we map how it can be accessed. With that, we are set to compile function calls.
Karis supports arrays as the composite type. As such, compiling an array requires a different way of thinking. Let’s break down how we will approach compiling an array. Consider the following example array:
let nums [ @int ] = [ 1, 2, 3, 4 ];
The first thing that we notice is that the values of the array are of the same type. That is, if the array is defined to contain integers, it can not contain any other types. The parser should catch such a violation. The second thing that we notice is that the values of the array are primitives. That is we don’t expect a function call or another composite type to be part of the array. With this information, our approach will be simple, robust and flexible to adjustment. This is how we approach it:
- We create a vector array that will store the instructions of the array
- We iterate over each element of the array, generating instructions on how to access the element. Since the values are constants, the
OpContant
opcode is going to feature prominently - Since the array is bound to a variable, we generate a random name for the array and use it as the key in the symbol table. We then add the array instructions to the symbol table
- Using the random name, we generate instructions of opcode
OpArray
that tell the virtual how to access the actual array and its values.
Compilation of Arrays implementation
Let’s now implement the procedure:
IdentifierKind::ARRAY => {
let mut array_items_instructions = Vec::new();
let array_items = self.block_children.as_ref().unwrap();
}
As usual, we begin by intercepting the ARRAY
node as seen in Line 1. In Line 2 we instantiate an empty vector that will store instructions for the array. we follow this up by retrieving the children of the array from the block_children
field of the node.
for item in array_items.iter() {
let item: Either<LiteralObjects, Box<Objects>> = Right(Box::new(item.clone()));
let item_as_instruction = left_or_right(&item, worker.clone(), scope.clone(), scope_id).unwrap();
// since we are sure array items will only be literals, we pick the first from the instructions vector
let item_as_instruction = item_as_instruction.first().unwrap();
let item_as_instruction = item_as_instruction.clone();
array_items_instructions.push(item_as_instruction);
}
We then iterate over each array item and pass them to the left_or_right
function that calls their corresponding compile
implementation. This call returns instructions for the array item which we append to the array_items_instructions
vector as seen in Line 7.
let binding_key = random_string_id();
let binding_key_as_bytes = binding_key.as_bytes().to_vec();
let wrk = worker.borrow();
wrk.add_symbol(binding_key_as_bytes.clone(),array_items_instructions.clone());
In Line 1 we utilize the random_string_id
function to generate a unique random name for the array. The name is composed of only integers. Then we add the array instructions to the symbol table using the generated name as the key.
let array_instructions = wrk.instructions_for_array(scope, scope_id, binding_key_as_bytes);
We the array instructions in the symbols table, the next step is to generate another vector of instructions that will now be returned to the caller. The caller in this context is the ASSIGN
node. We use the instructions_for_array
function to generate these instructions. Let’s implement it.
pub fn instructions_for_array(&self,scope: SymbolScope, scope_id: [u8; 2], binding_name: Vec<u8>,) -> Vec<u8> {
self.emit_opcode_for_builtin(OpCode::OpArray, scope, scope_id, binding_name)
}
fn emit_opcode_for_builtin( &self, code: OpCode, scope: SymbolScope, scope_id: [u8; 2], binding_name: Vec<u8>,) -> Vec<u8> {
// [opcode scope scopeid terminal binding_name terminal]
let mut instructions = vec![code as u8, scope as u8];
// add scope id
for id in scope_id.iter() {
instructions.push(*id);
}
// add terminal
instructions.push(OpCode::OpTerminal as u8);
// add binding_name
for u in binding_name.iter() {
instructions.push(*u);
}
// add tail terminal
instructions.push(OpCode::OpTerminal as u8);
instructions
}
The emit_opcode_for_builtin
function in Line 5 does the actual work of generating the instructions. We expect the instructions to be of the form:
[ OpCode Scope Scope Id Terminal Binding Name Terminal]
As you can see the logic is quite straightforward. All we do is set the appropriate opcode, iterate over the binding name bytes then return the vector of instructions.
When the instructions are returned to the ASSIGN
node, it will add another entry to the symbol table.
Compilation of return statements
Return statements are not as involved compared to array types. This is so because return statements only point to other instructions that have already been generated. Therefore, in this context, we only set the appropriate opcode and our work is done.
IdentifierKind::RETURN => {
let right = self.right_child.as_ref().unwrap();
let instructions = left_or_right(right, worker.clone(), scope.clone(), scope_id).unwrap();
let wrk = worker.borrow();
let insts = wrk.instructions_for_return(scope, scope_id, instructions);
Some(vec![insts])
}
In Line 3 is where actual instructions for the RETURN
node child are generated. The child can be as simple a variable and as complex as chained arithemetic operation. The generated instructions in Line 5 will be of the form:
[ OpCode Scope Scope Id Terminal Instructions to return Terminal]
With that, we are set to compile arrays and return statements.