Have you ever wondered how programming languages like Python, Java or Ruby work? Do you have an interest in building your own language? If so, this series of posts is for you. We dive into building a novel programing language from first principles. Get ready to have your hands dirty. It’s going to be a fun ride. Access the final product here : https://github.com/daviddexter/karis-programming-language
In this chapter you will get introduced to the core pillars of virtual machine construction and the step-by-step process we will undertake to craft our compiler. Ensure you have a working installation for Rust.
Why build a virtual machine?
Portability
The output of every compiler is an asset that the computer will execute. This asset depends on some factors, for instance, the architecture of the target machine and the operating system the compiler is running on. Let’s demonstrate this with simple program
fn main() {
println!("my program is running");
}
When we compile the program by running the command cargo build
, it deposits as executable asset to the target/debug
directory.
The program will ran because the generated executable asset targeted the architecture in which the compiler resides.
The question is, what is the machine’s architecture on which the Rust compiler is currently running?
We can get this information using Linux utilities lscpu
and uname
.
$ lscpu | grep -i architecture
$ uname -a
The output states that the architecture of the machine that the Rust compiler is currently running on is of type x86_64
meaning our program can only run on Linux-powered 32-bit and 64-bit machines. It will fail if we run it on perhaps
a Windows or macOS machine. To produce executable assets that can run on either a Windows or macOS machine, we need to rerun
the compiler; this time, we state which machine architecture we will target. To accomplish this with the
Rust compiler, we’ll add a target for the target architecture.
Try the following command to add a Windows target:
$ rustup target add x86_64-pc-windows-gnu
By running the following command, let’s verify that the target was correctly installed.
$ rustup target list | grep -i x86_64-pc-windows-gnu
The command asserts the new target is available, and we now have the tools to compile for a Windows architecture. To build an executed for Windows, we therefore need to specify the target in the build command.
$ cargo build --target=x86_64-pc-windows-gnu
:frowning: That was unexpected!!!
Well!!! Not entirely. The error is telling us that we need to install a few more tools on our machine so that we can be able to compile
for a Windows architecture. Examining, we see that a linker
is missing. What is a linker? A linker is a low-level tool that combines
the generated object codes from the compilation step into one file. It’s the one responsible for ensuring that after compilation is complete,
we have only one executable asset. Let’s set up the missing tools and recompile which should now pass:
$ apt-get install mingw-w64 mingw-w64-common mingw-w64-i686-dev mingw-w64-tools mingw-w64-x86-64-dev
$ cargo build --target=x86_64-pc-windows-gnu
This exercise demonstrates that compiling to single executable yields architecture-specific assets.If we need to compile for a different architecture, we have to rerun the compilation step, this time targeting the desired architecture. This step of verbose compilation speaks to the value of a virtual machine. With a virtual machine, we only need to compile the program, and we will be guaranteed to will everywhere our virtual machine exists. Essentially, our programs will be portable.
Custom target language
In our definition of a compiler, we stated that a compiler translates a program to a target language. A target language is a language a computer can natively understand and execute without intermediate translation. But what if we don’t want to produce a machine or assembly code? What if we’re going to target a language that doesn’t exist? A custom target language of somewhat. Virtual machines come into play here. Since a virtual machine is built to handle a specific target language, we can use it to support our purposes.However, it’s for this reason that virtual machines are not universal. For instance, a virtual machine for Ruby would not work with Java.
The road ahead
Construction workflow
A typical compiler derives its structure from three simple observations:
It has a frontend to deal with the source language. It reads the source text and asserts that the source follows a set of governing rules.
It has a backend that generates output for that target language.
It has an intermediate form connecting the frontend and backend, representing a formal structure of the program that is independent of either the source language or the target language.
Above is a conceptual summary of what we seek to achieve.
Lexer
The lexer will deconstruct the source text into atomic constructs that are valid language units.
Atomic constructs are the minor units of measure of a language. For instance, 1
, .
, ;
, !
, or ==
.
The lexer will signal an error when it encounters an unexpected construct.
Parser
The parser will walk through all the constructs and build a representation of our program. It will also add an extra layer of validation to constructs. Specifically, it will seek to answer the question:
Are the constructs arranged in the order expected by the language?
Evaluator
The evaluator will ingest output from the parser to derive meaning; for example, given a program with the structure, as below, what is the resulting work?
let one = 1;
let two = 2;
let sum = one + two;
Compiler and Virtual Machine
The compiler will translate out representation of the program into a different form that the virtual machine can execute. For our purposes, we will build both the compiler and the virtual machine at the same time. The reason for this is that they both depend on each other. We can’t make a compiler without a virtual machine that will work on its output. At the same time, we can’t build a virtual machine without a compiler to feed it.
What to expect at the end
At the end we should have a simple programming language that can do basic arithmetic and logic operations. We will write a simple program in the Karis language and run it, similar to what we would do with a language like Rust. While we target beginners in this exploration, a little knowledge of Rust programming language would help focus attention on compiler construction rather than Rust itself.
The Karis programming language
Setting the stage
Karis programming language is the product of our effort to construct a compiler. As such, we need to understand how the language will be structured so that we can design a compiler that will execute programs written in Karis programming language.
Note: The programs below only demonstrate how to write valid Karis programs. We do not need to run them directly since we don’t have the compiler yet.
Variables
Similar to contemporary programming languages, Karis stores values in variables. These are named
locations where the compiler will refer to get the actual value. Karis is a statically typed language. It needs to know the data type to store values in memory. This information will guide the compiler in its operations. The let
keyword
defines every variable.
let name @string = "Karis";
The code above defines a variable name
of type String
. Variables in Karis are, by default, mutable to fit our purpose of learning.
// original
let name @string = "Karis";
// overwrite the previous value
let name @string = "Karis language";
Data types
Every value in Karis has a type. This type tells Karis what kind of data is under observation, so it knows how to work with that data. Karis intentionally supports a small subset of data types to make it approachable from a learning perspective. As such, Karis supports scalar types and only one compound type. Let’s explore this further.
Scalar types
We represent single values as a scalar. This concept is similar to other programming languages.
integer type
An integer is a number without a decimal part. In Karis, we annotate integers with the @int
symbol.
let age @int = 20;
To limit the complexity associated with number values, Karis does not have floating-point values.
// this is invalid
let price @int = 2.99;
Karis makes no explicit distinction between signed
and unsigned
integer values. Let’s compare it with Rust.
Signing tells the language whether an integer value lays how much further left of the cartesian plane. For instance, integer
signing is explicit in Rust.
// unsigned integer
let age i32 = 10;
// signed integer
let difference u32 = -1000;
Karis is pretty relaxed in this aspect. We define a negative integer value and a positive integer value in the same way.
let age @int = 20;
let difference @int = -1000;
Booleans
A boolean is a value that tells how truthy it is. Only two values are possible, true
or false
. In Karis, we annotate booleans with the @bool
symbol.
let isvalid @bool = true;
let isactive @bool = false;
Strings
Strings are complicated data types that deserve more credit than we give them. Each programming language has its unique mechanism for handling strings. However, popular programming languages take strings by representing them as an array of byte-size Unicode characters. Karis takes the same approach. Strings are annotated with the @string
symbol and enclosed within double quotes only.
// valid
let name @string = "Karis";
// invalid
let name @string = 'Karis';
Compound types
Arrays
Arrays store a variable number of typed items in a collection. These items can be retrieved using unique indexes that each object possesses. Karis does not have an annotation for arrays. Instead, its contents must be annotated with the type to be stored. Arrays in Karis do not support storing values of multiple types.
// valid
let ages [@int ] = [1,2,3,4];
// invalid
let items [@int ] = [1,true,"karis" ];
Like in other languages, arrays in Karis are changeable using either push
or pop
. We append new items at the end of an array using the push
method. pop
method removes an item from the end of it.
Arithmtic operators
Karis has arithmetic operators like other programming languages for basic mathematical operations. Karis has five arithmetic operators;
- Addition
- Subtraction
- Multiplication
- Division
- Modulus
Note: The programs below only demonstrate how to write valid Karis programs. We do not need to run them directly since we don’t have the compiler yet.
Addition operator
In Karis, the +
symbol is the operator that defines the intent to add to values on either side of the operator.
let x @int = 7;
let y @int = 5;
let result @int = x + y;
print(result); // result is 12
In comparison to Python, the Karis language does not support binary addition.
x = 7
y = 5
result = x ++ y
# # ^ this is invalid in Karis. It will throw a syntax error
print(result)
Subtraction operator
In Karis, the -
symbol is the operator that defines the intent to subtract a right-hand side value from the left-hand side value.
let x @int = 7;
let y @int = 5;
let result @int = x - y;
print(result); // result is 2
Similar to binary addition, Karis language does not support binary subtraction.
x = 7
y = 5
result = x -- y
# # ^ this is invalid in Karis. It will throw a syntax error
print(result)
Multiplication operator
In Karis, the *
symbol states the intent to find the product of values on either side of the operator.
let x @int = 7;
let y @int = 5;
let result @int = x * y;
print(result); // result is 35
What about **
? Karis language does not support such kind of syntax. Compared to Python, where **
indicates finding
an exponent of a number, Karis will return an error.
x = 7
y = 5
result = x ** y
# # ^ this is invalid in Karis. It will throw a syntax error
print(result)
Division operator
In Karis, the /
symbol is the operator that defines the intent to split the left-hand value into equal parts of the right-hand side value.
let x @int = 35;
let y @int = 5;
let result @int = x / y;
print(result); // result is 7
It’s important to recall that Karis intentionally limits its number representation to integers values only. Let us use Python to see the difference.
When we translate the above to Python, we see that a division in Python runs on a float
, not an int
.
x = 35 # <- this is an int `<class 'int'>`
y = 5 # <- this is an int `<class 'int'>`
result = x / y; # <- returns a float instead of int
print(result)
If we want to run the division over <class 'int'>
, we have to use the operator //
.
result = 35 // 5
print(result)
In Karis, we don’t care about float or int division. A division in Karis is performed on a number whose internal representation is an int
.
Hence, //
is treated as invalid.
Modulus operator
In Karis, the %
symbol is the operator that defines the intent to find the remainder when the left-hand side value and right-hand-side value after a division.
let x @int = 35;
let y @int = 5;
let result @int = x % y;
print(result); // result is 0
Operator precedence
Each arithmetic operator has precedence. This precedence indicates if the operator can is invokable before other operators.
For instance, a multiplication operator (*
) has a higher priority than an addition operator (+
). Remember BODMAS
from math!!!
Arithmetic operations can be grouped with parentheses to have a chain of operations. Similarly, grouped operations have higher precedence than standalone operators.
let x @int = 10;
let y @int = 5;
let z @int = 2;
let result @int = (x * y) - (y * z);
print(result); // result is 40
Operator precedence will become of value when we start building our parser, whose core premise is figuring out how to structure operators with different levels of operations importance.
Comparison operators
Similar to other modern programming languages, Karis supports familiar comparison operators that work on integers.
| Notation | Meaning | | > | Greater than | | < | Lesser than | | >= | Greater than or equal to | | <= | Less than or equal to |
Note: The programs below only demonstrate how to write valid Karis programs. We do not need to run them directly since we don’t have the compiler yet.
Greater than operator
The >
operator returns true
if its left-hand side is more significant than its right-hand side, false
otherwise.
let ten @int = 10
let five @int = 5;
let result @bool = ten > five;
print(result); // prints true
Less than operator
The <
operator returns true
if its left-hand side is less than its right-hand side, false
otherwise.
let ten @int = 10
let five @int = 5;
let result @bool = ten < five;
print(result); // prints false
Greater than or equal to operator
The >=
operator returns true
if its left-hand side is greater than or equivalent to its right-hand side, false
otherwise.
let ten_1 @int = 10
let ten_2 @int = 10;
let result @bool = ten_1 >= ten_2;
print(result); // prints true
Less than or equal to operator
The <=
operator returns true
if its left-hand side is less than or equivalent to its right-hand side, false
otherwise.
let ten_1 @int = 10;
let ten_2 @int = 10;
let result @bool = ten_1 <= ten_2;
print(result); // prints true
Bitwise operators
Like other programming languages, bitwise operators in Karis work on boolean
values or expressions.
A boolean expression produces a boolean value after evaluation. Consider:
let value @bool = true; // this is a boolean value
let exp @bool = 10 > 5; // this is a boolean expression
Negation operator
The !
operator produces the opposite of a value or expression.
let active @bool = true;
let not_active @bool = !active;
print(not_active); // prints false
Bitwise AND operator
The &
computes the logical AND by evaluating both sides of the operator. If both sides evaluate to be true, then
the operator returns true
, and false
otherwise.
let x @bool = true;
let y @bool = false;
let z @bool = true;
let result1 @bool = x & y;
let result2 @bool = x & z;
print(result1); // prints false
print(result2); // prints true
Bitwise OR operator
The |
computes the logical OR by evaluating both sides of the operator. If either side evaluates as true, then the operator
returns true
and false
otherwise.
let x @bool = true;
let y @bool = false;
let z @bool = false;
let result1 @bool = x | y;
let result2 @bool = x | z;
print(result1); // prints true
print(result2); // prints false
Overview of conditional operators
Conditional operators evaluate the operator’s right-hand side only when necessary. These operators short-circuit the bitwise operations of their bitwise counterparts.
AND operator
Given the expression x && y
, the operator evaluates the left-hand side first. If it’s true
, it considers the right-hand side
and returns the result. If it’s false
, the right-hand side is skipped.
let x @bool = true;
let y @bool = false;
let z @bool = true;
let result1 @bool = x && y;
let result2 @bool = x && z;
print(result1); // prints false
print(result2); // prints true
OR operator
Given the expression x || y
, the operator evaluates the left-hand side first. If it’s false
, it considers the right-hand side
and then returns the result. If it’s true
, the right-hand side is skipped.
let x @bool = true;
let y @bool = false;
let z @bool = true;
let result1 @bool = x || y;
let result2 @bool = x || z;
print(result1); // prints true
print(result2); // prints true
Functions
The primary means of passing data around in the Karis language is via functions. Karis allows us to specify the types of the input and output values of functions. We evaluate functions as expressions in Karis. That is, functions must produce a value after their internal operations.
Declaration
Functions in Karis are, by default, not top-level. They must be associated with a let
binding to be valid. The binding
also gives the function its name, which must be unique across the entire program.
let great @unit = fn(){
// ^ keyword that indicates the start of a function definition
print("Hello");
};
Arguments type annotation
Similar to other programming languages, Karis functions accept zero or one to multiple arguments. Since Karis is a strongly typed language, it expects annotated arguments to extract their respective typing information. When we declare a function, we add type annotations after each argument to state what types of parameters the function accepts. Arguments type annotations go after the argument name. Consider an example.
let add @int = fn(x @int, y @int){
return x + y;
};
In Line 1, the function add
has two arguments: x
and y
, each of type @int
. We call the function by assigning it to a variable or
discarding the returned value.
let sum @int = add(1,2); // the result is bound to sum variable
add(1,2); // the result is not bound to any variable hence it is discarded
The compiler checks the function’s arguments and their respective type annotations before doing its operation to assert operation validity.
add(true,2);
// ^ Argument of type '@bool' does not match parameter of type '@int'.
Optional arguments
The nature of Karis as a strongly typed language has no affordance for optional arguments in function declarations. While legitimate cases exist where such flexibility is good for developer experience, Karis intentionally opts out of supporting it. The compiler discards any unused function arguments.
Return Type Annotations
In typical scenarios, we should bind the result of the function to a variable. Return type annotations appear after the binding name of the function.
let add @int = fn(x @int, y @int){
// ^ explicit type annotation of the result of a function
return x + y;
};
Sometimes, however, we want to define functions that do not return anything useful. Maybe we are doing some work and printing the result
to the standard output. Or we call another function that does not define its return type annotation. In such a case, the idiomatic way in
Karis language is to utilize the @unit
return type. It’s a particular type that tells the compiler to ignore the type of the result value from the operation.
Let’s look at an example.
let great @unit = fn(){
// ^ Type annotation of the result
print("Hello, world!"); // prints "Hello, world!"
};
Anonymous Functions
Unlike languages like JavaScript or Golang, Karis does not allow anonymous functions. Such will be treated as an illegal declaration by the compiler.
BAD
fn(){
print("Hello");
}
The exception to this requirement is the main
function which serves as the entry point for the program.
The fn
keyword indicates the start of a function definition.
Overview of control flows
At the heart of valuable programs is to make decisions based on input. Programs written in any language are not linear.
There is a need for a mechanism to modify the control flow using familiar idioms like if/else
, while
and others.
Let’s talk about them in Karis. To limit the scope of this course to learning, we limit Karis to support only one
control flow, if/else
. Let’s discuss it further.
if/else
statement
If/else
flow in Karis is similar to other programming languages. We feed it the input and expect a specific procedure to
be run if the information matches a particular condition. Otherwise, we branch to the following process. Karis borrows the simplicity
of Rust and Go’s conditional syntax of not needing the boolean condition to be enclosed in parentheses, making the syntax
less verbose and a joy to read.
let max @int = fn(x @int, y @int){
if x > y{
return x;
}else{
return y;
};
};
In Line 2, the if
keyword initializes a condition check. x > y
between the keyword if
and { character is the actual condition
that needs evaluation. Condition expressions are usually evaluated to boolean values. In this case, we check whether x
is more
significant than y
, where x
and y
are variables representing values passed when calling the function. Notice in Line 4, the else
keyword initializes the alternative if the condition above fails to equate to true. Precisely what we are used to from other programming languages.
Karis borrows another pattern from Go of not requiring an else
branch if the if
branch ends with a return statement. Let’s
rewrite the above to demonstrate this alternative variant.
let max @int = fn(x @int, y @int){
if x > y{
return x;
};
^ Closes the branch
return y;
};
Both variants have the same semantics meaning.
The question is, why does Karis allow two ways of achieving the same thing?
The answer lies in how Karis treats the if/else
combination. The if/else
is evaluated as a two-piece expression. One for the if
branch and the other for the else
branch giving us the flexibility to split and join the pieces using an implicit conditional OR
operator.
Refer to the OR operator from the lesson “Comparison, Bitwise and Conditional Operators” for further context.
Consider a scenario where multiple if/else
are chained, like below;
let message @unit = fn(age @int){
if age <= 10{
print("Hi junior")
}else if age > 10 && < 20 { <- inner if branch
print("What's up!")
}else{
print("Hello sir")
}
}
Line 2, Line 4, Line 6 are reduced to:
(if age <= 10) || ( (if age > 10 && < 20) || print("Hello sir") )
-------------------- -----------------
Second IF Alternate of second IF
-------------- ----------------------------------------------
First IF Alternate of first IF
The choice of what to employ in our program is a syntactic preference.
Unsupported features in Karis language
Karis language is a small programming language compared to mainstream programming language. As such, it has a small set of features to fit our purpose of learning rather than implementing a full-featured programming language.
String concatenation
String concatenation is the intent to join two separate string literals together. In Python, we accomplish string concatenation
using the +
symbol.
x = "French"
y = "fries"
z = x + " " + y
print(z)
Karis language does not support this since we only intend to use the +
symbol for arithmetic operations.
Code comments
Modern programming languages come with mechanisms for developers to annotate documentation in the source code. These annotations come in varied forms, such as:
// this is a comment
/**
A multi-line comment
**/
An example of comments in Python.
## This is a comment
Karis language does not support commenting using any of the above forms.
Decimal data type
We limit Karis language to only support integers with no decimal parts freeing us from worrying about arithmetic precision and overflow issues.
Multi-file source code
Modern programming languages allow developers to split source code into multiple files. Such is helpful, especially when containerized logic flow. For example, Rust uses a module system to support multi-file source code. Go, on the other hand, has the concept of packages. In Karis language, all logic is in one source file. Implying we won’t support libraries or packages contrary to other modern programming languages.
Why have a limited feature set?
Let’s outline our justification for the limited set of features.
Reduce the lexical and parser space
The lexer and parser work together to determine whether the source code has valid constructs. Whenever we add something new, like a new keyword or a character that has special meaning to the language, both lexer and parser must be aware of making our lexer and the parser is overly complex for a relatively small language.
Reduce complexity in evaluating and compilation
Like the lexer and parser, each new addition to the language vocabulary has to be given meaning and represented correctly in memory. We, therefore, risk complicating the evaluator and compiler.
Limits our focus to compiler construction
The true purpose of this course is to learn about building compilers. Karis language is a necessary effect of that purpose. We therefore don’t need to dwell on having a full-fledged programming language.