October 5, 2022

Crafting a compiler in Rust : Chapter 2

Have you ever wondered how programming languages like Python, Java or Ruby work? Do you have an interest in building your own language? If so, this series of posts is for you. We dive into building a novel programing language from first principles. Get ready to have your hands dirty. It’s going to be a fun ride. Access the final product here : https://github.com/daviddexter/karis-programming-language

In this chapter you will get introduced to the core pillars of virtual machine construction and the step-by-step process we will undertake to craft our compiler. Ensure you have a working installation for Rust.

Why build a virtual machine?

Portability

The output of every compiler is an asset that the computer will execute. This asset depends on some factors, for instance, the architecture of the target machine and the operating system the compiler is running on. Let’s demonstrate this with simple program

fn main() {
    println!("my program is running");
}

When we compile the program by running the command cargo build, it deposits as executable asset to the target/debug directory. The program will ran because the generated executable asset targeted the architecture in which the compiler resides.

The question is, what is the machine’s architecture on which the Rust compiler is currently running?

We can get this information using Linux utilities lscpu and uname.

$ lscpu | grep -i architecture
$ uname -a

The output states that the architecture of the machine that the Rust compiler is currently running on is of type x86_64 meaning our program can only run on Linux-powered 32-bit and 64-bit machines. It will fail if we run it on perhaps a Windows or macOS machine. To produce executable assets that can run on either a Windows or macOS machine, we need to rerun the compiler; this time, we state which machine architecture we will target. To accomplish this with the Rust compiler, we’ll add a target for the target architecture.

Try the following command to add a Windows target:

$ rustup target add x86_64-pc-windows-gnu

By running the following command, let’s verify that the target was correctly installed.

$ rustup target list | grep -i x86_64-pc-windows-gnu

The command asserts the new target is available, and we now have the tools to compile for a Windows architecture. To build an executed for Windows, we therefore need to specify the target in the build command.

$ cargo build --target=x86_64-pc-windows-gnu

:frowning: That was unexpected!!!

Well!!! Not entirely. The error is telling us that we need to install a few more tools on our machine so that we can be able to compile for a Windows architecture. Examining, we see that a linker is missing. What is a linker? A linker is a low-level tool that combines the generated object codes from the compilation step into one file. It’s the one responsible for ensuring that after compilation is complete, we have only one executable asset. Let’s set up the missing tools and recompile which should now pass:

$ apt-get install mingw-w64 mingw-w64-common mingw-w64-i686-dev mingw-w64-tools mingw-w64-x86-64-dev
$ cargo build --target=x86_64-pc-windows-gnu

This exercise demonstrates that compiling to single executable yields architecture-specific assets.If we need to compile for a different architecture, we have to rerun the compilation step, this time targeting the desired architecture. This step of verbose compilation speaks to the value of a virtual machine. With a virtual machine, we only need to compile the program, and we will be guaranteed to will everywhere our virtual machine exists. Essentially, our programs will be portable.

Custom target language

In our definition of a compiler, we stated that a compiler translates a program to a target language. A target language is a language a computer can natively understand and execute without intermediate translation. But what if we don’t want to produce a machine or assembly code? What if we’re going to target a language that doesn’t exist? A custom target language of somewhat. Virtual machines come into play here. Since a virtual machine is built to handle a specific target language, we can use it to support our purposes.However, it’s for this reason that virtual machines are not universal. For instance, a virtual machine for Ruby would not work with Java.

The road ahead

Construction workflow

A typical compiler derives its structure from three simple observations:

  1. It has a frontend to deal with the source language. It reads the source text and asserts that the source follows a set of governing rules.

  2. It has a backend that generates output for that target language.

  3. It has an intermediate form connecting the frontend and backend, representing a formal structure of the program that is independent of either the source language or the target language.

Above is a conceptual summary of what we seek to achieve.

Lexer

The lexer will deconstruct the source text into atomic constructs that are valid language units. Atomic constructs are the minor units of measure of a language. For instance, 1, ., ;, !, or ==. The lexer will signal an error when it encounters an unexpected construct.

Parser

The parser will walk through all the constructs and build a representation of our program. It will also add an extra layer of validation to constructs. Specifically, it will seek to answer the question:

Are the constructs arranged in the order expected by the language?

Evaluator

The evaluator will ingest output from the parser to derive meaning; for example, given a program with the structure, as below, what is the resulting work?

let one = 1;
let two = 2;

let sum = one + two;

Compiler and Virtual Machine

The compiler will translate out representation of the program into a different form that the virtual machine can execute. For our purposes, we will build both the compiler and the virtual machine at the same time. The reason for this is that they both depend on each other. We can’t make a compiler without a virtual machine that will work on its output. At the same time, we can’t build a virtual machine without a compiler to feed it.

What to expect at the end

At the end we should have a simple programming language that can do basic arithmetic and logic operations. We will write a simple program in the Karis language and run it, similar to what we would do with a language like Rust. While we target beginners in this exploration, a little knowledge of Rust programming language would help focus attention on compiler construction rather than Rust itself.

The Karis programming language

Setting the stage

Karis programming language is the product of our effort to construct a compiler. As such, we need to understand how the language will be structured so that we can design a compiler that will execute programs written in Karis programming language.

Note: The programs below only demonstrate how to write valid Karis programs. We do not need to run them directly since we don’t have the compiler yet.

Variables

Similar to contemporary programming languages, Karis stores values in variables. These are named locations where the compiler will refer to get the actual value. Karis is a statically typed language. It needs to know the data type to store values in memory. This information will guide the compiler in its operations. The let keyword defines every variable.

let name @string = "Karis";

The code above defines a variable name of type String. Variables in Karis are, by default, mutable to fit our purpose of learning.

// original
let name @string = "Karis";

// overwrite the previous value
let name @string = "Karis language";

Data types

Every value in Karis has a type. This type tells Karis what kind of data is under observation, so it knows how to work with that data. Karis intentionally supports a small subset of data types to make it approachable from a learning perspective. As such, Karis supports scalar types and only one compound type. Let’s explore this further.

Scalar types

We represent single values as a scalar. This concept is similar to other programming languages.

integer type

An integer is a number without a decimal part. In Karis, we annotate integers with the @int symbol.

let age @int = 20;

To limit the complexity associated with number values, Karis does not have floating-point values.

// this is invalid
let price @int = 2.99;

Karis makes no explicit distinction between signed and unsigned integer values. Let’s compare it with Rust. Signing tells the language whether an integer value lays how much further left of the cartesian plane. For instance, integer signing is explicit in Rust.

// unsigned integer
let age i32 = 10;

// signed integer
let difference u32 = -1000;

Karis is pretty relaxed in this aspect. We define a negative integer value and a positive integer value in the same way.

let age @int = 20;

let difference @int = -1000;
Booleans

A boolean is a value that tells how truthy it is. Only two values are possible, true or false. In Karis, we annotate booleans with the @bool symbol.

let isvalid @bool = true;

let isactive @bool = false;
Strings

Strings are complicated data types that deserve more credit than we give them. Each programming language has its unique mechanism for handling strings. However, popular programming languages take strings by representing them as an array of byte-size Unicode characters. Karis takes the same approach. Strings are annotated with the @string symbol and enclosed within double quotes only.

// valid
let name @string = "Karis";

// invalid
let name @string = 'Karis';
Compound types
Arrays

Arrays store a variable number of typed items in a collection. These items can be retrieved using unique indexes that each object possesses. Karis does not have an annotation for arrays. Instead, its contents must be annotated with the type to be stored. Arrays in Karis do not support storing values of multiple types.

// valid
let ages [@int ] = [1,2,3,4];

// invalid
let items [@int ] = [1,true,"karis" ];

Like in other languages, arrays in Karis are changeable using either push or pop. We append new items at the end of an array using the push method. pop method removes an item from the end of it.

Arithmtic operators

Karis has arithmetic operators like other programming languages for basic mathematical operations. Karis has five arithmetic operators;

  1. Addition
  2. Subtraction
  3. Multiplication
  4. Division
  5. Modulus

Note: The programs below only demonstrate how to write valid Karis programs. We do not need to run them directly since we don’t have the compiler yet.

Addition operator

In Karis, the + symbol is the operator that defines the intent to add to values on either side of the operator.

let x @int = 7;

let y @int = 5;

let result @int = x + y;

print(result); // result is 12

In comparison to Python, the Karis language does not support binary addition.

x = 7
y = 5
result = x ++ y
#        # ^ this is invalid in Karis. It will throw a syntax error
print(result)

Subtraction operator

In Karis, the - symbol is the operator that defines the intent to subtract a right-hand side value from the left-hand side value.

let x @int = 7;

let y @int = 5;

let result @int = x - y;

print(result); // result is 2

Similar to binary addition, Karis language does not support binary subtraction.

x = 7
y = 5
result = x -- y
#         # ^ this is invalid in Karis. It will throw a syntax error
print(result)

Multiplication operator

In Karis, the * symbol states the intent to find the product of values on either side of the operator.

let x @int = 7;

let y @int = 5;

let result @int = x * y;

print(result); // result is 35

What about **? Karis language does not support such kind of syntax. Compared to Python, where ** indicates finding an exponent of a number, Karis will return an error.

x = 7
y = 5
result = x ** y
#         # ^ this is invalid in Karis. It will throw a syntax error
print(result)

Division operator

In Karis, the / symbol is the operator that defines the intent to split the left-hand value into equal parts of the right-hand side value.

let x @int = 35;

let y @int = 5;

let result @int = x / y;

print(result); // result is 7

It’s important to recall that Karis intentionally limits its number representation to integers values only. Let us use Python to see the difference. When we translate the above to Python, we see that a division in Python runs on a float, not an int.

x = 35 # <- this is an int `<class 'int'>`
y = 5 # <- this is an int `<class 'int'>`
result = x / y; # <- returns a float instead of int
print(result)

If we want to run the division over <class 'int'>, we have to use the operator //.

result = 35 // 5
print(result)

In Karis, we don’t care about float or int division. A division in Karis is performed on a number whose internal representation is an int. Hence, // is treated as invalid.

Modulus operator

In Karis, the % symbol is the operator that defines the intent to find the remainder when the left-hand side value and right-hand-side value after a division.

let x @int = 35;

let y @int = 5;

let result @int = x % y;

print(result); // result is 0

Operator precedence

Each arithmetic operator has precedence. This precedence indicates if the operator can is invokable before other operators. For instance, a multiplication operator (*) has a higher priority than an addition operator (+). Remember BODMAS from math!!!

Arithmetic operations can be grouped with parentheses to have a chain of operations. Similarly, grouped operations have higher precedence than standalone operators.

let x @int = 10;

let y @int = 5;

let z @int = 2;

let result @int = (x * y) - (y * z);

print(result); // result is 40

Operator precedence will become of value when we start building our parser, whose core premise is figuring out how to structure operators with different levels of operations importance.

Comparison operators

Similar to other modern programming languages, Karis supports familiar comparison operators that work on integers.

| Notation | Meaning | | > | Greater than | | < | Lesser than | | >= | Greater than or equal to | | <= | Less than or equal to |

Note: The programs below only demonstrate how to write valid Karis programs. We do not need to run them directly since we don’t have the compiler yet.

Greater than operator

The > operator returns true if its left-hand side is more significant than its right-hand side, false otherwise.


let ten @int = 10
let five @int = 5;

let result @bool = ten > five;

print(result);  // prints true

Less than operator

The < operator returns true if its left-hand side is less than its right-hand side, false otherwise.

let ten @int = 10
let five @int = 5;

let result @bool = ten < five;

print(result);  // prints false

Greater than or equal to operator

The >= operator returns true if its left-hand side is greater than or equivalent to its right-hand side, false otherwise.


let ten_1 @int = 10
let ten_2 @int = 10;

let result @bool = ten_1 >= ten_2;

print(result);  // prints true

Less than or equal to operator

The <= operator returns true if its left-hand side is less than or equivalent to its right-hand side, false otherwise.

let ten_1 @int = 10;
let ten_2 @int = 10;

let result @bool = ten_1 <= ten_2;

print(result);  // prints true

Bitwise operators

Like other programming languages, bitwise operators in Karis work on boolean values or expressions. A boolean expression produces a boolean value after evaluation. Consider:


let value @bool = true;     // this is a boolean value
let exp @bool = 10 > 5;     // this is a boolean expression

Negation operator

The ! operator produces the opposite of a value or expression.


let active @bool = true;
let not_active @bool = !active;

print(not_active);  // prints false

Bitwise AND operator

The & computes the logical AND by evaluating both sides of the operator. If both sides evaluate to be true, then the operator returns true, and false otherwise.


let x @bool = true;
let y @bool = false;
let z @bool = true;
let result1 @bool = x & y;
let result2 @bool = x & z;

print(result1);  // prints false
print(result2);  // prints true

Bitwise OR operator

The | computes the logical OR by evaluating both sides of the operator. If either side evaluates as true, then the operator returns true and false otherwise.


let x @bool = true;
let y @bool = false;
let z @bool = false;
let result1 @bool = x | y;
let result2 @bool = x | z;

print(result1);  // prints true
print(result2);  // prints false

Overview of conditional operators

Conditional operators evaluate the operator’s right-hand side only when necessary. These operators short-circuit the bitwise operations of their bitwise counterparts.

AND operator

Given the expression x && y, the operator evaluates the left-hand side first. If it’s true, it considers the right-hand side and returns the result. If it’s false, the right-hand side is skipped.

let x @bool = true;
let y @bool = false;
let z @bool = true;
let result1 @bool = x && y;
let result2 @bool = x && z;

print(result1);  // prints false
print(result2);  // prints true

OR operator

Given the expression x || y, the operator evaluates the left-hand side first. If it’s false, it considers the right-hand side and then returns the result. If it’s true, the right-hand side is skipped.

let x @bool = true;
let y @bool = false;
let z @bool = true;
let result1 @bool = x || y;
let result2 @bool = x || z;

print(result1);  // prints true
print(result2);  // prints true

Functions

The primary means of passing data around in the Karis language is via functions. Karis allows us to specify the types of the input and output values of functions. We evaluate functions as expressions in Karis. That is, functions must produce a value after their internal operations.

Declaration

Functions in Karis are, by default, not top-level. They must be associated with a let binding to be valid. The binding also gives the function its name, which must be unique across the entire program.

let great @unit = fn(){
               // ^ keyword that indicates the start of a function definition
    print("Hello");
};

Arguments type annotation

Similar to other programming languages, Karis functions accept zero or one to multiple arguments. Since Karis is a strongly typed language, it expects annotated arguments to extract their respective typing information. When we declare a function, we add type annotations after each argument to state what types of parameters the function accepts. Arguments type annotations go after the argument name. Consider an example.

let add @int = fn(x @int, y @int){
    return x + y;
};

In Line 1, the function add has two arguments: x and y, each of type @int. We call the function by assigning it to a variable or discarding the returned value.

let sum @int = add(1,2); // the result is bound to sum variable
add(1,2);   // the result is not bound to any variable hence it is discarded

The compiler checks the function’s arguments and their respective type annotations before doing its operation to assert operation validity.

add(true,2);
  // ^ Argument of type '@bool' does not match parameter of type '@int'.

Optional arguments

The nature of Karis as a strongly typed language has no affordance for optional arguments in function declarations. While legitimate cases exist where such flexibility is good for developer experience, Karis intentionally opts out of supporting it. The compiler discards any unused function arguments.

Return Type Annotations

In typical scenarios, we should bind the result of the function to a variable. Return type annotations appear after the binding name of the function.

let add @int = fn(x @int, y @int){
      // ^ explicit type annotation of the result of a function
    return x + y;
};

Sometimes, however, we want to define functions that do not return anything useful. Maybe we are doing some work and printing the result to the standard output. Or we call another function that does not define its return type annotation. In such a case, the idiomatic way in Karis language is to utilize the @unit return type. It’s a particular type that tells the compiler to ignore the type of the result value from the operation. Let’s look at an example.

let great @unit = fn(){
       //  ^ Type annotation of the result
    print("Hello, world!"); // prints "Hello, world!"
};

Anonymous Functions

Unlike languages like JavaScript or Golang, Karis does not allow anonymous functions. Such will be treated as an illegal declaration by the compiler.

BAD

fn(){
   print("Hello");
}

The exception to this requirement is the main function which serves as the entry point for the program. The fn keyword indicates the start of a function definition.

Overview of control flows

At the heart of valuable programs is to make decisions based on input. Programs written in any language are not linear. There is a need for a mechanism to modify the control flow using familiar idioms like if/else, while and others. Let’s talk about them in Karis. To limit the scope of this course to learning, we limit Karis to support only one control flow, if/else. Let’s discuss it further.

if/else statement

If/else flow in Karis is similar to other programming languages. We feed it the input and expect a specific procedure to be run if the information matches a particular condition. Otherwise, we branch to the following process. Karis borrows the simplicity of Rust and Go’s conditional syntax of not needing the boolean condition to be enclosed in parentheses, making the syntax less verbose and a joy to read.

let max @int = fn(x @int, y @int){
    if x > y{
        return x;
    }else{
        return y;
    };
};

In Line 2, the if keyword initializes a condition check. x > y between the keyword if and { character is the actual condition that needs evaluation. Condition expressions are usually evaluated to boolean values. In this case, we check whether x is more significant than y, where x and y are variables representing values passed when calling the function. Notice in Line 4, the else keyword initializes the alternative if the condition above fails to equate to true. Precisely what we are used to from other programming languages. Karis borrows another pattern from Go of not requiring an else branch if the if branch ends with a return statement. Let’s rewrite the above to demonstrate this alternative variant.

let max @int = fn(x @int, y @int){
    if x > y{
        return x;
    };
     ^ Closes the branch

    return y;
};

Both variants have the same semantics meaning.

The question is, why does Karis allow two ways of achieving the same thing?

The answer lies in how Karis treats the if/else combination. The if/else is evaluated as a two-piece expression. One for the if branch and the other for the else branch giving us the flexibility to split and join the pieces using an implicit conditional OR operator.

Refer to the OR operator from the lesson “Comparison, Bitwise and Conditional Operators” for further context.

Consider a scenario where multiple if/else are chained, like below;

let message @unit = fn(age @int){
    if age <= 10{
        print("Hi junior")
    }else if age > 10 && < 20 {    <- inner if branch
        print("What's up!")
    }else{
        print("Hello sir")
    }
}

Line 2, Line 4, Line 6 are reduced to:

(if age <= 10) || ( (if age > 10 && < 20) || print("Hello sir") )
                    --------------------     -----------------
                         Second IF          Alternate of second IF

--------------    ----------------------------------------------
  First IF                    Alternate of first IF

The choice of what to employ in our program is a syntactic preference.

Unsupported features in Karis language

Karis language is a small programming language compared to mainstream programming language. As such, it has a small set of features to fit our purpose of learning rather than implementing a full-featured programming language.

String concatenation

String concatenation is the intent to join two separate string literals together. In Python, we accomplish string concatenation using the + symbol.

x = "French"
y = "fries"
z = x + " " + y

print(z)

Karis language does not support this since we only intend to use the + symbol for arithmetic operations.

Code comments

Modern programming languages come with mechanisms for developers to annotate documentation in the source code. These annotations come in varied forms, such as:

// this is a comment

/**

A multi-line comment
**/

An example of comments in Python.

## This is a comment

Karis language does not support commenting using any of the above forms.

Decimal data type

We limit Karis language to only support integers with no decimal parts freeing us from worrying about arithmetic precision and overflow issues.

Multi-file source code

Modern programming languages allow developers to split source code into multiple files. Such is helpful, especially when containerized logic flow. For example, Rust uses a module system to support multi-file source code. Go, on the other hand, has the concept of packages. In Karis language, all logic is in one source file. Implying we won’t support libraries or packages contrary to other modern programming languages.

Why have a limited feature set?

Let’s outline our justification for the limited set of features.

Reduce the lexical and parser space

The lexer and parser work together to determine whether the source code has valid constructs. Whenever we add something new, like a new keyword or a character that has special meaning to the language, both lexer and parser must be aware of making our lexer and the parser is overly complex for a relatively small language.

Reduce complexity in evaluating and compilation

Like the lexer and parser, each new addition to the language vocabulary has to be given meaning and represented correctly in memory. We, therefore, risk complicating the evaluator and compiler.

Limits our focus to compiler construction

The true purpose of this course is to learn about building compilers. Karis language is a necessary effect of that purpose. We therefore don’t need to dwell on having a full-fledged programming language.

Share

© Mwangi Kariuki 2019-2024