Crafting a compiler in Rust : Chapter 1

Have you ever wondered how programming languages like Python, Java or Ruby work? Do you have an interest in building your own language? If so, this series of posts is for you. We dive into building a novel programing language from first principles. Get ready to have your hands dirty. It’s going to be a fun ride. Access the final product here : https://github.com/daviddexter/karis-programming-language

In this chapter, you will get a high-level overview of what compilers are, what they as used for and why build one from scratch. Additionally, you’ll acquire some fundamentals of virtual machines.

Introduction to compilers

As software developers, we read and write every day. Regardless of which high-level programming language we choose to employ, we use English constructs to describe intent.

For instance:

import sys

def hello(name):
  print("Hello "+ name)

if __name__ == "__main__":
  arg = sys.stdin.readlines()
  arg = [line.rstrip() for line in arg]
  hello(arg[0])

# Please type a name in the input field below

Consider the python program that expects a name, then prints a string concatenated with the provided name. There is no ambiguity in what is going on. We write the program in constructs similar to human language. We agree that software engineering involves translating complex steps into small steps written in human language constructs. A few questions arise:

“how does a computer understand what we have just written?”

“Is it that the computer understands how we write?”

“Can we write a program in some exotic human language and expect it to function as directed?”

Sadly, for all the above questions, no. At least not right now. So how do computers understand what we write? Human languages are broad. While English may be the default language for business and socializing, hundreds of human languages exist. Imagine a head of state from a non-English speaking country visiting a head of state of an English-speaking country. How will these individuals communicate with each other yet share no common language? You guessed it! Translators. Translators are professionals who have mastered multiple languages and can translate one human language to another.

Let’s bring the context back to computers. Here is our use case. We want to translate a programming language written with the human language constructs into another language that a computer can understand. Essentially, we want a translator. This translator is what we call a compiler. We can define a compiler as a piece of software that translates source code written in a programming language to a target language that a computer can understand and execute. The target language that the compiler produces is usually low-level, like machine code or assembly language.

Compilers do more than translate. They are also responsible for optimizing the output for fast execution. There is a lot of magic in optimization. #key# LLVM: () #key#, #key# GCC: () #key# & #key# Clang: () #key# are prominent compilers that have served the industry for years.

How are they different from interpreters?

The line between a compiler and an interpreter is blurry. An interpreter does the same thing a compiler does. However, instead of outputting machine code, it executes the instructions on the fly. An interpreter can do this because, unlike a compiler, it translates and executes statements line-by-line. Programming languages like Python and Ruby require interpreters. That is why programs written in these languages are ready to run immediately without requiring a translation step. In contrast, languages like Go or Rust need compilers to convert the instructions written into executable assets.

Why build a compiler?

Compiler construction is not everyone’s chill pill. It requires a different level of thinking and skillset. Why would we consider building one in the first place? For one, great craftsman understands their tools. The same is true for programmers who also interact with the compilers almost daily. Understanding how compilers translate programs to machine code will make us great at optimizing and debugging our programs. By understanding compiler construction, you are in a better position to create your programming language. Such is what we aim to achieve at the tail end of this course. If the above still doesn’t compel you, consider this; it’s fun to build something others deem daunting and complex. All in all, the key motivation is learning.

What is a virtual machine?

A virtual machine in its simplest form is a computing system whose purpose is to execute programmable instructions. From a conceptual perspective, a virtual machine is similar to an actual device, with the only difference being that it is the software rather than a physical device.

Types of virtual machines

Let’s take a brief detour to describe the types of virtual machines and their practical characteristics.

Full-instructions-set virtual machine

These types of virtual machines offer the full capabilities of an operating system. They interact with the hardware, e.g. hard disks, graphics cards, sound cards and network cards, to give the same experience that a physical machine like a laptop or desktop would provide. VirtualBox is the most popular virtual machine that fits this category.

Application binary interface

These virtual machines offer an interface for guest applications to run side-by-side with native applications of that virtual machine. A typical example of this type of virtual machine is an emulator provided by either Android or iOS to debug applications during development.

Language virtual machine

These virtual machines are associated with specific languages to provide runtime capabilities to execute instructions written in those languages. Examples of Language virtual machines include #key# JavaScript engine: () #key# and #key# JVM: () #key#.

Simplifying virtual machines

For our purposes, when we say virtual machine, we mean a Language virtual machine. From the definition above of a virtual machine:

From a conceptual perspective, a virtual machine is similar to an actual device, the only difference being that it is in software rather than a physical machine.

A few things to remember here are; Virtual machines are built-in software. Essentially, that means replicating the capabilities of a physical computer in software. Let’s dig in a little further by exploring how a physical computer works. All modern computers base their designs on concepts developed by John von Neuman: referred to as von Neumann architecture.

The ideas of this architecture are:

A single read-write memory for data and instructions.
Data and instructions in memory are addressable without regard to the data or instruction itself.
Execution of instructions occurs on a step-by-step basis. One instruction at a time.

To grasp the concepts of this architecture, let’s take a brief detour to observe computing from the lens of history. Back then, a computer was composed of hardware units that handled specific tasks. For instance, if you want a computer to perform some arithmetic operation, real hardware is developed with procedures for running such an operation. This hardware was then plugged into a computer so that it may be able to receive trigger signals. The resulting program is in the form of hardware, termed a hardwired program.

What von Neumann architecture proposed is a construction of a general-purpose computer capable of handling varied arithmetic and logical functions. This general-purpose computer would accept data and control instructions and then produce an output due to its operations. Thus, instead of rewiring the hardware for each new program, the programmer merely needs to supply a new set of control signals. These signals are sequentially coded instructions that now the general-purpose computer can decode and execute. These sequentially coded instructions are what we have now come to know as software.

The above summarises the architecture. Instruction code consumes input via some input device. The instruction interpreter converts the input to instructions. The control signal emits an execution signal to the operation unit, which executes the instruction. The cycle continues until the execution of all instructions is complete. The CPU must have some memory where instructions are stored and retrieved. Hence, modern computers have three core components; memory, CPU and I/O. These components are connected in some fashion to achieve a computer’s primary function, which is to execute programs.

The CPU is the centrepiece of a computer. It handles all workloads given to the computer. Internally, the CPU utilizes:

Memory address register specifies the subsequent read/write location.
Memory buffer register, which contains data to be written or read from memory
I/O address register to locate connected I/O devices
I/O buffer register to move data to and from the I/O devices.

The memory serves as a location where the CPU stores and retrieves instructions and data. This storing and retrieval has to be very fast and in no particular order therefore, it is called Random Access Memory (RAM). The I/O components transfer data from external devices to CPU and memory, and vice versa. It contains internal buffers to hold data temporarily before sending it. A computer executes programs by going through a fetch-decode-execute cycle.

First, it fetches instructions from memory. It then decodes these instructions to determine which operation to run. Finally, it executes the instructions. Execution can vary; reading an I/O device, moving data in memory or outputting the results of an operation. This process repeats until all instructions execute. How fast the CPU runs through these cycles determines the speed of a computer. For our purposes, the virtual machine we will be building will have at most two components of a physical computer.

That brings us to the next thing to keep in mind, a virtual machine is similar to an actual device. Since we are building a language virtual machine, we don’t need to replicate the full capabilities of a physical computer. We will focus primarily on the CPU and Memory since all we want is a way to run the instructions that our compiler will generate. It’s important to appreciate that virtual machines come in different shapes and sizes. How you choose to implement a virtual machine depends on the need.

October 3, 2022