October 10, 2022

Crafting a compiler in Rust: Chapter 3

Have you ever wondered how programming languages like Python, Java or Ruby work? Do you have an interest in building your own language? If so, this series of posts is for you. We dive into building a novel programming language from first principles. Get ready to get your hands dirty. It’s going to be a fun ride. Access the final product here: https://github.com/daviddexter/karis-programming-language

A look into lexical analysis

Lexical analysis or scanning is the first step in compiler construction. In this step, we split the source text into small units, called tokens. Tokens represent a specific intent and meaning to the compiler. You can think of tokens as the building blocks of the grammar for the compiler. Splitting the source text into tokens is sometimes called tokenization. We split the program into tokens to provide a more accessible form to process in subsequent steps. The program that performs lexical analysis is called a lexer or scanner. The lexer walks the entire source text and returns available tokens as a list or some other form. Here is an example of program scanning. Consider a simple program:

let x @int = 1;

When we feed the program to the lexer, it returns:

[
    LET,
    VARIABLE(x),
    TYPEKIND(int),
    ASSIGN,
    INTEGER(1),
    SEMICOLON
]

The extracted tokens can either represent some intrinsic meaning of the program or have actual values associated with them. For instance, in the above output, LET represents the let keyword that binds a variable. SEMICOLON, in the same manner, indicates the end of a statement. On the other hand, INTEGER has a value associated with it, which is the actual literal. Depending on the need at hand, whitespace can hold significant meaning. For example, in programs written in Python, whitespace is syntactically significant.

For our purposes, we treat whitespace as a separator. Each token also carries some extra information, such as its position in the source. This information is super valuable in the event of errors. Instead of throwing a generic error, reporting which token caused it, together with its line and column number, comes in handy when debugging. Imagine how challenging it would be without this information; we would be left guessing where an error might have occurred. That is why we include this information in our lexer implementation.

Lexer implementation workflow

We will build an ad hoc lexer. This kind of lexer is specific rather than general; in that sense, our final lexer will only serve our specific purposes. First, we will define our tokens, which is the easiest step. We look at sample programs written in Karis and derive tokens from them. Next, we will implement the feeder. Yes, feeder. It does what we can imagine it to do: it ingests the entire source text. The source text can be of varied sizes, from a single line of text to hundreds or thousands of lines. Finally, we will conclude with the generator.

The generator will have two inter-dependent layers: the extractor, which walks the source text extracting tokens, and the output layer, which either assembles all tokens into a single data structure or emits them one by one. Because it is more fun to learn, we will implement both. It will be nice to compare and contrast the two implementations.

Examine a Karis program

Here is a subset of a Karis program:

let fibonacci @int = fn(n @int){
    if n == 0 {
      return 0;
    };

    if n == 1 || n == 2 {
      return 1
   };

    return fibonacci(n-1) + fibonacci(n-2);
};

@main fn(){
   let x @int = 5;
   let name @string = "Karis";
   let result @int = fibonacci(x);

   print(name);
   print(result);

}@end;

Let’s dig in.

Line 1: We use the let keyword to bind a value, in this case a function, to a name. We follow the binding with fibonacci, a user-provided name. Next is the @int symbol. In Karis, this symbol tells the compiler that the identifier fibonacci has a value whose internal representation is an integer.

The = character assigns whatever is the output on the right-hand side to the identifier on the left-hand side. After the defined function runs, its returned value will be associated with the variable fibonacci.

The fn symbol declares an intent to define a function. The ( and ) characters enclose arguments that we may call the function with. n and @int are similar to what we encountered earlier. Finally, { ends the line by opening a block.

In this single line:

let fibonacci @int = fn(n @int){ < 10
 ^    ^        ^   ^ ^ ^^  ^  ^
 1    2        3   4 5 67  8  9

We have identified ten tokens. Let’s move to the following line. We will skip over items similar to what we encountered earlier.

Line 2: The if symbol begins an intent of control flow. The following new token is ==. However, == is a combination of two = symbols. The meaning of == is not the same as that of =; hence, we treat each as an independent token.

Line 3: We encounter return that indicates sending a value back to the function caller.

Line 6: The || symbol gives us a glimpse into comparison operators.

Line 10: We interpret fibonacci here as a function call rather than a function name identifier. It will hence have its own distinct token.

Line 13: We encounter a rather exciting token. @main. The symbol indicates that the associated function is the program’s entry point. As you can see, this symbol is unique to the Karis language. No other modern programming language defines its entry point in such a manner. Kudos Karis!!!

Line 18 and Line 19: We find print, another function call. Unlike other user-defined functions, this function comes in-built within the program. We can execute it at any time in the program’s lifetime.

Line 21: We end the program with a @end symbol, another unique syntax that is only present in the Karis language.

From the examination, you can tell that lexing can be very lengthy. Each piece of the code has to be analyzed and categorized in isolation. As we build the lexer, we will define a few more tokens that support the operations of the lexer. These tokens don’t have any associated symbol; instead, they are driven purely by the lexer’s logic.

Project file structure

The lexer crate will contain three files.

  1. tokens.rs

This file will hold individual units (tokens) of the language we are building.

  2. lexer.rs

This file defines the entry point of the lexer. It will house the structure that speaks to the purpose of the lexer.

  3. lib.rs

A library file that exports modules of the lexer.

Tokens file

Create the file tokens.rs. We will now define the constants representing the symbols that Karis expects. For example,

pub const COMMA: &str = ",";
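
A few more constants in the same spirit, mirroring the tokens we identified in the sample program, might look like this (a sketch; the exact names in the final repository may differ):

pub const SEMICOLON: &str = ";";
pub const ASSIGN: &str = "=";
pub const LPAREN: &str = "(";
pub const RPAREN: &str = ")";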

Each symbol comes with an associated constant name that refers to it. There is nothing fancy going on here. With the symbols in place, we now need to define an object that the lexer can use to identify a character. Let’s use an example to explain this. COMMA identifies the symbol ,. But we can’t pass COMMA around since it’s a constant. We need a uniform object containing all variants pointing to a symbol. In Rust, we achieve this using the enum type. Append the following to tokens.rs.

#[derive(Debug, PartialEq, Eq, Hash, Clone, Copy)]
pub enum IdentifierKind {
    UNKNOWN,

    EOF, // end of file. Nothing more to read
    EOS, // end of statement or block

    // Identifiers + literals
    VARIABLE,       // add, foobar, x, y, ...
    INTLITERAL,     // 1343456, 1.22, 23.781
    STRINGLITERAL,  // "Alice", "in", "wonderland"
    BOOLEANLITERAL, // "true", "false
    .
    .
    .
}

Here we introduce an enumeration, IdentifierKind. As we can see, the variants are named similarly to the constants to indicate their association. Operational variants such as EOF, EOS, UNKNOWN or BLOCK do not have a corresponding constant identifier. The line #[derive(Debug, PartialEq, Eq, Hash, Clone, Copy)] attaches special meaning to the enum.

In Rust parlance, the enum IdentifierKind “fulfils” the traits Debug, PartialEq, Eq, Hash, Clone and Copy. Traits are analogous to interfaces in other programming languages. We expect each symbol to have a corresponding IdentifierKind, but that’s not always the case. In such a scenario, a sensible default value will suffice. In our case, UNKNOWN fulfils this requirement quite nicely. We implement it as below.


impl Default for IdentifierKind {
    fn default() -> Self {
        IdentifierKind::UNKNOWN
    }
}
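
With this in place, asking for a default IdentifierKind yields UNKNOWN. A quick illustration:

let kind = IdentifierKind::default();
assert_eq!(kind, IdentifierKind::UNKNOWN);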

Finally, we can define the structure that the lexer will return while reading the source text. We will use this form later to build the program’s abstract syntax tree. Essentially, it is a token.


/// Token is a single identifiable unit
#[derive(Debug, Default, PartialEq, Eq, Clone)]
pub struct Token {
    pub token_type: IdentifierKind,
    pub literal: String,
    pub line_number: usize,
    pub column_number: usize,
}

impl Token {
    pub fn new(
        token_type: IdentifierKind,
        literal: String,
        line_number: usize,
        column_number: usize,
    ) -> Token {
        Self {
            token_type,
            literal,
            line_number,
            column_number,
        }
    }
}

Our Token has a token_type, literal, line_number and column_number. token_type tells the lexer what kind of token it is. literal is the value as read from the source. line_number and column_number record where in the source the token was found.
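
As a small illustration, the x variable from our earlier sample program could be represented like so (the column value here is purely illustrative):

let tok = Token::new(
    IdentifierKind::VARIABLE,
    String::from("x"),
    1,
    5,
);
assert_eq!(tok.token_type, IdentifierKind::VARIABLE);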

Lexer file

This file is the core of the entire lexer crate. First, we need to define the structural object of the lexer.

use crate::tokens;

#[derive(Debug, Clone)]
pub struct Lexer {
    input: String,
    position: usize,
    line_number: usize,
    read_position: usize, // <- points to the next character in the input string after the current character
    ch: Option<String>,
    identifier_start_read_position: isize, // <- indicates the starting point when an identifier has been read
    last_read_token: Option<tokens::Token>, // <- the previously read token
    eos: bool, // <- indicates end of statement
}

impl Lexer {
     pub fn new(input: String) -> Lexer {
        Lexer {
            input,
            position: 0,
            line_number: 1,
            read_position: 0,
            ch: None,
            identifier_start_read_position: -1,
            last_read_token: None,
            eos: false,
        }
    }
}

In the first line, we bring in the tokens module we just defined in tokens.rs. Next, we shape the lexer object, which has the following attributes:

  1. input

The actual source code that we feed to the lexer. The input will always come in the form of a string.

  2. position

Points to the character in the input string that the lexer is currently reading.

  3. line_number

The line in the source text that the lexer is currently reading.

  4. read_position

Points to the next character in the input string after the current character.

  5. ch

The actual character the lexer is reading.

  6. identifier_start_read_position

Indicates the starting point when an identifier is in scope.

  7. last_read_token

The token that was read previously.

  8. eos

End of statement indicator.

We will build upon the file to add more sound logic that will guide the lexer.
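
For now, constructing a lexer is as simple as handing it the source text (a quick illustration):

let lx = Lexer::new(String::from("let x @int = 1;"));
println!("{:?}", lx);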

Library file

Here we export what we have defined in token.rs and lexer.rs so they are available in other parts of the project.

pub mod lexer;
pub mod tokens;

Traversing the source

The lexer’s primary job is to walk the entire source text, extracting what each character represents. With every read, the lexer can return either a valid token or an error. You may ask, why would we care about an error? Well, it’s a matter of defensive programming. We should anticipate that the source text may contain erroneous input, and the error indicates that something catastrophic has occurred. The Rust object we will use is Result. Result is an enum that represents either success or failure. To meet our needs, we will define somewhat custom errors; we don’t want to rely on the generic errors provided by the Rust standard library.

Custom errors

Our custom errors will be available across different crates. Hence, we will define them in a crate of their own. We begin by adding a new member to our root Cargo.toml file.

[workspace]
members = ["lexer","errors"]

With that out of the way, hop into the terminal and create a new crate library called errors.

Type the commands in the terminal for execution.

$ cargo new --lib errors

Similar to the previous crate we created, it comes with its own Cargo.toml, src directory and a template lib.rs. The errors crate will be pretty small. It contains definitions of the errors we may encounter, and we will add more as we need them. Inside the errors/src directory, we find nothing other than the template lib.rs file.

$ ls errors/src/  && cat errors/src/lib.rs

Let’s change that. We will begin by defining only one error. We have the choice of using in-built errors that come with the Rust standard library. However, we want to express our own set of errors that fit our purpose. Our custom error will be called KarisError. It’s a simple struct with two fields, error_type and message.

#[derive(Debug, Clone)]
pub struct KarisError {
    pub error_type: KarisErrorType,
    pub message: String,
}

message: It is the literal string representation of the error. It helps in debugging.

error_type: Indicates what type of error we encountered. We anticipate that we will have different types of errors; it will therefore be valuable to be able to tell which exact type of error we found. As you can see, the field is of type KarisErrorType, which is an enum. Let’s define it.

#[derive(Debug, PartialEq, Eq, Copy, Clone)]
pub enum KarisErrorType {
    UnknownToken,
}

We only have one type of error, UnknownToken. We will add more to fit our needs. Let’s continue by adding more infrastructure. An error is valuable if it provides enough information to be actionable. The typical way of consuming such information is by printing the error. In our current setup, we have no way of printing the error to standard error. This is because Rust requires us to be explicit if we need such a mechanism. Let’s fix that.

use std::fmt;

impl fmt::Display for KarisError {
    fn fmt(&self, fmt: &mut fmt::Formatter) -> fmt::Result {
        let err_msg = format!(
            "An error occurred while processing. Details \n \t\t\t Type : {:?} \n \t\t\t Message : {:?}",
            self.error_type, self.message
        );
        write!(fmt, "{}", err_msg)
    }
}

Next, we want our custom error to interoperate with the in-built error provided by the standard library, specifically io::Error. Why is this important? The lexer will read files from the underlying file system, and those operations speak io::Error. By implementing the From trait below, a KarisError can be converted into an io::Error wherever the standard library type is expected. Let’s fix that.


use std::io;

impl From<KarisError> for io::Error {
    fn from(err: KarisError) -> Self {
        io::Error::new(
            io::ErrorKind::InvalidData,
            format!(
                "\n [Error type] : {:?} \n [Message] {} \n",
                err.error_type, err.message
            ),
        )
    }
}
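
To see both pieces in action, here is a small illustration (not part of the crate itself) that builds a KarisError, prints it via the Display implementation, and converts it into an io::Error via the From implementation:

let err = KarisError {
    error_type: KarisErrorType::UnknownToken,
    message: String::from("identifier not known : @foo"),
};

println!("{}", err); // uses the Display implementation above

let io_err: std::io::Error = err.into(); // uses the From implementation above
println!("{}", io_err);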

With that, we finish implementing our custom error. We now expose the error by adding an entry in lib.rs.

pub mod errors;

To demonstrate this, we add errors as a dependency in the lexer crate’s Cargo.toml.

errors = { path = "../errors" }

This addition tells the lexer crate that it should depend on the errors crate found in the directory errors. In Rust, we can source dependencies from varied locations; in the example above, it’s the filesystem. Other sources are crates.io or any Rust git repository.

Read tokens implementation

Back to the lexer.rs. We will now define a function called read_tokens which will serve as the actionable entry point of the lexer.

pub fn read_tokens(&mut self) -> Result<tokens::Token, errors::KarisError> {
    match self.eos {
            true => {
                self.eos = false;
                Ok(tokens::Token::new(
                    tokens::IdentifierKind::EOS,
                    String::from(""),
                    self.line_number,
                    self.position,
                ))
            }
            false => {
                todo!("todo")
            }
    }
}

Let’s dissect its logic. In Line 2, we use Rust’s match syntax on the eos attribute, which is part of the lexer. match is similar to what other languages call switch. If we encounter a situation where eos is true, we return a token of kind IdentifierKind::EOS. The question we should ask is: when will we encounter such a situation? It’s simple. When the lexer completes analyzing a statement, eos is set to true so that it reminds itself in the next run where to start. Don’t worry if that seems obtuse right now; we will observe it in action as we proceed. We need to do something if eos is false. That is, we should read a token from the source again.

let res = self.new_token();
if let Ok(tok) = &res {
    if tok.token_type == tokens::IdentifierKind::SEMICOLON {
        let default = tokens::Token::new(
                            tokens::IdentifierKind::UNKNOWN,
                            String::new(),
                            self.line_number,
                            self.position,
        );

        let last_tok = self.last_read_token.as_ref().unwrap_or(&default);

        match last_tok.token_type {
                tokens::IdentifierKind::INTLITERAL
                | tokens::IdentifierKind::STRINGLITERAL
                | tokens::IdentifierKind::TRUE
                | tokens::IdentifierKind::FALSE
                | tokens::IdentifierKind::RPAREN
                | tokens::IdentifierKind::RBRACE
                | tokens::IdentifierKind::END => {
                    self.eos = true;
                }
                _ => {
                    self.eos = false;
                }
        }
    }

    self.last_read_token = Some(tok.clone());
}
res

In Line 1 we introduce a new function, new_token. This function is responsible for doing the actual reading of the source text. It spits out one token at a time until we reach the end of the source text. In Line 2, we extract the token from the returned Result using a neat Rust feature, if let. If we have a token (which we should), we check whether its token_type attribute matches that of a SEMICOLON, as seen in Line 3.

Let’s pause to understand the underlying reasoning of what we want to achieve here. Every statement in the Karis language is terminated with a semicolon. However, a semicolon only terminates a statement in a few specific cases, depending on the previously read token. A statement is eligible for termination if the previous token has a token_type that matches either INTLITERAL, STRINGLITERAL, TRUE, FALSE, RPAREN, RBRACE or END. In such a scenario, we must set the lexer’s eos attribute to true.

We can now proceed. Line 4 to Line 9 define a default token, which will be used if the retrieval in Line 11 finds no previously read token. Next, we match the previous token against the terminating token kinds and update eos appropriately, as seen in Line 21 and Line 24. Finally, we update last_read_token and return the result.
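
Putting read_tokens into perspective, a caller would typically drive the lexer in a loop until it sees an EOF token. The sketch below illustrates the intended usage rather than code from the crate:

let mut lx = Lexer::new(String::from("let x @int = 1;"));

loop {
    match lx.read_tokens() {
        Ok(token) => {
            if token.token_type == tokens::IdentifierKind::EOF {
                break;
            }
            println!("{:?}", token);
        }
        Err(err) => {
            eprintln!("{}", err);
            break;
        }
    }
}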

New token implementation

The core of new_token is simple.

fn new_token(&mut self) -> Result<tokens::Token, errors::KarisError> {
        if self.read_position == 0x0 {
            self.read_char();
        }

        self.skip_whitespace_and_string_escape();

        match &self.ch {
            Some(ch) => todo!("todo"),
            None => todo!("todo"),
        }

}

In each run, we first check whether read_position is at the very start of the input, that is, position zero.

read_position points to the character in the following position in the input string.

If so, we call another function called read_char (to be defined later). We then skip over any whitespace in the input text. In our case, whitespace has no intrinsic meaning; it is only a delimiter. With whitespace out of the way, we can now do some work.

Overview of helper functions

Sound software engineering dictates that whenever you have a logic called multiple times from multiple places, it is better to have it as a single function that can be called numerous times instead of duplicating logic. This practice is what we want to achieve by defining a few helper functions. The helper functions will vary in logic complexity. We will strive, however, to make them as simple as possible.

read_char

This function compares read_position against the size of the source text and reads the character at that position.

ch is the actual character that is in scope.

fn read_char(&mut self) {
    if self.read_position >= self.input.len() {
            self.ch = Some(String::from(tokens::NULL));
    } else {
            let item = self.input.get(self.read_position..self.read_position + 0x1);
            if let Some(val) = item {
                 self.ch = Some(String::from(val));
            }
    }

    self.read_position += 0x1;
}

In Line 2, we first check the current value of read_position against the size of the entire source text. Specifically, we assert whether read_position is greater than or equal to the size of the source text. Such an assertion will be true once we have reached the end of the source text. If that is the case, we set ch to NULL in Line 3 to end the lexing. Otherwise, in Line 5, we try to read the item between read_position and the position after it. Let’s put this into context; imagine the source text as a string of variable length.

let source_text = "source text";

When we begin reading from the start of the source text, read_position is initialized at position 0, which points to the character s. Since we have not reached the end of the source text, we index into the string. We can visualize the source text as an array of characters. In this example, to get the item at index 0, we call Rust’s String get from 0 to the following character index, 0+1. Unlike languages like Python, Rust does not allow typical string indexing.

let source_text = "source text";
let item = source_text[0]; // <- this is illegal
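
The legal route, and the one read_char takes, is to ask for a sub-slice with get, which returns an Option:

let source_text = "source text";
let item = source_text.get(0..1); // <- Some("s")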

Recall: read_position points to the next character in the input string after the current character.

In Line 10, we advance read_position so that it points to the next character.

move_current_position_and_read

This function moves the cursor to point to the index of the current character in focus. It works in tandem with read_char to update ch.

Recall: ch is the literal value of the current character

fn move_current_position_and_read(&mut self) {
        self.position += 1;
        self.read_char()
}

skip_whitespace_and_string_escape

This function removes whitespace, newline characters and tab characters. It recursively runs until it finds a character that is not a space, newline, or tab character.

fn skip_whitespace_and_string_escape(&mut self) {
        let current = self.ch.as_ref().unwrap();

        if is_newline(current.as_bytes().first().unwrap()) {
            self.line_number += 0x1;
            self.move_current_position_and_read();
            return self.skip_whitespace_and_string_escape();
        }

        if is_space(current.as_bytes().first().unwrap())
            || is_newline(current.as_bytes().first().unwrap())
        {
            self.move_current_position_and_read();
            self.skip_whitespace_and_string_escape()
        }
}

In Line 2, we retrieve the current character in focus. Predictably, it will be a single character only. We then convert it into bytes and assert whether the first byte is a newline character, as evidenced on Line 4. If the assertion holds, we update line_number, move the cursor forward, and then call skip_whitespace_and_string_escape again on Line 7. In Line 10, we do a similar thing, but this time also checking for whitespace.
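
The byte-level predicates is_space and is_newline are not shown in this excerpt. The real project may source them from a parsing crate, but minimal hand-rolled sketches, with signatures matching the call sites above (which pass a borrowed byte), would be:

fn is_space(chr: &u8) -> bool {
    // ASCII space or horizontal tab
    *chr == b' ' || *chr == b'\t'
}

fn is_newline(chr: &u8) -> bool {
    // ASCII line feed or carriage return
    *chr == b'\n' || *chr == b'\r'
}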

forward_is_any_token

This function asserts whether the token residing within the bounds of position and read_position matches any of the tokens provided in the examples vector.

fn forward_is_any_token(&self, examples: Vec<&str>) -> Option<bool> {
        // peek forward
        let item = self.input.get(self.position + 1..self.read_position + 1);
        let mut r = false;
        for x in examples.iter() {
            if item == Some(*x) {
                r = true;
                break;
            } else {
                continue;
            }
        }
        Some(r)
}

In Line 3, we peek at the token within the bounds of position and read_position. In Line 5, we iterate through the examples vector (Rust’s growable array type) and check whether any entry matches the retrieved item. If a match is found, we end the check after setting the return value to true. Otherwise, we continue the iteration and check all items in the examples vector.
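
As a hypothetical example of where this helps once we start tokenizing: when the current character is =, peeking one character ahead tells us whether we are looking at a plain assignment or the == operator.

// Hypothetical call site inside the lexer's tokenization logic
if self.forward_is_any_token(vec!["="]).unwrap_or(false) {
    // the next character is also `=`, so the token is `==`
} else {
    // a lone `=`, so the token is a plain assignment
}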

unknown_token_error

This function will return an error when it finds an unknown token.

fn unknown_token_error(&self, ident: &str) -> Result<tokens::Token, errors::KarisError> {
        Err(errors::KarisError {
            error_type: errors::KarisErrorType::UnknownToken,
            message: format!("identifier not known : {}", ident),
        })
}

traverse_to_closing_quotations

This function moves along the length of a string literal until it encounters a closing quote mark.

fn traverse_to_closing_quotations(&self, begin: usize) -> usize {
        let next = self.input.get(begin..begin + 0x1).unwrap();
        if is_quotation_mark(next) {
            begin
        } else {
            self.traverse_to_closing_quotations(begin + 0x1)
        }
}

Line 3 is the core of the function. It asserts whether the item retrieved in Line 2 is a quotation mark. The function will come in handy when analyzing string literals.
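
As a hypothetical illustration, when the lexer has just read an opening quote, it can jump straight to the closing quote and slice out everything in between as the string literal:

// Hypothetical usage; `self.position` points at the opening quote mark
let start = self.position + 1;
let end = self.traverse_to_closing_quotations(start);
let literal = self.input.get(start..end).unwrap_or_default();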

The definitions below are top-level functions. They are not part of the lexer impl block.

is_alphanumeric_only

This function checks whether an argument consists only of digits, alphabetic characters and allowed symbols such as underscores. For example, fname, fname0, fname2, f_name, f_name1 or lname123.

fn is_alphanumeric_only(i: &str) -> bool {
    let mut not_valid_count = 0;
    for x in i.as_bytes() {
        if !is_allowed_alphanumeric_and_char(*x) {
            not_valid_count += 1;
        }
    }
    not_valid_count == 0
}

In Line 3, we iterate over each byte of the string and assert whether that byte is an allowed alphanumeric character.
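
A couple of quick checks illustrate the behaviour:

assert!(is_alphanumeric_only("f_name1"));
assert!(!is_alphanumeric_only("f-name")); // `-` is not an allowed character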

is_integers_only

This function checks if an argument is composed of integers only.

fn is_integers_only(i: &str) -> bool {
    for x in i.as_bytes() {
        if !is_digit(*x) {
            return false;
        } else {
            continue;
        }
    }
    true
}

In Line 2, we iterate over each byte of the string, then assert on Line 3 whether that byte is a digit, that is, zero to nine.

is_quotation_mark

This function checks if an argument is a quote character—specifically, double-quote or single-quote characters.

fn is_quotation_mark(i: &str) -> bool {
    for x in i.as_bytes() {
        if *x == 0x22 || *x == 0x27 {
            return true;
        } else {
            continue;
        }
    }
    false
}

In Line 2, we iterate over each byte representation of the string, then assert on Line 3 whether that byte is equivalent to the byte representation of either a double-quote or single-quote character.

is_allowed_alphanumeric_and_char

This function asserts that the given character is a valid alphanumeric or symbol.

fn is_allowed_alphanumeric_and_char(chr: u8) -> bool {
    is_alphabetic(chr) || is_digit(chr) || is_underscore(chr) || is_hash(chr)
}

It’s a generic function that combines other smaller checks to serve a specific use case. With all the necessary helper functions in place, we are ready to move to the next step of doing the actual tokenization.
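
The remaining byte predicates, is_alphabetic, is_digit, is_underscore and is_hash, are equally small and are not shown in this excerpt. The real project may take the first two from a parsing crate; minimal ASCII-only sketches would be:

fn is_alphabetic(chr: u8) -> bool {
    chr.is_ascii_alphabetic()
}

fn is_digit(chr: u8) -> bool {
    chr.is_ascii_digit()
}

fn is_underscore(chr: u8) -> bool {
    chr == b'_'
}

fn is_hash(chr: u8) -> bool {
    // assumption: a `#` character; the real project may check a different marker
    chr == b'#'
}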
