October 15, 2022

Crafting a compiler in Rust: Chapter 4

Have you ever wondered how programming languages like Python, Java or Ruby work? Do you have an interest in building your own language? If so, this series of posts is for you. We dive into building a novel programming language from first principles. Get ready to get your hands dirty. It’s going to be a fun ride. Access the final product here: https://github.com/daviddexter/karis-programming-language

Overview of character tokenization

The term “character” is relatively broad. 1 is a character. So is & or #. For our purposes in this lesson, “character” will refer to any non-alphanumeric character; we do not consider digits or letters here. Karis accepts a fairly broad set of characters, with the notable exception of ~.

One-piece characters

These characters are the easiest to tokenize. We call them one-piece characters because they are composed of only one character. An example would be ; or (. Nowhere in Karis syntax do we expect to have ;; or ((. Even visually, it looks fundamentally wrong.
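As a quick refresher from the earlier chapters, these one-piece characters live in tokens.rs as plain string constants, which is what lets the lexer match against them directly. A rough sketch follows; the COMMA, SEMICOLON, LPAREN and ASSIGN names appear later in this chapter, while the remaining names are assumptions, so check tokens.rs in the repository for the exact set.

// tokens.rs (sketch): one-piece characters declared as &str constants.
pub const COMMA: &str = ",";
pub const SEMICOLON: &str = ";";
pub const LPAREN: &str = "(";
pub const RPAREN: &str = ")";
pub const LBRACE: &str = "{";
pub const RBRACE: &str = "}";
pub const ASSIGN: &str = "=";
// the names below are assumed for illustration
pub const PLUS: &str = "+";
pub const MINUS: &str = "-";
pub const BANG: &str = "!";
pub const ASTERISK: &str = "*";
pub const SLASH: &str = "/";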

COMMA token

We represent a comma as COMMA in the lexer. Let’s see how we will achieve it. In lexer.rs, we will expand the new_token function to contain the tokenization logic. In the current implementation, after skipping whitespace, we reach the match expression, where we evaluate the presence of a character. There are two possible cases, presence or absence, indicated with Some and None as shown on Lines 8 and 9, respectively.

fn new_token(&mut self) -> Result<tokens::Token, errors::KarisError> {
        if self.read_position == 0x0 {
            self.read_char();
        }

        self.skip_whitespace_and_string_escape();

        match &self.ch {
            Some(ch) => {}
            None => { },
        }
}

Let’s handle the absent (None) bit first since it’s relatively straightforward. We earlier defined an unknown_token_error() function. We’ll put it to use here. Add this line in the None block.

self.unknown_token_error(" ")

That’s all that we need to do. Let’s now move to the Some block.

let ch_owned = ch.clone();

let tok = match ch.as_str() {
    tokens::COMMA => Ok(tokens::Token::new(
        tokens::IdentifierKind::COMMA,
        ch_owned,
        self.line_number,
        self.position,
    )),
    // the remaining characters are handled in the sections that follow;
    // a placeholder keeps the match exhaustive until then
    _ => todo!(),
};

self.move_current_position_and_read();
tok

In Line 1, we retrieve the current character in scope. Since the “character” is of type String, we can’t just assign it. In Rust, String does not implement the Copy trait. This trait allows values to be “copied” around in memory; essentially, a copy creates a duplicate in memory. The types that implement this trait are primitives like u8, u16, i32, bool and &str. String values are heap allocated and lack native copy support. As such, we must explicitly state that we intend to copy by calling the clone method.
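A quick standalone illustration of the difference between Copy and Clone (not part of the lexer, just a minimal example):

fn main() {
    let a: i32 = 10;
    let b = a; // i32 implements Copy: `a` remains usable afterwards
    println!("{a} {b}");

    let s = String::from("comma");
    // let t = s; // this would *move* `s`; using `s` afterwards fails to compile
    let t = s.clone(); // explicit duplicate of the heap-allocated data
    println!("{s} {t}");
}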

In Line 3, we match the current ch against a set of patterns. The first pattern is tokens::COMMA on Line 4. If the pattern matches, we return a token object whose identifier is of kind tokens::IdentifierKind::COMMA. See, that was straightforward. The move_current_position_and_read function defined earlier is called last to move the cursor forward so that we can process the next character.

Similar one-piece character patterns

Moving on, let’s add similar pattern checks for the other one-piece characters and return their corresponding tokens, as sketched below.
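For instance, the arms for the semicolon and parentheses mirror the COMMA arm exactly; only the constant and the IdentifierKind change. A sketch (the SEMICOLON, LPAREN and RPAREN identifier kinds are assumed here; the exact names live in tokens.rs):

tokens::SEMICOLON => Ok(tokens::Token::new(
    tokens::IdentifierKind::SEMICOLON,
    ch_owned,
    self.line_number,
    self.position,
)),
tokens::LPAREN => Ok(tokens::Token::new(
    tokens::IdentifierKind::LPAREN,
    ch_owned,
    self.line_number,
    self.position,
)),
tokens::RPAREN => Ok(tokens::Token::new(
    tokens::IdentifierKind::RPAREN,
    ch_owned,
    self.line_number,
    self.position,
)),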

Two-piece characters

Two-piece characters come in combinations of two. Consider a token signifying greater-than-or-equal-to: it is represented by >=, which is a combination of > and =.

ASSIGN or EQUAL token

| Character | Meaning |
| --------- | ------- |
| =         | Assign  |
| ==        | Equal   |

Before we introduce the implementation, let’s first outline how the logic will work: when the current character is =, we peek one character ahead. If the next character is also =, the two together form an EQ token; otherwise, the single = is an ASSIGN token.

Let’s demonstrate it in code.

tokens::ASSIGN => match self.forward_is_any_token(vec![tokens::ASSIGN]) {
    Some(val) => {
        if val {
            self.move_current_position_and_read();
            Ok(tokens::Token::new(
                tokens::IdentifierKind::EQ,
                ch_owned,
                self.line_number,
                self.position,
            ))
        } else {
            Ok(tokens::Token::new(
                tokens::IdentifierKind::ASSIGN,
                ch_owned,
                self.line_number,
                self.position,
            ))
        }
    }
    None => Ok(tokens::Token::new(
        tokens::IdentifierKind::ASSIGN,
        ch_owned,
        self.line_number,
        self.position,
    )),
},

In Line 1, we use the forward_is_any_token helper function to check whether the next character matches ASSIGN. The helper returns either a value or nothing. In Line 2, we are sure a result is present. In Line 3, we check whether a match was indeed found and then return an EQ token. Otherwise, in Line 12, we produce an ASSIGN token.
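For reference, forward_is_any_token was introduced in an earlier chapter. A plausible sketch of its shape, using the same slicing convention as the rest of the lexer (the actual implementation in the repository may differ):

// Peek at the character right after the current one and report whether it
// matches any of the candidate literals.
fn forward_is_any_token(&self, candidates: Vec<&str>) -> Option<bool> {
    let next = self.input.get(self.position + 0x1..self.read_position + 0x1)?;
    Some(candidates.contains(&next))
}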

OR token

| Character | Meaning    |
| --------- | ---------- |
| \|        | Logical OR |
| \|\|      | OR         |

Before we introduce the implementation, let’s first outline how the logic will work: when the current character is |, we peek one character ahead. If the next character is also |, the pair forms an OR token; otherwise, the single | is a LOR token.

Let’s demonstrate it in code.

tokens::PIPE => match self.forward_is_any_token(vec![tokens::PIPE]) {
        Some(val) => {
            if val {
                self.move_current_position_and_read();
                let lit = format!("{:?}{:?}", tokens::PIPE, tokens::PIPE);
                Ok(tokens::Token::new(
                    tokens::IdentifierKind::OR,
                    lit,
                    self.line_number,
                    self.position,
                ))
            } else {
                Ok(tokens::Token::new(
                    tokens::IdentifierKind::LOR,
                    ch_owned,
                    self.line_number,
                    self.position,
                ))
            }
        }

        None => Ok(tokens::Token::new(
            tokens::IdentifierKind::LOR,
            ch_owned,
            self.line_number,
            self.position,
        )),
},

In Line 1, we use the forward_is_any_token helper function to check whether the next character matches PIPE. The helper returns either a value or nothing. In Line 2, we are sure a result is present. In Line 3, we check whether a match was indeed found and then return an OR token. Otherwise, in Line 12, we return a LOR token.

AND token

| Character | Meaning     |
| --------- | ----------- |
| &         | Logical AND |
| &&        | AND         |

Tokenization of the & character borrows its implementation from the tokenization of = and |. The code snippet below demonstrates that.

tokens::AMPERSAND => match self.forward_is_any_token(vec![tokens::AMPERSAND]) {
    Some(val) => {
        if val {
            self.move_current_position_and_read();
            let lit = format!("{:?}{:?}", tokens::AMPERSAND, tokens::AMPERSAND);
            Ok(tokens::Token::new(
                tokens::IdentifierKind::AND,
                lit,
                self.line_number,
                self.position,
            ))
        } else {
            Ok(tokens::Token::new(
                tokens::IdentifierKind::LAND,
                ch_owned,
                self.line_number,
                self.position,
            ))
        }
    }
    None => Ok(tokens::Token::new(
            tokens::IdentifierKind::LAND,
            ch_owned,
            self.line_number,
            self.position,
    )),
},

In Line 1, we use the forward_is_any_token helper function to check whether the next character matches AMPERSAND. The helper returns either a value or nothing. In Line 2, we are sure a result is present. In Line 3, we check whether a match was indeed found and then return an AND token. Otherwise, in Line 12, we return a LAND token.

NOTEQ or BANG token

We represent the NOTEQ token using a combination of ! and =, while the BANG token is represented by a single !. As we will see later, ! eventually evaluates to mean negate. Let’s see how we tokenize it in code.

tokens::BANG => match self.forward_is_any_token(vec![tokens::ASSIGN]) {
    Some(val) => {
        if val {
            self.move_current_position_and_read();
            Ok(tokens::Token::new(
                tokens::IdentifierKind::NOTEQ,
                ch_owned,
                self.line_number,
                self.position,
            ))
        } else {
            Ok(tokens::Token::new(
                tokens::IdentifierKind::BANG,
                ch_owned,
                self.line_number,
                self.position,
            ))
        }
    }
    None => Ok(tokens::Token::new(
        tokens::IdentifierKind::BANG,
        ch_owned,
        self.line_number,
        self.position,
    )),
},

In Line 1, we retrieve the next character and compare it against the ASSIGN token. At this point, we already know that the current character is a !. If the match is affirmative, we move the cursor forward as shown on Line 4, and then we return a NOTEQ token. Otherwise, we produce a BANG token. Tokenizing the LTOREQ, LT, GTOREQ and GT tokens follows the same logic pattern of checking the next character against the ASSIGN token and returning the corresponding token, as sketched below. We now have the infrastructure to tokenize characters in our source text.
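For example, the greater-than arm could look like the sketch below, mirroring the BANG arm above (GT and GTOREQ are the identifier kinds referenced later in this chapter):

tokens::GT => match self.forward_is_any_token(vec![tokens::ASSIGN]) {
    Some(val) => {
        if val {
            // the next character is `=`, so `>=` becomes a single GTOREQ token
            self.move_current_position_and_read();
            Ok(tokens::Token::new(
                tokens::IdentifierKind::GTOREQ,
                ch_owned,
                self.line_number,
                self.position,
            ))
        } else {
            Ok(tokens::Token::new(
                tokens::IdentifierKind::GT,
                ch_owned,
                self.line_number,
                self.position,
            ))
        }
    }
    None => Ok(tokens::Token::new(
        tokens::IdentifierKind::GT,
        ch_owned,
        self.line_number,
        self.position,
    )),
},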

Anatomy of tests in Rust

Rust comes with a powerful test runner that allows us to get to testing without any external dependencies. At its simplest, a test is a normal function decorated with a #[test] attribute. Let’s explore further with a concrete example.

In the lexer.rs, we will add the snippet below:

#[cfg(test)]
mod lexer_tests {
    use super::*;

    #[test]
    fn should_test_something() {
        // test body
    }

}

In Line 2, we introduce a new module, lexer_tests. The module’s name can be anything we need it to be. What is important is the #[cfg(test)] attribute on Line 1. The attribute tells the compiler to build the lexer_tests module only when running tests. In Line 3, we “import” everything from the parent namespace into the module’s namespace. This importation allows us to access definitions in the parent namespace. In Line 5, the function should_test_something has an attribute above it. This attribute does only one thing: it identifies the function as a test. Hence it will not be part of the final binary after compilation.

Tests implementation

Let’s now add an actual test case to verify that our lexer tokenizes characters.

Verify character tokenization

We will begin by writing a test that asserts whether our current implementation can read any character we pass to it and return the corresponding token. Inside the lexer_tests module, add this test:

#[test]
fn should_read_char() {
    let mut lx = Lexer::new(String::from(",;(){}=+-!*/<>"));
}

In Line 3, we instantiate the Lexer with a string input containing characters. The expectation is that the lexer will be able to pinpoint what each character represents. As such, we need to add an assertion. Examining the input string, we expect the first character to be a COMMA. Let’s prove that.

assert_eq!(
    lx.new_token().unwrap().token_type,
    tokens::IdentifierKind::COMMA
);

Wonderful. It works. Let’s trigger a failure and see how it behaves. Amend the expectation from tokens::IdentifierKind::COMMA to tokens::IdentifierKind::ASSIGN, then run the test again. It fails, exactly as we expected. Let’s now move on and add more assertions; each character gets its own assertion, as sketched below. We then have a complete test verifying that our lexer can read characters and return their corresponding token kinds.
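The follow-up assertions simply walk the rest of the input string in order. A sketch of the next few (the SEMICOLON, LPAREN and RPAREN identifier kinds are assumed here; the exact names live in tokens.rs):

assert_eq!(
    lx.new_token().unwrap().token_type,
    tokens::IdentifierKind::SEMICOLON
);
assert_eq!(
    lx.new_token().unwrap().token_type,
    tokens::IdentifierKind::LPAREN
);
assert_eq!(
    lx.new_token().unwrap().token_type,
    tokens::IdentifierKind::RPAREN
);
// ... and so on for the remaining characters in the input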

Verify one-piece and two-piece tokenization

We will feed the lexer an input string containing both one-piece and two-piece characters. The expectation is that it should be able to tell whether a character is one-piece or two-piece. For instance, >= should not return two tokens, GT and ASSIGN. Instead, it should return a single token of the kind GTOREQ.

#[test]
fn should_read_equal_token0() {
    let mut lx0 = Lexer::new(String::from("; == && || & |"));
    /* similar assertions */
}

The overall logic structure of this test is quite similar to the one described above.
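The trimmed assertions walk the input "; == && || & |" in order, roughly like this (the SEMICOLON identifier kind is assumed; the rest appear earlier in this chapter):

assert_eq!(
    lx0.new_token().unwrap().token_type,
    tokens::IdentifierKind::SEMICOLON
);
assert_eq!(
    lx0.new_token().unwrap().token_type,
    tokens::IdentifierKind::EQ
);
assert_eq!(
    lx0.new_token().unwrap().token_type,
    tokens::IdentifierKind::AND
);
assert_eq!(
    lx0.new_token().unwrap().token_type,
    tokens::IdentifierKind::OR
);
assert_eq!(
    lx0.new_token().unwrap().token_type,
    tokens::IdentifierKind::LAND
);
assert_eq!(
    lx0.new_token().unwrap().token_type,
    tokens::IdentifierKind::LOR
);

Run it with: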

$ cargo test --package lexer --lib -- lexer::lexer_tests::should_read_equal_token0

Fantastic, it works like a charm.

Similar tokenization verifications

Let’s expand the test with relatively similar test cases.

Test for EQ token

We begin with the EQ token.

#[test]
fn should_read_equal_token1() {
    let mut lx = Lexer::new(String::from("=="));
    assert_eq!(
        lx.new_token().unwrap().token_type,
        tokens::IdentifierKind::EQ
    );
}

In this test, given an input string that contains ==, we expect the lexer to identify its character-set as an EQ token.

Test an assortment of characters

In the following test case, we feed the lexer with a long sequence of characters. This test should validate that the lexer walks along the entire text length and identifies tokens. If the test succeeds, we can rest assured that the lexer can handle input texts of any size.

#[test]
fn should_read_equal_token4() {
    let mut lx = Lexer::new(String::from("==+>=!()=<>*<=-/~ -"));
    assert_eq!(
        lx.new_token().unwrap().token_type,
        tokens::IdentifierKind::EQ
    );
    /* trimmed for readability */
    assert!(lx.new_token().is_err());
    assert_eq!(
        lx.new_token().unwrap().token_type,
        tokens::IdentifierKind::MINUS
    );
}

There is one peculiar thing about the test case above. Near the tail end of the input string, we see a rather strange character, ~. The character is not part of our token set. The lexer should therefore return an error when it encounters the character, and that is exactly what we assert in Line 9. We have now demonstrated that our lexer can:

1. Identify any legal character and deduce its token.
2. Identify illegal characters.
3. Handle both one-piece and two-piece characters and deduce their respective tokens.

Overview of typings tokenization

Karis is a statically typed language. The compiler that converts programs written in Karis needs to know what “kind” of data a variable holds. This typing information is explicitly declared using syntax markers that the language provides: @int, @string, @bool, and @unit. Here, we will implement mechanisms to tokenize such typing information. We will also include tokenization of @main and @end, since they share an almost identical tokenization logic flow. Let’s examine the structure of typing information.

@int
@string
@bool

Each typing has a prefix of @ followed by an alphabetic sequence of characters. With this insight at hand, we know that to tokenize a “Typing”, we first need to match its prefix and then its tail sequence.

Implementation

We define a new function, extract_token_from_alphabet(). As its name indicates, this function will be responsible for extracting a token from a sequence of alphabetic characters; essentially, any character that is not a symbol, except for @.

fn extract_token_from_alphabet(&mut self) -> Result<tokens::Token, errors::KarisError> {
    // implementation here
}

Inside the function, let’s add a match and run a pattern match on the current character under observation.

match &self.ch {
    Some(current_literal) => {}
    None => self.unknown_token_error(tokens::NULL),
}

Rightly so, if we fail to match the current character, we return an unknown-token error built from NULL. Before the match block, we will introduce a reusable closure that looks ahead in the input string and returns the corresponding character.

// peek forward
let next_char_fn =  |begin: usize, end: usize, default| self.input.get(begin..end).unwrap_or(default);

Inside the Some block, we call the closure to retrieve the next character from the input string.

let next_char = next_char_fn(
    self.position + 0x1,
    self.read_position + 0x1,
    tokens::SEMICOLON,
);

Now that we have the current character and a means to get the next character, we can now add some checks to assert either of the following:

  1. If the current character is @, are the following four characters main?
  2. If the current character is @, are the following three characters end?
  3. If the next character is @, do the following sequence match either int, string, bool or unit?

Let’s handle the first check.

if current_literal == tokens::AT && next_char_fn(self.position, self.read_position + 0x4, tokens::NULL) == tokens::MAIN
{
    match self.input.get(self.identifier_start_read_position as usize..self.read_position + 0x4)
        {
        Some(ident) => {
            self.identifier_start_read_position = -0x1;
            self.position += 0x4;
            self.read_position += 0x4;

            let ident_owned = String::from(ident);
            Ok(tokens::Token::new(
                tokens::IdentifierKind::MAIN,
                ident_owned,
                self.line_number,
                self.position,
            ))
        }
        None => self.unknown_token_error(tokens::NULL),
    }
}

In Line 1, we check whether the current character is of kind AT and whether the sequence of four characters after it is of kind MAIN. When the condition holds, we are sure that we have encountered a @main pattern. From Line 7 to Line 8, we advance the cursor position by moving four steps forward. Finally, in Line 11, we return a token of the identifier MAIN. For END, the steps are similar except for the beginning, where we retrieve a sequence of three characters instead of four.

if current_literal == tokens::AT && next_char_fn(self.position, self.read_position + 0x3, tokens::NULL)== tokens::END
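The body of the END branch then mirrors the MAIN branch, stepping three characters forward instead of four. A hedged sketch (assuming an IdentifierKind::END variant; check tokens.rs for the exact name):

{
    match self.input.get(self.identifier_start_read_position as usize..self.read_position + 0x3) {
        Some(ident) => {
            self.identifier_start_read_position = -0x1;
            // move the cursor three steps forward, past `end`
            self.position += 0x3;
            self.read_position += 0x3;

            let ident_owned = String::from(ident);
            Ok(tokens::Token::new(
                tokens::IdentifierKind::END,
                ident_owned,
                self.line_number,
                self.position,
            ))
        }
        None => self.unknown_token_error(tokens::NULL),
    }
}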

Good. We are making progress. The next check is a little bit involved. Once we assert that the current character is @, we will retrieve the sequence of characters between the identifier_start_read_position and read_position. These characters are then checked to assert if they match @int, @string, @bool or @unit.

if next_char == tokens::AT{
    match self.input.get(self.identifier_start_read_position as usize..self.read_position)
    {
        Some(ident) => {
            let ident_owned = String::from(ident);
            let tok = match ident {
                tokens::INT => Ok(tokens::Token::new(
                                    tokens::IdentifierKind::INTTYPE,
                                    String::from(ident),
                                    self.line_number,
                                    self.position,
                )),
                /*trimmed for readability*/
                _ => todo!("todo")
            };

            self.identifier_start_read_position = -0x1;
            tok

        }
        None => self.unknown_token_error(tokens::NULL),

    }
}

Notice that we have introduced another set of todos. While both will be retired at some point, the empty todo is interesting. Rust does not allow an if expression that produces a value without a corresponding else, hence we are forced to add an empty one as a placeholder. With extract_token_from_alphabet in place, we will inject it into the main logic flow of the new_token function. At the tail of our current new_token function, we have a todo that we will now retire and replace with:

{
    if self.identifier_start_read_position == -0x1 {
        self.identifier_start_read_position = self.position as isize;
    }
    self.extract_token_from_alphabet()
}

We will expand our test suite to verify tokenization of typing information works correctly.

Verify @int

Let’s begin by adding a simple test.

#[test]
fn should_read_equal_int_typing() {
    let mut lx = Lexer::new(String::from("@int"));
    assert_eq!(
        lx.read_tokens().unwrap().token_type,
        tokens::IdentifierKind::INTTYPE
    );
}

As you can see, very simple.

$ cargo test lexer::lexer_tests::should_read_equal_int_typing

Well, that was unexpected. The test fails. It seems our implementation reaches the todo!("empty todo") block. Why is that? In our earlier check, if next_char == tokens::AT, we examine the next character instead of the current one. On its own, that check does not make sense. Let’s go back and have a look at valid Karis syntax.

let echo @unit = fn(name @string){
    print("Name #name")
};

From what we have learned about lexers, a lexer walks down the source code line by line until it finishes. Our implementation, however, does not satisfy that requirement. If we feed it the snippet above, it will only run on the first line. What we need is to add multi-line support. Why is multi-line support necessary? In practice, we write source code in text files and expect the programming language to read the source code from those files. The anatomy of a text file is that it contains entries along each line, with lines separated by the \n ASCII character. Karis’s lexer does not yet have such an ability since it does not know how to handle line breaks. Let’s change that by amending the check into the else if condition shown in the snippet below.

else if is_space(next_char.as_bytes().first().unwrap())
                    || next_char == tokens::COMMA
                    || next_char == tokens::LBRACE
                    || next_char == tokens::RPAREN
                    || next_char == tokens::AT
                    || next_char == tokens::SEMICOLON
                    || next_char == tokens::LPAREN

Here, we check whether the next character is a whitespace, comma, left brace, right parenthesis, at, semicolon, or left parenthesis. The reasoning for this change is that a line can end with either a left brace {, a semicolon ;, or a right brace }. Likewise, a new line begins with whitespace or alphanumeric characters. In Line 2, we handle a somewhat subtle edge case of entity separation with a comma. This proves significant when we tokenize function arguments.
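The is_space helper referenced above is a small byte-level check. It is not shown in this chapter, so here is a plausible sketch of its shape (the real helper in the repository may also handle tabs and carriage returns):

// Treat the ASCII space and newline bytes as whitespace.
fn is_space(b: &u8) -> bool {
    *b == b' ' || *b == b'\n'
}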

The next change is retiring todo!("empty todo") with a proper implementation. The lexer will need to run through the entire input in some fashion, either iteratively or recursively. An iterator will not fit our use case, so we will employ recursion: extract_token_from_alphabet will call itself recursively, moving the cursor along the input string.

// advance the cursor, then recurse until a full token can be extracted
self.move_current_position_and_read();
self.extract_token_from_alphabet()

With those changes, we are now ready to run our tests. Try rerunning the test.

$ cargo test lexer::lexer_tests::should_read_equal_int_typing

Nice!!! It passes as expected.

Finalize typing tokenization verification

Testing the rest of the typing information follows the same pattern; a sketch is shown below, and the full implementation is in the repository.
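A hedged sketch of two of the remaining tests (the STRINGTYPE and BOOLEANTYPE identifier kinds are assumptions modeled on INTTYPE; confirm the exact names in tokens.rs):

#[test]
fn should_read_equal_string_typing() {
    let mut lx = Lexer::new(String::from("@string"));
    assert_eq!(
        lx.read_tokens().unwrap().token_type,
        // STRINGTYPE is assumed by analogy with INTTYPE
        tokens::IdentifierKind::STRINGTYPE
    );
}

#[test]
fn should_read_equal_bool_typing() {
    let mut lx = Lexer::new(String::from("@bool"));
    assert_eq!(
        lx.read_tokens().unwrap().token_type,
        // BOOLEANTYPE is assumed by analogy with INTTYPE
        tokens::IdentifierKind::BOOLEANTYPE
    );
}

We can then run the whole test module: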

$ cargo test lexer::lexer_tests

We have now verified that typings tokenization works correctly. On to the next one.

Overview of Karis keywords

Karis has a limited set of keywords.

| Keyword | IdentifierKind |
| ------- | -------------- |
| let     | LET            |
| true    | TRUE           |
| false   | FALSE          |
| if      | IF             |
| else    | ELSE           |
| return  | RETURN         |

fn, which describes a function, is also a keyword. However, in this lesson, we will limit our scope to the keywords above.

Tokenization implementation

We pick up from where we left off, which looks like the snippet below.

let tok = match ident {
    tokens::INT => Ok(tokens::Token::new(
                    tokens::IdentifierKind::INTTYPE,
                    String::from(ident),
                    self.line_number,
                    self.position,
                    )),
    /* trimmed for readability */
    _ => todo!("todo")
};

We focus on retiring the todo on Line 9 with valid match cases for each keyword. Tokenizing keywords follows the same logic: first, we match the identifier string against the semantics of a particular token; if the match succeeds, we return the token representation. The thing to pay attention to is the identifier string, which is a slice of the input string starting at the identifier_start_read_position and ending at the read_position.

Recall: identifier_start_read_position indicates the starting point from which an identifier is being read.

Recall: read_position points to the next character in the input string after the current character.

We visualize the input string as an array and slice it to retrieve the items contained between two bounds. In this case, what we get is a &str. We then match this &str against keyword identifiers. Let’s demonstrate this.

match self.input.get(self.identifier_start_read_position as usize..self.read_position) {
    Some(ident) => {
        let tok = match ident {
            tokens::LET => Ok(tokens::Token::new(
                tokens::IdentifierKind::LET,
                String::from(ident),
                self.line_number,
                self.position,
            )),
            //** add more match patterns here **/
            _ => todo!("todo"),
        };
        tok
    }
    None => self.unknown_token_error(tokens::NULL),
}

In Line 1, we slice the input string, retrieving entries between the specified bounds. In Line 3, we assign the match expression to a variable tok whose value will be evaluated inside the match body. In Line 4, we check whether ident matches LET and then return a result. That’s all that is involved in keyword tokenization. We need to add similar match patterns for the rest of the keywords, as sketched below.
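A sketch of the remaining keyword arms, each following the LET arm one-for-one (the token constants and identifier kinds follow the keyword table above; confirm the exact names in tokens.rs):

tokens::TRUE => Ok(tokens::Token::new(
    tokens::IdentifierKind::TRUE,
    String::from(ident),
    self.line_number,
    self.position,
)),
tokens::FALSE => Ok(tokens::Token::new(
    tokens::IdentifierKind::FALSE,
    String::from(ident),
    self.line_number,
    self.position,
)),
tokens::IF => Ok(tokens::Token::new(
    tokens::IdentifierKind::IF,
    String::from(ident),
    self.line_number,
    self.position,
)),
tokens::ELSE => Ok(tokens::Token::new(
    tokens::IdentifierKind::ELSE,
    String::from(ident),
    self.line_number,
    self.position,
)),
tokens::RETURN => Ok(tokens::Token::new(
    tokens::IdentifierKind::RETURN,
    String::from(ident),
    self.line_number,
    self.position,
)),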

Overview of the verification procedure

The manner to verify keyword tokenization is similar for each keyword. Let’s add a simple test to demonstrate.

#[test]
fn should_read_let_keyword() {
    let mut lx = Lexer::new(String::from("let"));
    assert_eq!(
        lx.read_tokens().unwrap().token_type,
        tokens::IdentifierKind::LET
    );
}

As you can see, the test is somewhat self-explanatory. We initialize the lexer with an input text containing “let”, then call read_tokens and assert its result. Quite easy, right? We can now run it.

$ cargo test lexer::lexer_tests::should_read_let_keyword

It works.

Edge cases verification

Let’s switch gears a little bit. We now restructure the input text to mimic a programming language syntax. We will feed the lexer input that looks like this:

if {
    return;
}else{
    return;
}

The intent is to be sure that our lexer works with input of any structure and correctly produces tokens for keywords. Consider this a stress test of the lexer. Let’s add the test.

#[test]
fn should_read_if_else_return_keywords() {
    let mut lx = Lexer::new(String::from("
        if {
            return;
        }else{
            return;
        }
    "));
    assert_eq!(
        lx.read_tokens().unwrap().token_type,
        tokens::IdentifierKind::IF
    );
    /* trimmed for readability */
    assert_eq!(
        lx.read_tokens().unwrap().token_type,
        tokens::IdentifierKind::RBRACE
    );
}
This does not mean that if we feed the lexer input beyond what it expects, it will always work. Let’s demonstrate this with a concrete example. Consider the Golang program below.

package main

import "fmt"

func main() {

    if 7%2 == 0 {
        fmt.Println("7 is even")
    } else {
        fmt.Println("7 is odd")
    }

    if 8%4 == 0 {
        fmt.Println("8 is divisible by 4")
    }

    if num := 9; num < 0 {
        fmt.Println(num, "is negative")
    } else if num < 10 {
        fmt.Println(num, "has 1 digit")
    } else {
        fmt.Println(num, "has multiple digits")
    }
}

We can see there are many differences in syntax compared to Karis. However, there are similarities in how conditional statements and logical conditions are formed. When we pass this snippet to our lexer, it will only succeed in identifying what it knows or assumes to be correct. For instance, package is interpreted as a VARIABLE, and the same applies to main and import. On the other hand, "fmt" is interpreted as a string literal.
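If we wanted to turn this observation into a quick check, a hedged test might look like the one below. The VARIABLE identifier kind is an assumption based on the description above; the actual kind name lives in tokens.rs.

#[test]
fn should_lex_unknown_keywords_as_variables() {
    // `package` is not a Karis keyword, so the lexer can only fall back
    // to treating it as a plain identifier (described above as VARIABLE).
    let mut lx = Lexer::new(String::from("package main"));
    assert_eq!(
        lx.read_tokens().unwrap().token_type,
        tokens::IdentifierKind::VARIABLE
    );
}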
