November 1, 2022

Crafting a compiler in Rust : Chapter 5

Have you ever wondered how programming languages like Python, Java or Ruby work? Do you have an interest in building your own language? If so, this series of posts is for you. We dive into building a novel programming language from first principles. Get ready to get your hands dirty. It’s going to be a fun ride. Access the final product here: https://github.com/daviddexter/karis-programming-language

Integer literals

Karis represents integers in a deliberately simple way: it limits its scope to numbers without decimal points. An entry like 100.00 is an invalid value in Karis. Let’s add a mechanism to tokenize integers in the lexer.rs file. Earlier, we introduced the is_integers_only helper function. The function takes a character or sequence of characters and returns whether the sequence is an integer or not. We therefore inject it into lexer.rs by adding the code below.

if is_integers_only(ident) {
    // every character in the sequence is a digit; emit an integer literal token
    Ok(tokens::Token::new(
        tokens::IdentifierKind::INTLITERAL,
        ident_owned,
        self.line_number,
        self.position,
    ))
}

The condition asserts the presence of an integer. Let’s demonstrate its internal working visually. Consider a value like 2290. The lexer examines each element of the value.

2  2  9  0

For each element, it checks whether it is a digit. Only if every element passes the check does it report the entire sequence as an integer. Consider a value like 22B3.

2:digit  2:digit  B:not digit  3:digit
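Because B fails the digit check, 22B3 is rejected as an integer. For reference, the helper itself can be as small as a single iterator chain. Below is a minimal sketch of what it might look like; the actual implementation in the repository may differ in detail.

// Minimal sketch: true only when every character in the sequence is an ASCII digit.
fn is_integers_only(ident: &str) -> bool {
    !ident.is_empty() && ident.chars().all(|c| c.is_ascii_digit())
}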

Notice that is_integers_only takes its ident argument as a &str, while Token::new expects an owned String. In Rust, &str and String are different: a &str is a borrowed reference that can be passed from one function to another so long as its “lifetime” is guaranteed, whereas a String must have exactly one owner at a time. We therefore need to convert ident to an owned variant by creating an owned copy with String::from.

let ident_owned = String::from(ident);
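To make the borrow/own distinction concrete, here is a tiny standalone illustration (the variable names are ours, purely for demonstration):

// `borrowed` is a view into string data owned elsewhere; it cannot be stored
// in a token that must own its text.
let borrowed: &str = "2290";
// `owned` is an independently owned copy that the token can keep.
let owned: String = String::from(borrowed);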

Back in our lexer, we pass ident_owned, created from this conversion, when constructing the token. That’s it: integer literal tokenization is done.

Variable names and string literals

Strings are more interesting. Let’s examine the code below.

let name @string = "Karis";

Excluding let, @string, = and ;, both name and "Karis" are represented as strings. In our context, however, one is a variable name, while the other is a literal. Let’s look at another example.

let add @int = fn(x @int, y @int){
    return x + y;
};

add(1,1);

add in Line 5 is a function call. From a textual perspective, it is just a string. The key to identifying it as a function call is the adjacent parentheses. A function call can have no arguments, a single argument or multiple arguments.

add();
add(1,1);
add(1,2,3,4,5,);

The above are all valid calls to a function. With all of the above in context, we face an interesting set of problems. We need a way to distinguish:

  1. String literal
  2. Function call
  3. Variable names

String literal

We’ll begin with string literals. A string literal, in essence, starts and ends with a quotation mark. This signals where a string literal begins and ends. We’ll therefore introduce a condition that checks for the presence of an opening quotation mark, amending extract_token_from_alphabet.

if is_quotation_mark(current_literal) {
    let string_literal_ending = self.traverse_to_closing_quotations(self.position + 0x1);

    // slice out the text between the opening and closing quotation marks
    let string_literal = self
        .input
        .get(self.position + 0x1..string_literal_ending)
        .unwrap();

    // reset so the next iteration does not assume an identifier is in progress
    self.identifier_start_read_position = -0x1;
    // move the cursor past the closing quotation mark
    self.position = string_literal_ending;
    self.read_position = string_literal_ending + 0x1;

    Ok(tokens::Token::new(
        tokens::IdentifierKind::STRINGLITERAL,
        String::from(string_literal),
        self.line_number,
        self.position,
    ))
}

In Line 1, we hand is_quotation_mark the current literal in focus and check whether it is a quotation mark. If it is, the next thing we need, in Line 2, is to recursively move along the input string until we find the closing quotation mark. This is the most crucial bit of our logic: the recursive call returns the index where the closing quotation mark is located. Consider the input below.

":0  a:1   l:2   e:3   x:4  ":5

Once we have the index of the closing quotation mark, we slice the input string to retrieve the literal between the quotation marks. We then reset identifier_start_read_position, so the next iteration will not assume it is working on an alphanumeric entry. Finally, we move the cursor past the closing quotation mark and return the string literal token.
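For reference, the traversal itself can be a small recursive method on the lexer. The sketch below is our approximation, assuming the lexer keeps its source text in self.input and reuses the is_quotation_mark helper; the version in the repository may differ.

// Walk forward from `begin` until a quotation mark is found, returning its index.
fn traverse_to_closing_quotations(&self, begin: usize) -> usize {
    let literal = self.input.get(begin..begin + 0x1).unwrap();
    if is_quotation_mark(literal) {
        begin
    } else {
        self.traverse_to_closing_quotations(begin + 0x1)
    }
}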

Function call and variable names

Handling function calls and variable names uses the same logic. We expand the previous if statement by adding a check for alphanumeric characters. Luckily, we already have what we need in place: the is_alphanumeric_only helper function. It works similarly to is_integers_only, with the only difference being that it checks for both digits and letters.

Consider a value like 22B3.

2:digit  2:digit  B:alphabet  3:digit

For the above, is_alphanumeric_only will assert to true.
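As with is_integers_only, the helper can be a one-line check. A minimal sketch (again, the repository’s version may differ):

// Minimal sketch: true when every character is an ASCII digit or letter.
fn is_alphanumeric_only(ident: &str) -> bool {
    !ident.is_empty() && ident.chars().all(|c| c.is_ascii_alphanumeric())
}

We inject it as an additional branch: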

else if is_alphanumeric_only(ident) {
    /* add more checks */
}

Inside the body of this branch, we will add a condition that checks the next character. Why? The only difference between a function call and a variable name is that a function call has an opening parenthesis immediately following it.

if next_char == tokens::LPAREN {
    Ok(tokens::Token::new(
        tokens::IdentifierKind::CALLER,
        ident_owned,
        self.line_number,
        self.position,
    ))
} else {
    Ok(tokens::Token::new(
        tokens::IdentifierKind::VARIABLE,
        ident_owned,
        self.line_number,
        self.position,
    ))
}

In Line 1, we check whether the next character in the input string is an LPAREN. If it is, we return the new token as a CALLER; otherwise, as a VARIABLE.
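How do we obtain next_char? One possibility is to peek one character ahead of the cursor without advancing it, falling back to tokens::NULL at the end of input. This is an assumed shape, not necessarily the repository’s code; later in this chapter we will meet a next_char_fn closure that performs the same kind of lookahead.

// Peek at the character immediately ahead without moving the cursor.
let next_char = self
    .input
    .get(self.read_position..self.read_position + 0x1)
    .unwrap_or(tokens::NULL);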

We can now handle valid Karis syntax. The expectation is that, given a source text, the lexer will recognize:

  1. LET binding
  2. Typing information, for instance, @int, @string
  3. Operators, for instance, +, -, *
  4. Variable names
  5. Integer literals
  6. String literals

We will introduce more test cases that will validate our expectations.

#[test]
fn should_read_equal_token5a() {
    let mut lx = Lexer::new(String::from("let a = 1 + 2;"));
}

We define a test named should_read_equal_token5a. In Line 3, we bootstrap the lexer and use let a = 1 + 2; as its input. By examining the input, we can intuitively deduce what the lexer should return on each call of read_tokens.

let     a   =    1   +   2   ;

The assertion statements are straightforward. We call the read_tokens method repeatedly and assert that each returned token matches what we expect, starting with LET.

assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::LET);
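The remaining assertions follow the same pattern, one call per token in order. The kind names below (VARIABLE, ASSIGN, INTLITERAL, PLUS, SEMICOLON) are assumed to match those defined in tokens.rs.

assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::VARIABLE);
assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::ASSIGN);
assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::INTLITERAL);
assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::PLUS);
assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::INTLITERAL);
assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::SEMICOLON);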

Let’s do another similar test. In place of the + operator, we’ll switch to the - operator.

#[test]
fn should_read_equal_token5b() {
    let mut lx = Lexer::new(String::from("let a = 1 - 2;"));
    /* trimmed for readability */
}

We expect the lexer to identify the - symbol and return the appropriate token. We can now be confident that the current lexer implementation correctly identifies and tokenizes arithmetic operators.

Verify string literal tokenization

Next, we will assert string literals. We introduce another test with a string input.

#[test]
fn should_read_equal_token7() {
    let mut lx = Lexer::new(String::from("let name = \"alice\"; "));
}

Much is shared with the previous test. In this test, however, we expect the lexer to correctly identify "alice" as a string literal, not as a variable or anything else.

#[test]
fn should_read_equal_token7() {
    let mut lx = Lexer::new(String::from("let name = \"alice\"; "));
    /* trimmed for readability */
    assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::STRINGLITERAL);
    /* trimmed for readability */
}

Line 5 is the core of the test, where it asserts the presence of a string literal. If this assertion fails, we must examine our implementation and find the problem. We have one particular edge case that we need to verify: string interpolation. Consider a string like "My name is #name". In this case, #name will be replaced by an actual value. Such interpolation will be executed in built-in functions like print and format. Let’s now add a test to verify string interpolation.

#[test]
fn should_read_equal_token8() {
    let mut lx = Lexer::new(String::from("print(\"Name #name\");"));
    assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::PRINT);
    assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::LPAREN);
    assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::STRINGLITERAL);
    assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::RPAREN);
    assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::SEMICOLON);
}

In Line 3, we give the lexer an input of the form print(\"Name #name\");. We can deduce that the first token to be identified is PRINT. The interesting part is the Name #name section. We expect the lexer to return a string literal; otherwise, an error would occur because the # character is not alphabetic. We can now claim our lexer tokenizes literals.

Overview of function tokenization

A function definition in Karis begins with a binding to the fn keyword. Like other programming languages, a function in Karis accepts arguments for its internal logic. Every function in Karis must return something of a particular type. This return type is annotated on the binding of the function. For example, a function that adds two integers should return the sum as type @int. Karis provides the @unit type if a function has nothing to return. See the example below:

let printer @unit = fn(name @string){
           // ^ means we discard the result of the function
    print("Name #name")
};

In the above, the printer function prints the name to standard output. However, since Karis demands an explicit return type for each function, the @unit type satisfies the requirement. We have two categories of functions to handle: built-in and user-defined functions.

Format built-in function

Currently, we have one built-in function, print. Let’s add another: format. This built-in function evaluates a literal string containing placeholders and replaces the placeholders with corresponding values. The format function will be called with the following syntax:

let echo @string = fn(name @string){
    format("#name")
};

Tokenizing this function is quite simple. All that is needed is to assert that the current sequence of characters matches format, then return the token.

tokens::FORMAT => Ok(tokens::Token::new(
    tokens::IdentifierKind::FORMAT,
    String::from(ident),
    self.line_number,
    self.position,
)),

Custom functions

As mentioned above, the fn keyword marks the beginning of a function definition. The task of tokenizing therefore boils down to this: when the current character in focus is f, we check whether the following character is n. We will define a closure inside the body of the extract_token_from_alphabet function.

let is_func_identifier_fn = |next_char: &str| {
    let current_char = self
        .input
        .get(self.position..self.read_position)
        .unwrap()
        .as_bytes()
        .first()
        .unwrap();

    let next_two_step_char = next_char_fn(self.position + 0x2, self.read_position + 0x2, tokens::NULL);

    (*current_char == 0x46 || *current_char == 0x66)
        && (*next_char.as_bytes().first().unwrap() == 0x6e
            || *next_char.as_bytes().first().unwrap() == 0x4e)
        && (next_two_step_char == tokens::LPAREN)
};

A lot is going on here, so let’s simplify it. The is_func_identifier_fn closure receives an argument next_char, which we expect to be an n character. In Line 2, we retrieve the current character; in this context, we expect it to be an f character. In Line 10, we retrieve the character two steps forward from the current position; in this context, we expect it to be an LPAREN. Line 12 to Line 15 is where the “magic” happens: we compare the characters we have just retrieved against their Unicode representations.

Unicode is a universal standard for encoding character symbols.

| Character | Unicode in Hexadecimal |
| --------- | ---------------------- |
| f         | 66                     |
| F         | 46                     |
| n         | 6E                     |
| N         | 4E                     |
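A quick way to convince yourself of these values is with Rust’s byte literals:

// Each assertion holds; byte literals expose the underlying value directly.
assert_eq!(b'f', 0x66);
assert_eq!(b'F', 0x46);
assert_eq!(b'n', 0x6E);
assert_eq!(b'N', 0x4E);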

Line 12 holds the first condition, where we check whether we have a lowercase f or an uppercase F. Lines 13 and 14 hold the second condition, where we check whether we have a lowercase n or an uppercase N. Line 15 is the final condition, where we check for the presence of a left parenthesis. If all three conditions are true, we have encountered a custom function definition. To finish, we introduce the closure into our main logic flow.

else if is_func_identifier_fn(next_char) {
    let ident = self
        .input
        .get(self.identifier_start_read_position as usize..self.read_position + 1)
        .unwrap();

    if ident == tokens::FUNCTION {
        self.identifier_start_read_position = -0x1;
        self.position += 1;
        self.read_position += 1;
        Ok(tokens::Token::new(
            tokens::IdentifierKind::FUNCTION,
            String::from(ident),
            self.line_number,
            self.position,
        ))
    } else {
        self.unknown_token_error(ident)
    }
}

With the above, when is_func_identifier_fn asserts the beginning of a function, we return a token of kind FUNCTION; otherwise, we return an error. It should be noted that the error branch exists only to satisfy the compiler’s requirement that every path returns a value; in practice, it should never run.

Generic function tokenization verification

Let’s continue to expand our lexer_tests test suite. The first test will verify the syntax below.

fn() {}

After tokenization, we want the lexer to return a FUNCTION, an LPAREN, an RPAREN, ending with LBRACE and RBRACE. That is the expectation.

#[test]
fn should_read_func_0() {
    let mut lx = Lexer::new(String::from("fn() {} "));
    /* trimmed for readability */
}

When we run the test, it passes just as we expect. The test validates that the lexer can correctly identify the building blocks of a proper Karis function.

Edge case verification

Let’s shake things up a little. We’ll define another test that uses func() {} as its input. From the syntax, we can tell that func is not the function keyword in Karis. At first glance, we might assume that the lexer throws an error. Not quite, and here is why: once a function is declared, we intend to “call” it somewhere in the source text. An intent to call a function is suffixed with (), similar to how other programming languages call functions. In this case, func is identified as a CALLER and not a FUNCTION. Let’s demonstrate this.

#[test]
fn should_read_func_1() {
    let mut lx = Lexer::new(String::from("func() {} "));
    assert_eq!(
            lx.read_tokens().unwrap().token_type,
            tokens::IdentifierKind::FUNCTION
    );
}

The test fails. The test logs tell us that the lexer produced a CALLER while we asserted a FUNCTION; in this scenario, it is our assertion that is incorrect.

failures:

---- lexer::lexer_tests::should_read_func_1 stdout ----
thread 'lexer::lexer_tests::should_read_func_1' panicked at 'assertion failed: `(left == right)`
  left: `CALLER`,
 right: `FUNCTION`', lexer/src/lexer.rs:1363:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

failures:
    lexer::lexer_tests::should_read_func_1

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 32 filtered out; finished in 0.00s

Let’s now correct the assertion so that it expects a CALLER, completing the test. This time, the assertion matches what the lexer produces. We’ll now add one more test that receives input matching the correct function-definition syntax.

Function with arguments tokenization verification

Let’s now verify that the lexer can tokenize the syntax below.

let add @int = fn(x @int, y @int) { return x + y; }

We see that the function add takes two arguments, x and y, each of type @int, and returns the summation of the integers. The expectation is that the lexer will tokenize the function and its arguments.

let   add   @int  =  fn ( x @int , y @int ) { return x + y ; }

The test follows the same pattern as all other tests we have written.

#[test]
fn should_read_func_3() {
    let mut lx = Lexer::new(String::from("let add @int = fn(x @int, y @int) { return x + y; }"));
    /* trimmed for readability */
    assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::FUNCTION);
    assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::LPAREN);
    assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::VARIABLE);
    assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::INTTYPE);
    assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::COMMA);
    assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::VARIABLE);
    assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::INTTYPE);
    assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::RPAREN);
    /* trimmed for readability */
}

Line 5 to Line 12 are the focal points of the test. Every assertion validates an individual building block of the function, in the sequence the lexer expects them to be ordered.

Non-existent tokens

Our lexer can handle the tokens we have specified. But what if we pass it syntax containing a token we have not defined? Ideally, we should not proceed with tokenization. We have two options:

  1. Throw the error and quit the program entirely
  2. Pass the error to the caller

For our purposes, we will choose option two, because the lexer is part of a larger program. We want each part of the program to propagate errors to its caller instead of crashing; this lets us handle the error properly and produce better error messages. Rust already gives us the means to propagate such errors: in the definition of read_tokens, we return Result<tokens::Token, errors::KarisError>, and a caller can use Rust’s ? operator to propagate the error upwards. We will see more of this in action.
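To illustrate, a hypothetical caller (the tokenize_all name and the EOF terminating kind are our assumptions, not part of the lexer as shown so far) could drain the lexer and let ? forward any failure upwards:

// Collect every token, stopping at end of input; a lexing error is propagated
// to our caller via `?` instead of crashing the program.
fn tokenize_all(input: String) -> Result<Vec<tokens::Token>, errors::KarisError> {
    let mut lx = Lexer::new(input);
    let mut collected = Vec::new();
    loop {
        let token = lx.read_tokens()?;
        let at_end = token.token_type == tokens::IdentifierKind::EOF;
        collected.push(token);
        if at_end {
            break;
        }
    }
    Ok(collected)
}

Let’s now add a test to tokenize ~. We expect the lexer to return an error.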

#[test]
fn should_return_err() {
    let mut lx = Lexer::new(String::from("~"));
    assert!(lx.read_tokens().is_err()); // we want an error
}

We run the test with:

cargo test lexer::lexer_tests::should_return_err

It works as expected. Let’s add another one for %.

#[test]
fn should_return_err_2() {
    let mut lx = Lexer::new(String::from("%"));
    assert!(lx.read_tokens().is_err()); // we want an error
}

We expect an error similar to the previous test. However, is that the correct behaviour? No. In computer arithmetic, % is interpreted as modulus: given operands on either side, it returns the remainder of their division. At the minimum, Karis ought to support this behaviour, yet our token definitions are missing a % entry. Let’s add it. We begin in tokens.rs by adding:

pub const MODULUS: &str = "%";

Next, in the IdentifierKind enum, we add:

MODULUS,  // "%"

Finally, we introduce a check for the new token in lexer.rs.

tokens::MODULUS => Ok(tokens::Token::new(
    tokens::IdentifierKind::MODULUS,
    ch_owned,
    self.line_number,
    self.position,
)),

This time, when we run the test, it fails because we expect an error while the lexer successfully tokenizes %. Let’s modify the test to suit its true intent and run it again.

#[test]
fn should_read_modulus() {
    let mut lx = Lexer::new(String::from("%"));
    assert!(lx.read_tokens().is_ok());
}

Multi-line syntax

We’ll move on to verifying that the lexer can correctly work on multi-line input. As with other programming languages, we write programs in text files, expecting the language interpreter or compiler to consume those files and run the program. The same is true for Karis. It is therefore prudent to have a lexer that can consume characters served from the file system.

Generic multi-line syntax

We’ll start with something simple: a syntax with only variable declarations.

let sum @int = 1 + 2;

let person_age @int = 25;

let is_active @bool = true;

let is_live @bool = false;

We see that there is a lot of whitespace. That’s OK; it’s the very essence of supporting multi-line syntax. In our lexer, whitespace is just a separator. If the lexer is implemented correctly, it should tokenize the above syntax without a hiccup. Add the test below:

#[test]
fn should_read_multiline0() {
    let mut lx = Lexer::new(String::from(
        "
        let sum @int = 1 + 2;

        let person_age @int = 25;

        let is_active @bool = true;

        let is_live @bool = false;

        ",
    ));
    /* trimmed for readability */
}

Multi-line syntax : function definition and call

We’ll now verify the lexer with input that resembles a program that might actually be written in Karis. It’s simple: we define a function add that takes two arguments of type @int and returns their sum, then we call the function. The syntax looks like the one below:

let add @int = fn(x @int, y @int){
    return x + y;
};

add(1,3);

Similar to the previous attempt, there is a lot of whitespace. Add the test below.

#[test]
fn should_read_multiline1() {
    let mut lx = Lexer::new(String::from(
        "
        let add @int = fn(x @int, y @int){
            return x + y;
        };

        add(1,3);

        ",
    ));
    /* trimmed for readability */
}

Multi-line syntax : function definition with conditional statements

Next is a relatively similar function with conditional statements.

#[test]
fn should_read_multiline2() {
    let mut lx = Lexer::new(String::from(
        "
        let greater @int = fn(x @int, y @int) {
            if x > y {
                return x;
            }
            return y;
        };
        ",
    ));
    /* trimmed for readability */
}

Multi-line syntax : program entry point

Here we need to verify that a program’s entry point is tokenized correctly. Recall that Karis programs are executable by default; they therefore need an entry point that collects all definitions within it before the execution process begins. A sample program (mirroring the input of the test that follows) looks like the one below.
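@main fn(){
    let x @int = 5;
    let name @string = "Karis";
}@end;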

We expect the lexer to correctly know where a program begins and ends and recognize identifiable entities. Let’s add a test for it.

#[test]
fn should_read_multiline_main() {
    let mut lx = Lexer::new(String::from(
            "
        @main fn(){
            let x @int = 5;
            let name @string = \"Karis\";

        }@end;

        ",
    ));
    /* trimmed for readability */
}

We are now done with all tests. At this point, we can confidently state that the lexer is complete. But nothing is ever really done in software development. At some point, we’ll revisit the lexer. Not to change it but to extend it.

The console crate setup

We introduce a new program called console. The name is intentional, as it will contain APIs to interact with the compiler. The end user will be able to issue commands to the console to run workflows in the compilation process. The console program will be a new crate in our workspace. We create it using the cargo tool that ships with Rust.

cargo new console

The command creates the new crate. The output indicates that we need to add the newly created crate to the current workspace; otherwise it won’t be able to reuse the Cargo.lock file in the root of the workspace. To add console to the workspace, we amend the Cargo.toml in the root of the workspace by appending console to the members list.

[workspace]
members = ["lexer","errors", "console"]

While the name console fits our purpose by making it easy to think about what we are creating, what we need is a name that the end user can associate with the Karis programming language and invoke at any time. Let’s demonstrate this notion with the Python programming language: it ships with a python command that can be typed into the terminal to start interacting with the language. That is what we need for Karis. To accomplish this, we need to change the Cargo.toml inside the console directory. Let’s open the file and add the following entry:

[[bin]]
name = "karis"
path = "src/main.rs"

The entry tells the Rust compiler that the binary of the console program should be addressed as karis.

Let’s verify the changes work correctly by running the following command:

cargo install --path . && karis

The command compiles, installs and runs the program. The output Hello, world! indicates that the program can now be referenced with the name we desire. Let’s proceed to introduce dependencies for the console crate. Open Cargo.toml in the console directory and add the following entries under the dependencies section:

clap = { version = "3.2.12", features = ["cargo"] }
lexer = { path = "../lexer" }
errors = { path = "../errors" }

We have already encountered the last two entries. The first entry, clap, allows us to develop a command-line application, which is exactly what we need. For our current use case, we will build a Read-Lexer-Print-Loop application. As the name indicates, the application receives input, performs lexing operations on it and prints the tokens to the terminal.

let matches = Command::new(KARIS_WELCOME_MESSAGE)
    .version("v0.1.0")
    .propagate_version(true)
    .subcommand_required(true)
    .arg_required_else_help(true)
    .subcommand(
        Command::new("rlpl")
            .about("Read-Lexer-Print-Loop extracts tokens and prints on stdout")
            .arg_required_else_help(true)
            .arg(
                Arg::new("interactive")
                    .short('i')
                    .long("interactive")
                    .action(ArgAction::SetTrue)
                    .required(false),
            )
            .arg(arg!(-p --filepath <PATH>).required(false)),
    )
    .get_matches();

In Line 1, we bootstrap the application by instantiating the Command type from clap, which gives us the flexibility to add custom attributes. In Line 2, we state the version of the application as v0.1.0. In Line 6, we introduce a subcommand that the application accepts from the user; in this instance, the subcommand is called rlpl. We want the subcommand to receive arguments that direct how its operation runs. In Line 10 to Line 17, we indicate that the subcommand should expect either the interactive or the filepath argument. The commands below demonstrate how the program may be run:

karis rlpl --interactive
karis rlpl --filepath <path/to/karis/program/file>

Next, we add an intercept implementation that listens for the subcommand and executes it appropriately.

match matches.subcommand() {
    Some(("rlpl", sub_matches)) => {
        let interactive = sub_matches.get_one::<bool>("interactive");
        let file_path = sub_matches.get_one::<String>("filepath");

        if let Some(i) = interactive {
            if *i {
                return lexer_interactive();
            }
        }

        if let Some(file_path) = file_path {
            return lexer_from_file(file_path);
        }
        Ok(())
    }

    _ => {
        println!("Nothing to do");
        Ok(())
    }
}

From Line 6 to Line 10, when the interactive argument is provided, we invoke the lexer_interactive function. By “interactive”, we mean the user who calls the subcommand can invoke more than one command in a single session. The equivalent is true when the filepath argument is provided, as seen from Line 12 to Line 14. Let’s go further into the interactive function.

fn lexer_interactive() -> Result<(), KarisError> {
    println!("{}\n", KARIS_INTERACTIVE_MESSAGE);
    println!("Copy-Paste or type your Karis program after the prompt\n",);
    println!("Type :exit to close the console\n",);

    let mut input = String::new();

    loop {
        print!("{} ", PROMPT);
        // flush stdout so the prompt appears before read_line blocks below
        // (requires `use std::io::Write;` at the top of the file)
        io::stdout().flush()?;

        io::stdin().read_line(&mut input)?;

        let text = input.trim();
        if text.is_empty() {
            println!("Nothing to scan \n");
        } else {
            if text == EXIT {
                println!("Closing interactive console. Catch you later :) \n");
                process::exit(0)
            }

            let mut lx = lex::Lexer::new(String::from(text));
            lx.generate_and_print();
        }

        input.clear();
    }
}

The loop that begins on Line 8 is the core of the implementation. It runs indefinitely, reading input from standard input and handing each line to the lexer. The lexer_from_file function works in a similar manner, with the only exception that it reads from a file on the filesystem.
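A plausible sketch of lexer_from_file (ours, not necessarily the repository’s exact code; it assumes KarisError can be built from std::io::Error, which the ? on read_line above already relies on):

// Read the entire program text from disk, then lex it exactly as the
// interactive path does.
fn lexer_from_file(file_path: &str) -> Result<(), KarisError> {
    let contents = std::fs::read_to_string(file_path)?;
    let mut lx = lex::Lexer::new(contents);
    lx.generate_and_print();
    Ok(())
}

Let’s build and run the rlpl. Run the following command: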

cargo run --bin=karis -- rlpl -p testdata/00/test00.kr

Perfect. It generates and prints tokens as expected.
