November 17, 2022

Crafting a compiler in Rust: Chapter 7

Have you ever wondered how programming languages like Python, Java or Ruby work? Do you have an interest in building your own language? If so, this series of posts is for you. We dive into building a novel programming language from first principles. Get ready to get your hands dirty. It’s going to be a fun ride. Access the final product here: https://github.com/daviddexter/karis-programming-language

Expression function

We start by introducing an associated function of the Parser object called expression. In Rust, an associated function is not attached to an instance of an object; it is synonymous with a static function in object-oriented languages. We use an associated function because we want to call expression recursively without instantiating the Parser object.
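To make the distinction concrete, here is a minimal sketch; the Counter type is hypothetical and exists purely for illustration:

struct Counter {
    n: usize,
}

impl Counter {
    // associated function: no `self` receiver; called as `Counter::new()`
    fn new() -> Counter {
        Counter { n: 0 }
    }

    // method: takes `&mut self`; called on an instance, e.g. `counter.bump()`
    fn bump(&mut self) {
        self.n += 1;
    }
}

With that distinction in mind, below is the signature of expression.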

pub fn expression(rbp: usize, index: usize, bucket: Rc<RefCell<Vec<Token>>>) -> Result<(Objects, usize), errors::KarisError> {
        todo!("implement this")
}

The bucket argument holds all the tokens generated by the lexer. Consider the source text below:

let add @int = fn(x @int, y @int){  return x + y; };

Inside the bucket vector, we expect to find the tokens below:

[
    Token { token_type: LET, literal: "let", line_number: 1, column_number: 2 },
    Token { token_type: VARIABLE, literal: "add", line_number: 1, column_number: 6 },
    Token { token_type: INTTYPE, literal: "@int", line_number: 1, column_number: 11 },
    Token { token_type: ASSIGN, literal: "=", line_number: 1, column_number: 13 },
    Token { token_type: FUNCTION, literal: "fn", line_number: 1, column_number: 16 },
    Token { token_type: LPAREN, literal: "(", line_number: 1, column_number: 17 },
    Token { token_type: VARIABLE, literal: "x", line_number: 1, column_number: 18 },
    Token { token_type: INTTYPE, literal: "@int", line_number: 1, column_number: 23 },
    Token { token_type: COMMA, literal: ",", line_number: 1, column_number: 24 },
    Token { token_type: VARIABLE, literal: "y", line_number: 1, column_number: 26 },
    Token { token_type: INTTYPE, literal: "@int", line_number: 1, column_number: 31 },
    Token { token_type: RPAREN, literal: ")", line_number: 1, column_number: 32 },
    Token { token_type: LBRACE, literal: "{", line_number: 1, column_number: 33 },
    Token { token_type: RETURN, literal: "return", line_number: 1, column_number: 41 },
    Token { token_type: VARIABLE, literal: "x", line_number: 1, column_number: 43 },
    Token { token_type: PLUS, literal: "+", line_number: 1, column_number: 45 },
    Token { token_type: VARIABLE, literal: "y", line_number: 1, column_number: 47 },
    Token { token_type: SEMICOLON, literal: ";", line_number: 1, column_number: 48 },
    Token { token_type: RBRACE, literal: "}", line_number: 1, column_number: 50 },
    Token { token_type: SEMICOLON, literal: ";", line_number: 1, column_number: 51 },
    Token { token_type: EOS, literal: "", line_number: 1, column_number: 52 }
]

Every time we call the expression function, we pass the index of the token that we intend to parse. The rbp argument is the binding power of the token to the right. Using the token bucket above, if we parse the first token, LET, we pass the rbp of the next token, VARIABLE. The actual value of rbp does not matter; it’s up to us as language designers to assign each token an appropriate rbp value.
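For instance, a hypothetical top-level driver could kick off parsing as sketched below; lexer_generated_tokens is an assumed placeholder for the lexer output, and the initial rbp of 0 simply means the first real token will always bind:

// assumed: `lexer_generated_tokens()` yields the Vec<Token> shown above
let tokens: Vec<Token> = lexer_generated_tokens();
let bucket = Rc::new(RefCell::new(tokens));

// begin at index 0 with an rbp of 0 so parsing never stops prematurely
let (object, last_index) = Parser::expression(0, 0, bucket.clone())?;

Let’s now add some logic inside the expression function.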

let borrow = bucket.borrow();
let token = &borrow[index];

if token.token_type == IdentifierKind::EOS {
    return Ok((Objects::TyConsumable, index));
}

In Lines 1 and 2 we extract the token in focus, and in Line 4 we check whether it is of type EOS. We do this first because the EOS token has no meaning to the abstract syntax tree. Let’s take a step back to understand the core of the Pratt parser. Instead of using a dictionary of vocabulary, the Pratt parser allows us to inject syntax logic as part of the parsing process. Each token belongs to one of two class categories: NUD (null denotation), for tokens that have no left context, and LED (left denotation), for tokens that operate on an expression to their left.

This means that for each kind of token, we need to indicate beforehand whether it is NUD or LED. We achieve this by using a HashMap. We choose a hash map because it allows us to access values using a token as a key, and since we have a finite set of tokens, performance is not a concern. Let’s first introduce a function type for NUD and LED operations.

type NudParserOp =
    fn(Token, usize, Rc<RefCell<Vec<Token>>>) -> Result<(Objects, usize), errors::KarisError>;

type LedParserOp =
    fn(Objects, usize, Rc<RefCell<Vec<Token>>>) -> Result<(Objects, usize), errors::KarisError>;

Next, we define a ParserType type that specifies a unique NudParserOp and/or LedParserOp and a binding_power for a specific token. This type will be used internally by the parser to initiate the appropriate parsing operation.

#[derive(Debug, Default, Clone)]
pub struct ParserType {
    pub nud_fn: Option<NudParserOp>,

    pub led_fn: Option<LedParserOp>,

    pub binding_power: Option<usize>,
}

We now introduce a TokenRegistry, a hash map from a token identifier to its corresponding ParserType.

#[derive(Debug, Default, Clone)]
pub struct TokenRegistry {
    pub registry: HashMap<IdentifierKind, ParserType>,
}

impl TokenRegistry {
    pub fn new() -> TokenRegistry {
        let mut tg = TokenRegistry::default();
        tg.get_token_registry();
        tg
    }

    fn get_token_registry(&mut self) {
        self.consumable(IdentifierKind::SEMICOLON);
        self.consumable(IdentifierKind::EOS);
        self.consumable(IdentifierKind::EOF);
        /* trimmed for readability */
    }

    fn register(&mut self, symbol: IdentifierKind, obj: ParserType) {
        if let Some(val) = self.registry.get_mut(&symbol) {
            if let Some(f) = obj.nud_fn {
                val.nud_fn = Some(f);
            }

            if let Some(f) = obj.led_fn {
                val.led_fn = Some(f);
            }

            if let Some(bp) = obj.binding_power {
                val.binding_power = Some(bp);
            }
        } else {
            self.registry.insert(symbol, obj);
        }
    }
}

In Line 8 we initialize the registry with default values, then call the get_token_registry function, which sets up the appropriate ParserType for each token. With the registry ready, we can now return to complete the expression function. We add a closure that takes a token kind, fetches its ParserType and returns it.

let parser_type_fn = |s: IdentifierKind| -> Result<ParserType, errors::KarisError> {
    let rg = TokenRegistry::new();
    let parser_type = rg.retrieve_from_registry(s);
    if parser_type.is_none() {
        return Err(errors::KarisError {
            error_type: errors::KarisErrorType::UnknownToken,
            message: format!(
                "[UNKNOWN TOKEN] Token {:?} Ln {} Col {}",
                token.literal, token.line_number, token.column_number
            ),
        });
    }

    let pt = parser_type.unwrap().clone();
    Ok(pt)
};

Once we have the appropriate ParserType, we call both the nud and led functions for the token. Below, pt0 is the ParserType of the current token, fetched using the closure we just defined.

let (mut left, mut worked_on_index) = match pt0.nud_fn {
    Some(func) => {
        let res = func(token.clone(), index, bucket.clone())?;
        (res.0, res.1)
    }
    None => (Objects::TyUnknown, 0x00),
};

In Line 1 we match on whether the current token has a nud function. If present, we call it and return its result as the left-hand side expression. Otherwise we return an unknown object. With the left-hand side at hand, we now move to the led function.

if let Some(next_token) = bucket.borrow().get(worked_on_index + 0x01) {
    let pt1 = parser_type_fn(next_token.token_type)?;
    if let Some(bp) = pt1.binding_power {
        if rbp < bp {
            if let Some(func) = pt1.led_fn {
                let idx = worked_on_index + 0x01;
                let res = func(left, idx, bucket.clone())?;
                left = res.0;
                worked_on_index = res.1;
            } else {
                return Err(errors::KarisError {
                    error_type: errors::KarisErrorType::InvalidSyntax,
                    message: format!(
                        "[INVALID SYNTAX] Not an infix; Token {:?} Ln {} Col {}",
                        token.literal, token.line_number, token.column_number
                    ),
                });
            }
        }
    } else {
        return Err(errors::KarisError {
            error_type: errors::KarisErrorType::MalformedProgram,
            message: format!(
                "[MISSING BINDING POWER] ; Token {:?} Ln {} Col {}",
                token.literal, token.line_number, token.column_number
            ),
        });
    }
}

In Line 1 we use the index of the current token to access the next token from the bucket, then extract its ParserType as seen in Line 2. In Line 4 we check whether rbp is less than the binding power of the next token. If the assertion passes, we check for the presence of its led function and then call it. For complex expressions, this process repeats itself over and over until the entire expression is parsed. With that, we have the core functionality of our parser.
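To build intuition for the recursion, consider a hypothetical trace of parsing 1 + 2 * 3; the binding powers of 60 for + and 70 for * are illustrative, since the actual values are ours to choose:

// expression(rbp = 0) at `1`
//   nud(1)                           -> left = 1
//   next is `+` (bp 60); 0 < 60      -> call led(+)
//     led(+) calls expression(rbp = 60) at `2`
//       nud(2)                       -> left = 2
//       next is `*` (bp 70); 60 < 70 -> call led(*)
//         led(*) calls expression(rbp = 70) at `3` -> 3
//       left = (2 * 3)
//   left = (1 + (2 * 3))
//
// result: 1 + (2 * 3), because `*` binds tighter than `+`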

Register LET token parsing procedure methods

We begin with the TokenRegistry. We define a method that registers the nud and led procedure for the LET token.

fn add_let_binding(&mut self) {
    let obj = ParserType {
        nud_fn: Some(Self::parse_let_expressions),
        led_fn: None,
        binding_power: Some(10),
    };
    self.register(IdentifierKind::LET, obj);
}

In Line 2 we instantiate a ParserType, the object containing the token’s null-denotation and left-denotation functions, passing in the corresponding values. Let’s first describe how the LET token will be parsed. Recall that the nud function is meant to handle situations where there is no left context, while the led function is meant to be called when there is a left context. Consider a let binding in Karis:

let age @int = 10;

The token in focus is let, which has nothing defined before it. In this case, we say that let has no left context. Therefore, parsing the LET token will involve reading the entire statement from left to right until the semicolon is encountered. Thus, we only need to specify the nud function, which we do in Line 3. In Line 5 we give the token a binding power of 10. The actual value of the binding power is arbitrary. In the last line, we add the ParserType object for the LET token to the registry.
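As a quick sanity check, we could fetch the LET entry back from the registry. Below is a minimal sketch, assuming retrieve_from_registry returns an Option<&ParserType> as implied by its usage earlier:

let registry = TokenRegistry::new();
let parser_type = registry
    .retrieve_from_registry(IdentifierKind::LET)
    .expect("LET should be registered");

// LET carries only a nud function and a binding power of 10
assert!(parser_type.nud_fn.is_some());
assert!(parser_type.led_fn.is_none());
assert_eq!(parser_type.binding_power, Some(10));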

Implementing the LET parser

As shown above, the core of parsing the LET token resides in the parse_let_expressions function. The base principle for parsing a LET expression goes as follows:

pub(crate) fn parse_let_expressions(tok: Token, index: usize, bucket: Rc<RefCell<Vec<Token>>>) -> Result<(Objects, usize), errors::KarisError> {
    if tok.token_type != IdentifierKind::ASSIGN {
        let new_index = index + 0x1;
        let next_token = bucket.borrow()[new_index].clone();
        return Self::parse_let_expressions(next_token, new_index, bucket.clone());
    }

    /* trimmed for readability */
}

In Lines 2 to 5 we move along the expression recursively until we reach the ASSIGN token. The ASSIGN token signals that the tokens on its left contain the binding variable and typing information, while the tokens on its right contain either a value or another expression. With the ASSIGN token found, we now dissect the bucket to find where the variable name and typing information are. Consider the following example:

let age @int = 10;

From the above, we can see that the LET expression structure is deterministic, so we can accurately determine the indexes of the tokens to the left of the ASSIGN.
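Concretely, relative to the index of the let token, the variable name always sits one step ahead and the typing information two steps ahead:

// let   age   @int   =    10   ;
//  0     1     2     3    4    5      <- token indexes
//  ^     ^     ^
//  |     +1    +2
//  index_of_let_ident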

With this information, we can now add some additional logic to verify the presence of both the variable name and the typing information.

// `borrow` is `bucket.borrow()` and `index_of_let_ident` points at the `let`
// token; both are established earlier in the function (trimmed for readability)

// parser validation : check if variable name has been specified
let variable_name_token = borrow.get(index_of_let_ident + 0x01);
if variable_name_token.is_none() {
    return Err(errors::KarisError {
        error_type: errors::KarisErrorType::MissingVariableName,
        message: format!(
            "[MISSING VARIABLE NAME] Expected to find `<variable name>` ; Token {:?} Ln {} Col {}",
            tok.literal, tok.line_number, tok.column_number
        ),
    });
}

// parser validation : check if typing information has been provided
let typing_token = borrow.get(index_of_let_ident + 0x02);
if typing_token.is_none() {
    return Err(errors::KarisError {
        error_type: errors::KarisErrorType::MissingTypeInfo,
        message: format!(
            "[MISSING TYPE INFO] Expected to find either `@int | @string | @bool | @unit` ; Token {:?} Ln {} Col {}",
            tok.literal, tok.line_number, tok.column_number
        ),
    });
}

As mentioned earlier, the right-hand side of the ASSIGN token can be either a value or another expression. For now, we will restrict ourselves to handling only values.

let typing_token = typing_token.unwrap();
if typing_token.token_type == IdentifierKind::LSQUAREBRACE {
            /* trimmed for readability */
} else {
    let node = Node {
        identifier_kind: Some(IdentifierKind::LET),
        variable_name: Some(variable_name_token.unwrap().literal.clone()),
        return_type: Some(Self::typing_kind(typing_token)),
        ..Default::default()
    };

    // return index that points to the `typing` not the `Assign`
    Ok((Objects::TyNode(node), index - 0x1))
}

In Lines 5 to 10 we construct a Node object for the LET token and return it back up the stack to build our abstract syntax tree. This is only true for values. For expressions, the logic involves calling the expression function, which recursively builds nodes for each part of the expression. With that, our parser is ready to parse LET expressions.

Register RETURN token parsing procedure methods

We begin with the TokenRegistry. We define a method that registers the nud and led procedure for the RETURN token.

fn add_return(&mut self) {
    let obj = ParserType {
        nud_fn: Some(Self::parse_return),
        led_fn: None,
        binding_power: Some(0x00),
    };
    self.register(IdentifierKind::RETURN, obj);
}

Because the RETURN token does not have any left context, we will only define its nud function. In Line 5 we set its binding power to zero (0), which gives the token the least precedence.

Implementing the RETURN parser

Let’s consider the following simple return statements:

return true;
return 10;
return add(1,2);

From the above, we can see that each return statement terminates with a SEMICOLON. What is returned to the caller can be either a value or an expression, as seen in Line 3. We can now deduce how the actual parser will work: step past the RETURN token, parse what follows as an expression, then verify that the statement terminates with a SEMICOLON.

Let’s see the implementation in action.

pub(crate) fn parse_return(tok: Token, index: usize, bucket: Rc<RefCell<Vec<Token>>>) -> Result<(Objects, usize), errors::KarisError> {
    let (obj, index) = Parser::expression(0, index + 0x01, bucket.clone())?;

    /* trimmed for readability */
}

The current index in Line 1 points to the RETURN token. We therefore increase the index by one and call the expression function on the new index, as seen in Line 2. The result will be a node object and the index of the last token parsed. The next step is to check for the presence of a SEMICOLON.

let borrow = bucket.borrow();

let next_index = index + 0x01;
let next_token = borrow.get(next_index);
if next_token.is_none() {
    return Err(errors::KarisError {
        error_type: errors::KarisErrorType::InvalidSyntax,
        message: format!(
            "[INVALID SYNTAX] Syntax not correct. Expected something after `{}. Ln {} Col {}  '",
            tok.literal,tok.line_number,tok.column_number
        ),
    });
}

let next_token = next_token.unwrap();

if next_token.token_type != IdentifierKind::SEMICOLON {
    return Err(errors::KarisError {
        error_type: errors::KarisErrorType::InvalidSyntax,
        message: format!(
            "[INVALID SYNTAX] Syntax not correct. Expected tail `;` `{}. Ln {} Col {}  '",
            tok.literal, tok.line_number, tok.column_number
        ),
    });
}

In Line 1 we take a borrow of our bucket of tokens. Using the index from the result of the expression call, we increase the index once more and try to access the token at that position, as seen in Lines 3 and 4. Valid syntax should have a token at that position; otherwise, we return an error, as seen in Line 6. If a token is found, we can confidently extract its value in Line 15 and then check whether it is a SEMICOLON. We expect a SEMICOLON to be present; otherwise our syntax is incorrect, and we return an error, as seen in Line 17. If everything checks out, we build a return node object that will be appended to our abstract syntax tree.

let node = Node {
    identifier_kind: Some(IdentifierKind::RETURN),
    right_child: Some(Right(Box::new(obj))),
    ..Default::default()
};

// move the index to the adjacent `rbrace` token
Ok((Objects::TyNode(node), index + 0x01))

It is important to note that we move the cursor to point to the token after the semicolon. According to our syntax rules, this token will be the right-hand closing brace, or RBRACE for short. We do this so that the next run of the expression function can begin at the right position.

To correctly parse a string literal, we first need to understand how a string literal is framed in the Karis language. Consider the following example:

let name @string = "Karis";

We can see that the string begins after the ASSIGN token. This means that we need a handler for tokens that come after the ASSIGN token. Note that these tokens can be literals or other complex expressions like functions. We will, however, limit ourselves to handling strings.

Right-hand side tokens parsing

We begin by instantiating a ParserType object inside the add_assign function.

fn add_assign(&mut self) {
    let obj = ParserType {
        nud_fn: None,
        led_fn: Some(Self::parse_assign_operator),
        binding_power: Some(30),
    };
    self.register(IdentifierKind::ASSIGN, obj);
}

The most vital piece is Line 4, which injects a led function. The ASSIGN token has both a left context and a right context. Its resulting node object therefore needs to have both sides, which may resemble the below:

      ASSIGN(=)
      /       \
     /         \
   LET(name)  "Karis"

In Line 5 we set its binding power to a significantly higher value because we want it to always sit at the center of the resulting node object. We now proceed to introduce the parse_assign_operator function.

pub(crate) fn parse_assign_operator(left: Objects, token_index: usize, bucket: Rc<RefCell<Vec<Token>>>) -> Result<(Objects, usize), errors::KarisError> {
    todo!()
}

Notice in Line 1 that the first argument the function expects is an Objects, which holds the left-hand side of the assignment. In all scenarios, the left-hand side will always be a LET binding. Recall that we will build an appropriate node object, hence we need access to the left-hand side node object. Next, we add a few validation checks to make sure the passed input has the correct syntax.

let (current_token, next_token) = Self::preconditions(
    token_index,
    bucket.clone(),
    vec![
        IdentifierKind::EOS,
        IdentifierKind::EOF,
        IdentifierKind::SEMICOLON,
        IdentifierKind::LET,
        IdentifierKind::GT,
        IdentifierKind::LT,
        IdentifierKind::GTOREQ,
        IdentifierKind::LTOREQ,
        IdentifierKind::EQ,
        IdentifierKind::SLASH,
        IdentifierKind::MODULUS,
        IdentifierKind::MAIN,
        IdentifierKind::END,
    ],
)?;

The main idea here is that if the next token after the ASSIGN token is not what the parser expects, we return an error. Let’s demonstrate such erroneous syntax with a few examples:

Error: Incomplete assignment

let name @string =

Error: Invalid program

=

Error: Invalid keyword assignment

let x @string = let;
let x @string = @main;
let x @string = @end;

Error: Invalid assignment

let name @string = ;
let a @int = >;
let b @int = <;
let c @int = >=;
let d @int = <=;
let e @int = =;
let e @int = /;
let e @int = %;

The next thing we do is check whether the token two steps forward is a SEMICOLON. We do this because a SEMICOLON at that position indicates that we are dealing with a literal and not a complex expression.
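For our running example, the positions line up as shown below, with token_index pointing at the ASSIGN token:

// let   name   @string   =       "Karis"   ;
//  0     1        2      3          4      5
//                        ^          ^      ^
//                  token_index     +1     +2  <- a SEMICOLON here means a literal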

let borrow = bucket.borrow();
let two_step_token = borrow.get(token_index + 0x02);
if let Some(two_step_token) = two_step_token {
    if two_step_token.token_type == IdentifierKind::SEMICOLON {
        todo!()
    }
}

In Line 1 we take a borrow of the bucket containing all the tokens. In Line 2 we move the cursor two steps forward by increasing the current index by two, then retrieve the token pointed to by the new index. In Line 4 we check whether the found token is a SEMICOLON. Inside the block, we now add logic to check whether the next token is a STRINGLITERAL and, if so, return its appropriate node object.

if next_token.token_type == IdentifierKind::STRINGLITERAL {
    let obj = StringValue {  value: Some(next_token.literal) };
    let literal = LiteralObjects::ObjStringValue(obj);
    let node = Node {
        identifier_kind: Some(current_token.token_type),
        left_child: Some(Right(Box::new(left))),
        right_child: Some(Left(literal)),
        ..Default::default()
    };

    // move the cursor to the end. This will finish the recursive call stack and return to the caller
    return Ok((Objects::TyNode(node), token_index + 0x02));
}
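Putting everything together for let name @string = "Karis";, the LET node produced by parse_let_expressions and the string literal combine into a node shaped roughly as below; this is a sketch of the structure, not exact parser output:

      ASSIGN(=)
      /       \
     /         \
  LET(name)   StringValue { value: "Karis" }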

With that, we can confidently parse string literals.

