November 29, 2022

Crafting a compiler in Rust: Chapter 8

Have you ever wondered how programming languages like Python, Java or Ruby work? Do you have an interest in building your own language? If so, this series of posts is for you. We dive into building a novel programming language from first principles. Get ready to get your hands dirty. It’s going to be a fun ride. Access the final product here: https://github.com/daviddexter/karis-programming-language

Register INT token parsing procedure methods

To parse integer literals, we begin by instantiating and registering the corresponding ParserType object.

fn add_int_literal(&mut self) {
    let obj = ParserType {
        nud_fn: Some(Self::parse_int_literal),
        led_fn: None,
        binding_power: Some(0x00),
    };
    self.register(IdentifierKind::INTLITERAL, obj);
}

Next we introduce the parse_int_literal function which does the actual work of parsing integers.

pub(crate) fn parse_int_literal(tok: Token, index: usize, _bucket: Rc<RefCell<Vec<Token>>>) -> Result<(Objects, usize), errors::KarisError> {
    let as_isize = tok.literal.parse::<isize>();
    if as_isize.is_err() {
        return Err(errors::KarisError {
            error_type: errors::KarisErrorType::UnableToConvert,
            message: format!(
                "[FAILED CONVERSION] Failed to convert to isize ; Token {:?} Ln {} Col {}",
                tok.literal, tok.line_number, tok.column_number
            ),
        });
    }

    // the success path follows below
}

Notice in Line 1 that we pass the current token to the function. This token is expected to be an integer. In Line 2 we access the literal value inside the token object and try to convert it to an isize. In the event the conversion is unsuccessful we return an error which signals that the token is not an integer. The final step is to wrap the integer value as a node object and return it to build the abstract syntax tree.

let value = as_isize.unwrap();
let int = integerValue { value: Some(value) };
let obj = LiteralObjects::ObjintegerValue(int);
let node = Node {
    identifier_kind: Some(IdentifierKind::INTLITERAL),
    left_child: Some(Left(obj)),
    ..Default::default()
};
let obj_type = Objects::TyNode(node);
Ok((obj_type, index))
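The convert-or-error step above can be exercised in isolation. Below is a minimal sketch of that flow; parse_int and its error string are illustrative stand-ins, not the actual Karis function or Node types:

```rust
// Stand-in for the convert-or-error flow inside parse_int_literal.
// On success the real function wraps the isize in a Node; here we just return it.
fn parse_int(literal: &str) -> Result<isize, String> {
    literal
        .parse::<isize>()
        .map_err(|_| format!("[FAILED CONVERSION] Failed to convert to isize ; Token {:?}", literal))
}

fn main() {
    assert_eq!(parse_int("42"), Ok(42));
    assert_eq!(parse_int("-7"), Ok(-7));
    // a float literal is not a valid INT literal, so the conversion fails
    assert!(parse_int("4.2").is_err());
}
```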

Update add_assign function

If we attempt to parse an integer presently, it will fail because we have not yet hooked the parsing logic to the ASSIGN token. Let’s change that. Inside the add_assign function, we’ll add a new check for integer types.

if next_token.token_type == IdentifierKind::INTLITERAL {
    let (literal, _idx) = Self::parse_int_literal(next_token, token_index + 0x01, bucket.clone())?;

    let node = Node {
        identifier_kind: Some(current_token.token_type),
        left_child: Some(Right(Box::new(left))),
        right_child: Some(Right(Box::new(literal))),
        ..Default::default()
    };

    // move the cursor to the end. This will finish the recursive call stack and return to the caller
    return Ok((Objects::TyNode(node), token_index + 0x02));
}

In Line 2, after checking for the presence of an INTLITERAL, we call the parse_int_literal function, passing it the token, its index and a copy of the bucket. We then create a Node object from the result of its operation and return it. We are now ready to parse integer literals.

Register TRUE and FALSE token parsing procedure methods

We begin by instantiating and registering ParserType objects for each boolean type. Recall that the Karis lexer treats each boolean derivative as a unique token. However, the actual parsing procedure will be the same.

fn add_boolean_true_literal(&mut self) {
    let obj = ParserType {
        nud_fn: Some(Self::parse_boolean_literal),
        led_fn: None,
        binding_power: Some(0x00),
    };
    self.register(IdentifierKind::TRUE, obj);
}

fn add_boolean_false_literal(&mut self) {
    let obj = ParserType {
        nud_fn: Some(Self::parse_boolean_literal),
        led_fn: None,
        binding_power: Some(0x00),
    };
    self.register(IdentifierKind::FALSE, obj);
}

As we can see in Line 3 and Line 11, both tokens share the same parse_boolean_literal function. Let’s now introduce it.

pub(crate) fn parse_boolean_literal(tok: Token, index: usize, _bucket: Rc<RefCell<Vec<Token>>>) -> Result<(Objects, usize), errors::KarisError> {
    todo!("add implementation here")
}

In Line 1 we pass in the current token, its corresponding index and a copy of the bucket. We expect that the current token will always be of kind TRUE or FALSE. Anything else is invalid syntax. Notice that the bucket parameter is prefixed with an underscore. This is a Rust convention indicating that the parameter will not be used inside the function body. As stated, the first thing we do is to check for syntax validity:
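As a quick aside, the underscore convention can be seen in a standalone example (the function below is ours, purely for illustration):

```rust
// The leading underscore tells the compiler `_bucket` is intentionally unused,
// suppressing the unused-variable warning without changing the signature.
fn literal_of(tok: &str, _bucket: &[&str]) -> String {
    tok.to_string()
}

fn main() {
    assert_eq!(literal_of("true", &["let", "x", "=", "true"]), "true");
}
```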

let as_bool = tok.literal.parse::<bool>();
if as_bool.is_err() {
    return Err(errors::KarisError {
        error_type: errors::KarisErrorType::UnableToConvert,
        message: format!(
            "[FAILED CONVERSION] Failed to convert to bool ; Token {:?} Ln {} Col {}",
            tok.literal, tok.line_number, tok.column_number
        ),
    });
}

In Line 1 we try to convert the inner value of the token object to an actual boolean type. If the conversion fails we return an error which means the source text has a syntax error. If the conversion succeeds, we only need to wrap the inner value in a node object type and return it back to the caller.

let value = as_bool.unwrap();
let bool_val = BooleanValue { value: Some(value) };
let obj = LiteralObjects::ObjBooleanValue(bool_val);
let node = Node {
    identifier_kind: Some(IdentifierKind::BOOLEANLITERAL),
    left_child: Some(Left(obj)),
    ..Default::default()
};
let obj_type = Objects::TyNode(node);

Ok((obj_type, index))
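Because we lean on Rust's bool parsing as the syntax check, it is worth noting how strict it is: only the exact lowercase strings "true" and "false" convert successfully. A small sketch (parse_bool is an illustrative stand-in for the flow above):

```rust
// Stand-in for the convert-or-error flow inside parse_boolean_literal.
fn parse_bool(literal: &str) -> Result<bool, String> {
    literal
        .parse::<bool>()
        .map_err(|_| format!("[FAILED CONVERSION] Failed to convert to bool ; Token {:?}", literal))
}

fn main() {
    assert_eq!(parse_bool("true"), Ok(true));
    assert_eq!(parse_bool("false"), Ok(false));
    // capitalized and numeric forms are rejected, which is what we want
    assert!(parse_bool("True").is_err());
    assert!(parse_bool("1").is_err());
}
```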

Currently, trying to parse any boolean value will fail. While we have the correct parsing procedure, we have yet to inject it where it can be called. The place to add it is the ASSIGN token parsing procedure. Like any literal, boolean literal values appear in binding expressions and in return statements. For example:

let active @bool = true;
return false;

No change needs to be made in regard to return statements. However, we need to update the parse_assign_operator function to account for boolean values. Lucky for us, the change is a simple logical addition:

if next_token.token_type == IdentifierKind::TRUE || next_token.token_type == IdentifierKind::FALSE {
    let (literal, _idx) = Self::parse_boolean_literal(next_token, token_index + 0x01, bucket.clone())?;
    let node = Node {
        identifier_kind: Some(current_token.token_type),
        left_child: Some(Right(Box::new(left))),
        right_child: Some(Right(Box::new(literal))),
        ..Default::default()
    };

    // move the cursor to the end. This will finish the recursive call stack and return to the caller
    return Ok((Objects::TyNode(node), token_index + 0x02));
}

In Line 1 we assert whether the next token is of kind TRUE or FALSE. If the assertion succeeds we call the newly added parse_boolean_literal function to do the heavy lifting. We then wrap the result of the parsing function in a Node object and pass the boolean object as the right-hand side child of the ASSIGN node. With that addition, we can now parse boolean values.

Overview of infix operators parsing

Parsing infix operators is one of the most interesting parts of designing a parser. With infix operators, we get to fully appreciate the Pratt parser, specifically the idea of operator precedence guided by binding power. In our current context we will resolve the following infix operators:

Symbol    Token Kind
+         PLUS
-         MINUS
%         MODULUS
*         ASTERISK
/         SLASH

Similar to other tokens, we begin by adding a function to register the appropriate ParserType object for each infix operator.

fn add_infix(&mut self, symbol: IdentifierKind, binding_power: usize) {
    let obj = ParserType {
        nud_fn: None,
        led_fn: Some(Self::parse_infix_operator),
        binding_power: Some(binding_power),
    };
    self.register(symbol, obj);
}

However, unlike the other parser procedures we have designed so far, the add_infix function requires us to pass an explicit value for the token’s binding power. Let’s see how we do this for the infix operators:

self.add_infix(IdentifierKind::PLUS, 30);
self.add_infix(IdentifierKind::MINUS, 30);
self.add_infix(IdentifierKind::MODULUS, 30);
self.add_infix(IdentifierKind::ASTERISK, 35);
self.add_infix(IdentifierKind::SLASH, 40);

Notice that the first three calls pass a binding power value of 30 while the last two pass different values. Why is this? Recall operator precedence. In mathematics, multiplication is more significant than addition or subtraction, and in the BODMAS ordering division comes before multiplication. Hence the SLASH token, which represents the division operation, has the highest binding power, followed by the ASTERISK token. The PLUS and MINUS tokens have the same binding power. Neither is more significant than the other. We can prove this with a mathematical expression:

10 + 5 - 3  = 12
10 - 5 + 3  = 8

As we can see, shifting the position of either + or - changes the result of the operation, since operators of equal precedence are evaluated from left to right.
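To see how binding powers drive evaluation order, here is a minimal, self-contained Pratt-style evaluator over single-digit expressions. It is a sketch with our own helper names, not the Karis parser, but it uses the same binding power values registered above:

```rust
// Binding powers mirror the registrations above: + - % at 30, * at 35, / at 40.
fn binding_power(op: char) -> u8 {
    match op { '+' | '-' | '%' => 30, '*' => 35, '/' => 40, _ => 0 }
}

// Pratt loop: an operator only grabs the expression to its right if its binding
// power exceeds min_bp. Equal powers stop the recursion, giving left associativity.
fn expr(tokens: &[char], pos: &mut usize, min_bp: u8) -> i64 {
    let mut left = (tokens[*pos] as u8 - b'0') as i64; // single-digit operands only
    *pos += 1;
    while *pos < tokens.len() {
        let op = tokens[*pos];
        let bp = binding_power(op);
        if bp <= min_bp { break; }
        *pos += 1;
        let right = expr(tokens, pos, bp);
        left = match op {
            '+' => left + right,
            '-' => left - right,
            '*' => left * right,
            '/' => left / right,
            '%' => left % right,
            _ => unreachable!(),
        };
    }
    left
}

fn eval(src: &str) -> i64 {
    let tokens: Vec<char> = src.chars().filter(|c| !c.is_whitespace()).collect();
    expr(&tokens, &mut 0, 0)
}

fn main() {
    assert_eq!(eval("2+3*4"), 14); // * binds tighter than +
    assert_eq!(eval("8/4+1"), 3);  // / binds tighter than +
    assert_eq!(eval("9-5+3"), 7);  // equal powers evaluate left to right
}
```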

parse_infix_operator function

Next, we introduce the parse_infix_operator function which will handle the actual operation of parsing infix operators.

pub(crate) fn parse_infix_operator(left: Objects, token_index: usize, bucket: Rc<RefCell<Vec<Token>>>) -> Result<(Objects, usize), errors::KarisError> {
    todo!("add implementation here")
}

The first thing we do is to check for the validity of the syntax. We do this by asserting whether the current token matches a set of tokens which are invalid:

let (current_token, _next_token) = Self::preconditions(token_index,
    bucket.clone(),
    vec![
        IdentifierKind::EOS,
        IdentifierKind::EOF,
        IdentifierKind::LET,
        IdentifierKind::FUNCTION,
        IdentifierKind::TRUE,
        IdentifierKind::FALSE,
        IdentifierKind::STRINGLITERAL,
    ],
)?;

From Line 4 to Line 10 we see the set of tokens we don’t expect the current token to be. The presence of any of them renders the source text invalid. We have another pair of peculiar infix operators to think about: opening and closing parentheses. In our implementation, we’ll add a check for the presence of an opening parenthesis.
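The idea behind preconditions is a simple deny-list check: reject the current token if its kind appears in a caller-supplied set of forbidden kinds. A sketch, with Kind and the helper as illustrative stand-ins for IdentifierKind and Parser::preconditions:

```rust
// Illustrative stand-ins for IdentifierKind and the preconditions guard.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind { Eos, Eof, Let, IntLiteral, Plus }

// Reject the current token if its kind is in the forbidden list.
fn preconditions(current: Kind, forbidden: &[Kind]) -> Result<Kind, String> {
    if forbidden.contains(&current) {
        Err(format!("[INVALID SYNTAX] Unexpected token {:?}", current))
    } else {
        Ok(current)
    }
}

fn main() {
    let forbidden = [Kind::Eos, Kind::Eof, Kind::Let];
    assert!(preconditions(Kind::Plus, &forbidden).is_ok());
    assert!(preconditions(Kind::IntLiteral, &forbidden).is_ok());
    assert!(preconditions(Kind::Let, &forbidden).is_err());
}
```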

if current_token.token_type == IdentifierKind::LPAREN {
    todo!("handle this later")
} else {
    todo!("add implementation")
}

For now, we’ll skip handling the opening parentheses to focus on the main infix operators. Next, we extract the binding power of the current token.

// `rg` is the registry holding each token's ParserType object
let current_operator_bp = rg
    .retrieve_from_registry(current_token.token_type)
    .unwrap()
    .binding_power
    .unwrap();

We then call the expression function, passing the current_operator_bp and the index of the next token in the bucket. The result will become the right-hand side child.

let right_after_current_token = Parser::expression(
    current_operator_bp,
    token_index + 0x01,
    bucket.clone(),
)?;

let left_node = Objects::TyNode(Node {
    identifier_kind: Some(current_token.token_type),
    left_child: Some(Right(Box::new(left))),
    right_child: Some(Right(Box::new(right_after_current_token.0))),
    ..Default::default()
});

Next, we call the expression function again, this time passing a binding power of zero and the token two steps past where the previous call stopped. The result from this call becomes the right-hand side child of the final node, while the node we just built becomes its left-hand side child.

let operator_token = borrow.get(right_after_current_token.1 + 0x01).unwrap();
let next_token_nud = Parser::expression(
    0x00,
    right_after_current_token.1 + 0x02,
    bucket.clone(),
)?;

Finally, we wrap both outputs and construct a node object that represents the infix expression.

let node = Node {
    identifier_kind: Some(operator_token.token_type),
    left_child: Some(Right(Box::new(left_node))),
    right_child: Some(Right(Box::new(next_token_nud.0))),
    ..Default::default()
};
Ok((Objects::TyNode(node), next_token_nud.1))

Karis has only three prefix operators: +, - and !, which may appear in contexts like:

let num @int = 1 + -2;
let num @int = 1 + +2;
let num @int = 1 - +2;
!false
!!false
!!!!false

Line 1 is the most obvious case of a prefix operator, with the presence of a negative integer. The cases in Line 2 and Line 3 offer more context on how prefix operators involving integers can be framed. In Line 4 the ! token indicates an intent to negate. Negation intents can be chained together, as seen on Line 6.

Register prefix token parsing procedure methods

We will introduce two functions; one will be used for the PLUS and MINUS tokens while the other will be used for the BANG token.

PLUS and MINUS parsing method

Let’s create a new function named add_minus_or_plus_as_prefix which will instantiate the ParserType object for either the PLUS or MINUS token and inject the appropriate NUD function.

fn add_minus_or_plus_as_prefix(&mut self, symbol: IdentifierKind) {
    let obj = ParserType {
        nud_fn: Some(Self::parse_minus_or_plus_as_prefix),
        led_fn: None,
        binding_power: Some(10),
    };
    self.register(symbol, obj);
}

This is similar to the other register functions we have seen so far. Next, we will add the parse_minus_or_plus_as_prefix function, which handles the actual parsing.

pub(crate) fn parse_minus_or_plus_as_prefix(tok: Token, index: usize, bucket: Rc<RefCell<Vec<Token>>>) -> Result<(Objects, usize), errors::KarisError> {
    todo!("add implementation here")
}

The first step is to access the next token from the bucket. We do this to assert that the current token, which is either a PLUS or a MINUS, is adjacent to an INT token. Any other kind of token is invalid syntax.

let borrow = bucket.borrow();

let next_token = borrow.get(index + 0x01);
if next_token.is_none() {
    return Err(errors::KarisError {
        error_type: errors::KarisErrorType::InvalidSyntax,
        message: format!(
            "[INVALID SYNTAX] Syntax not correct. Expected something after `{}`. Ln {} Col {}",
            tok.literal,tok.line_number,tok.column_number
        ),
    });
}

let next_token = next_token.unwrap();
if next_token.token_type != IdentifierKind::INTLITERAL {
    return Err(errors::KarisError {
        error_type: errors::KarisErrorType::InvalidSyntax,
        message: format!(
            "[INVALID SYNTAX] Syntax not correct after `{}`. Expected integer. Ln {} Col {}",
            tok.literal, tok.line_number, tok.column_number
        ),
    });
}

Once we have proven that the syntax is valid, we add an extra layer of validation to assert that the actual value wrapped in the Token object can be converted into an integer. Consider this defensive programming to make our parser more error-proof.

let as_isize = next_token.literal.parse::<isize>();
if as_isize.is_err() {
    return Err(errors::KarisError {
        error_type: errors::KarisErrorType::UnableToConvert,
        message: format!(
            "[FAILED CONVERSION] Failed to convert to isize ; Token {:?} Ln {} Col {}",
            next_token.literal, next_token.line_number, next_token.column_number
        ),
    });
}

In this scenario we are dealing with prefix operators of kind PLUS or MINUS. Therefore, if the current token does not match our expectation we return an error, meaning the syntax is incorrect. To do this we use Rust’s match expression.

match tok.token_type {
    IdentifierKind::PLUS | IdentifierKind::MINUS => {
        todo!("add implementation here")
    }
    _ => Err(errors::KarisError {
            error_type: errors::KarisErrorType::InvalidSyntax,
            message: format!(
                "[INVALID SYNTAX] Syntax not correct. Expected `-` or `+`, found `{}`. Ln {} Col {}",
                tok.literal, tok.line_number, tok.column_number
            ),
    }),
}

The final step is to create an appropriate Node object and return it back to the abstract syntax tree.

let value_fn = || {
    let v = as_isize.unwrap();
    if tok.token_type == IdentifierKind::PLUS {
        v
    } else {
        -v
    }
};

let int = integerValue { value: Some(value_fn()) };
let obj = LiteralObjects::ObjintegerValue(int);
let node = Node {
    identifier_kind: Some(IdentifierKind::INTLITERAL),
    left_child: Some(Left(obj)),
    ..Default::default()
};

Ok((Objects::TyNode(node), index + 0x01))
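The sign selection inside value_fn can be tested on its own. In this sketch the prefix token is modeled as a plain char for brevity:

```rust
// A `+` prefix keeps the integer as-is; a `-` prefix negates it,
// mirroring the value_fn closure above.
fn apply_prefix(prefix: char, v: isize) -> isize {
    if prefix == '+' { v } else { -v }
}

fn main() {
    assert_eq!(apply_prefix('+', 2), 2);
    assert_eq!(apply_prefix('-', 2), -2);
}
```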

BANG parsing method

Similarly, we begin by adding the add_bang_as_prefix function, which instantiates the ParserType object for the BANG token.

fn add_bang_as_prefix(&mut self, symbol: IdentifierKind) {
    let obj = ParserType {
        nud_fn: Some(Self::parse_bang_as_prefix),
        led_fn: None,
        binding_power: Some(10),
    };
    self.register(symbol, obj);
}

Next we proceed to introduce the parse_bang_as_prefix function. Parsing a BANG token involves two main steps:

pub(crate) fn parse_bang_as_prefix(tok: Token, index: usize, bucket: Rc<RefCell<Vec<Token>>>) -> Result<(Objects, usize), errors::KarisError> {
    let borrow = bucket.borrow();
    let next_token_index = index + 0x01;
    let next_token = borrow.get(index + 0x01);
    if next_token.is_none() {
        return Err(errors::KarisError {
            error_type: errors::KarisErrorType::InvalidSyntax,
            message: format!(
                "[INVALID SYNTAX] Syntax not correct. Expected something after `{}`. Ln {} Col {}",
                tok.literal,tok.line_number,tok.column_number
            ),
        });
    }
}

First, we extract the next token from the bucket and verify its presence. Recall that we intend our parser to be error-proof, therefore checking for potential failure cases is wise. The final step is to match the found token against our expected scenarios.

let next_token = next_token.unwrap();
match next_token.token_type {
    IdentifierKind::TRUE | IdentifierKind::FALSE => {
        let (bool_object, last_index) = Self::parse_boolean_literal(
            next_token.clone(),
            next_token_index,
            bucket.clone(),
        )?;

        let node = Node {
            identifier_kind: Some(IdentifierKind::BANG),
            right_child: Some(Right(Box::new(bool_object))),
            ..Default::default()
        };

        Ok((Objects::TyNode(node), last_index))
    }

    IdentifierKind::BANG => {
        let (bang_object, last_index) = Self::parse_bang_as_prefix(
            next_token.clone(),
            next_token_index,
            bucket.clone(),
        )?;

        let node = Node {
            identifier_kind: Some(IdentifierKind::BANG),
            right_child: Some(Right(Box::new(bang_object))),
            ..Default::default()
        };

        Ok((Objects::TyNode(node), last_index))
    }

    _ => Err(errors::KarisError {
                error_type: errors::KarisErrorType::InvalidSyntax,
                message: format!(
                    "[INVALID SYNTAX] Syntax not correct after `{}` Expected boolean. Ln {} Col {}",
                    tok.literal, tok.line_number, tok.column_number
        ),
    }),
}
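The recursive shape of parse_bang_as_prefix can be sketched by folding a chain of ! tokens into a value instead of building Node objects. parse_bangs and its string tokens are illustrative only:

```rust
// Each `!` consumes the next token and recurses until a boolean literal is reached,
// mirroring the BANG arm of the match above. Anything else is a syntax error.
fn parse_bangs(tokens: &[&str], index: usize) -> Result<bool, String> {
    match tokens.get(index) {
        Some(&"!") => parse_bangs(tokens, index + 1).map(|v| !v),
        Some(&"true") => Ok(true),
        Some(&"false") => Ok(false),
        other => Err(format!("[INVALID SYNTAX] Expected boolean, found {:?}", other)),
    }
}

fn main() {
    assert_eq!(parse_bangs(&["!", "false"], 0), Ok(true));
    assert_eq!(parse_bangs(&["!", "!", "false"], 0), Ok(false));
    assert_eq!(parse_bangs(&["!", "!", "!", "!", "false"], 0), Ok(false));
    assert!(parse_bangs(&["!", "5"], 0).is_err());
}
```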

With that we can now parse prefix operators.

In Karis, we create union expressions with parentheses to influence their precedence and thus the order in which they are evaluated in their context. We’ve seen this before:

(10 + 4) / 2;

The parentheses group the 10 + 4 expression to give it higher precedence, resulting in the correct evaluation order for this mathematical expression.
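Plain Rust arithmetic shows the effect the grouping buys us, assuming the SLASH (/) operator performs division as in the operator table above:

```rust
fn main() {
    // grouped: the addition is evaluated first
    assert_eq!((10 + 4) / 2, 7);
    // ungrouped: division binds tighter than addition
    assert_eq!(10 + 4 / 2, 12);
}
```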

Register parenthesis token parsing procedure methods

The parser treats the LPAREN and RPAREN tokens differently. LPAREN will have both led and nud functions because the token can appear either at the beginning of an expression or in infix position. Consider the examples below:

(10 + 4) / 2;
10 + (4 / 2);

In Line 1 we see that the ( starts the expression. In this case, its nud function will be called instead of the led function. In Line 2 it appears after the PLUS token. Here we treat it as an infix operator, hence the led function will be called. Let’s add the functions:

fn add_opening_parenthesis(&mut self, symbol: IdentifierKind, binding_power: usize) {
    let obj = ParserType {
        nud_fn: Some(Self::parse_opening_parenthesis),
        led_fn: Some(Self::parse_infix_operator),
        binding_power: Some(binding_power),
    };
    self.register(symbol, obj);
}

fn add_closing_parenthesis(&mut self, symbol: IdentifierKind, binding_power: usize) {
    let obj = ParserType {
        nud_fn: None,
        led_fn: Some(Self::parse_closing_parenthesis),
        binding_power: Some(binding_power),
    };
    self.register(symbol, obj);
}

In Line 1 and Line 9 we pass in the binding power for each token. This is important because we need binding power values that give LPAREN and RPAREN higher precedence. In Line 3 and Line 4 we pass the appropriate nud and led functions for the LPAREN token. In Line 12 we only pass the led function for the RPAREN token. Next, we introduce the parsing functions.

Opening parenthesis NUD and LED functions

We begin with the parse_opening_parenthesis function.

pub(crate) fn parse_opening_parenthesis(tok: Token, index: usize, bucket: Rc<RefCell<Vec<Token>>>) -> Result<(Objects, usize), errors::KarisError> {
    todo!("add implementation here")
}

We first extract the next token from the bucket and check whether it is present. In the event the next token is missing, we return an error which means that the syntax is incorrect.

let borrow = bucket.borrow();

let next_token = borrow.get(index + 0x01);
if next_token.is_none() {
    return Err(errors::KarisError {
        error_type: errors::KarisErrorType::InvalidSyntax,
        message: format!(
            "[INVALID SYNTAX] Syntax not correct. Expected something after `{}`. Ln {} Col {}",
            tok.literal,tok.line_number,tok.column_number
        ),
    });
}

Next, we move the cursor to the right recursively until we reach the RPAREN token. If we fail to find an RPAREN token, it means our syntax is incorrect. This also allows us to extract sub-groups of union expressions. For instance:

((5 + 5) + 4) / 2

The LPAREN token at position one corresponds to the RPAREN token at position nine. Therefore, while moving the cursor to the right, if we encounter the inner LPAREN token the parser breaks off and jumps in to parse the inner union expression.
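The forward traversal can be sketched as a depth-counting scan over the tokens. find_matching_rparen is an illustrative helper (0-indexed, over chars rather than Token objects), not Karis's traverse_forward_until:

```rust
// Walk right from an opening parenthesis, tracking nesting depth, until the
// matching RPAREN is found. Returns None when the parentheses are unbalanced.
fn find_matching_rparen(tokens: &[char], lparen_index: usize) -> Option<usize> {
    let mut depth: i32 = 0;
    for (i, &tok) in tokens.iter().enumerate().skip(lparen_index) {
        match tok {
            '(' => depth += 1,
            ')' => {
                depth -= 1;
                if depth == 0 {
                    return Some(i);
                }
            }
            _ => {}
        }
    }
    None // unbalanced: the syntax is incorrect
}

fn main() {
    let tokens: Vec<char> = "((5+5)+4)/2".chars().collect();
    // outer '(' at index 0 matches ')' at index 8
    assert_eq!(find_matching_rparen(&tokens, 0), Some(8));
    // inner '(' at index 1 matches ')' at index 5
    assert_eq!(find_matching_rparen(&tokens, 1), Some(5));
}
```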

Self::traverse_forward_until(tok.clone(), index, bucket.clone(), IdentifierKind::RPAREN)?;

We add the traverse_forward_until function, which moves the cursor forward to the right. If no error is encountered, we then check whether the next token is an INTLITERAL or LPAREN token. This is important because Karis has a limitation of only allowing integers and call expressions inside union expressions. For instance, the expression (true && false) is invalid. Likewise, the expression ("name") is invalid as well.

let next_token = next_token.unwrap();
if next_token.token_type == IdentifierKind::INTLITERAL || next_token.token_type == IdentifierKind::LPAREN {
    todo!("add implementation here")
} else {
    Err(errors::KarisError {
        error_type: errors::KarisErrorType::InvalidSyntax,
        message: format!(
            "[INVALID SYNTAX] Syntax not correct. Expected INT or CALLER after `{}`. Ln {} Col {}",
            tok.literal, tok.line_number, tok.column_number
        ),
    })
}

From Line 5 to Line 11 we return an error if the expression inside the LPAREN and RPAREN tokens is not what we expect. Next, we call the expression function to now parse the expression and return it back to the abstract syntax tree.

let res = Parser::expression(0, index + 0x01, bucket.clone())?;
let object = reorganize_parenthesis_object(res.0.clone());
Ok((object, res.1))

To handle the left denotation, we reuse the parse_infix_operator function. However, we need to update it to accommodate the presence of an LPAREN token.

let right = Self::parse_opening_parenthesis(
    current_token.clone(),
    token_index,
    bucket.clone(),
)?;

let node = Node {
    identifier_kind: Some(current_token.token_type),
    left_child: Some(Right(Box::new(left))),
    right_child: Some(Right(Box::new(right.0))),
    ..Default::default()
};

let res = Objects::TyNode(node);
Ok((res, right.1))

In Line 1, once we encounter an LPAREN token, we call its parse function. This is exactly what we meant by parsing inner union expressions. Finally, we wrap the resulting object and return an appropriate node object back to the abstract syntax tree.

Closing parenthesis LED function

We introduce the parse_closing_parenthesis function, which first checks the current token, asserts that it is a closing parenthesis, and then wraps the left context object as a node object.

pub(crate) fn parse_closing_parenthesis(left: Objects, token_index: usize,
    bucket: Rc<RefCell<Vec<Token>>>) -> Result<(Objects, usize), errors::KarisError> {
    let borrow = bucket.borrow();
    let current_token = borrow.get(token_index).unwrap();

    if current_token.token_type != IdentifierKind::RPAREN {
        return Err(errors::KarisError {
            error_type: errors::KarisErrorType::InvalidSyntax,
            message: format!(
                "[INVALID SYNTAX] Syntax not correct. Expected `)` found {}. Ln {} Col {}",
                current_token.literal, current_token.line_number, current_token.column_number
            ),
        });
    }

    let node = Node {
        identifier_kind: Some(IdentifierKind::GROUPING),
        right_child: Some(Right(Box::new(left))),
        ..Default::default()
    };

    Ok((Objects::TyNode(node), token_index))
}

In Lines 3 and 4 we extract the current token and check whether it’s an RPAREN. If not, we return an error; otherwise, we return a node object of kind GROUPING, thus closing the expression.


© Mwangi Kariuki 2019-2024