December 29, 2022

Crafting a compiler in Rust : Chapter 9

Have you ever wondered how programming languages like Python, Java or Ruby work? Do you have an interest in building your own language? If so, this series of posts is for you. We dive into building a novel programing language from first principles. Get ready to have your hands dirty. It’s going to be a fun ride. Access the final product here : https://github.com/daviddexter/karis-programming-language

Overview of parsing conditional expressions

Parsing conditional expressions is slightly different from the expressions we have encountered so far. Let’s examine how a conditional expression looks like:

if 5 > 3 {
    return true;
}else {
    return false;
}

In Line 1 we see the expression begins with the IF token followed by a condition statement then breaks into a block with the LBRACE token. In Line 3 the alternate condition begins with the ELSE token and continues to break into a block. Therefore the steps to parse a conditional expression begins with breaking each part of the expression into it’s on its own distinct “program” and then merging them to form a node object. In the Program type, there is a non_root field which we will specify for every expression part.

Register parsing procedure methods

We introduce three parsing procedures, one for the LBRACE token, one for the IF token and the other for the ELSE token.

fn add_if_statement(&mut self) {
    let obj = ParserType {
        nud_fn: Some(Self::parse_if_else_expressions),
        led_fn: None,
        binding_power: Some(10),
    };
    self.register(IdentifierKind::IF, obj);
}

fn add_else_statement(&mut self) {
    let obj = ParserType {
            nud_fn: Some(Self::parse_if_else_expressions),
            led_fn: None,
            binding_power: Some(10),
    };
    self.register(IdentifierKind::ELSE, obj);
}

fn add_left_brace_statement(&mut self) {
    let obj = ParserType {
        nud_fn: Some(Self::parse_block),
        led_fn: None,
        binding_power: Some(10),
    };
    self.register(IdentifierKind::LBRACE, obj);
}

IF and ELSE token

Both tokens use the same parsing function parse_if_else_expressions. Let’s add it.

pub(crate) fn parse_if_else_expressions(tok: Token,index: usize, bucket: Rc<RefCell<Vec<Token>>>) -> Result<(Objects, usize), errors::KarisError> {
    todo!("add implementation here");
}

The first step is to check whether the current token is either an IF or ELSE. Any other token would be considered invalid.

let borrow = bucket.borrow();
if borrow.get(0x0).is_none() {
    return Err(errors::KarisError {
        error_type: errors::KarisErrorType::MissingConditionalIndentifier,
        message: format!(
            "[MISSING CONDITIONAL IDENTIFIER] Expected to find `if` or `else` ; Token {:?} Ln {} Col {}",
            tok.literal, tok.line_number, tok.column_number
        ),
    });
}

Next, we add another validation of checking whether there is an LBRACE token. We do this by moving the cursor forward until we find the token. In the event the token is absent we return an error indicating the source text has an invalid syntax.

let mut end_index: usize;

#[allow(clippy::redundant_clone)]
let index_at_if_lbrace = Self::traverse_forward_until(
    tok.clone(),
    index,
    bucket.clone(),
    IdentifierKind::LBRACE,
)?;

We then extract the tokens between the IF token and LBRACE and parse them in isolation.

let items_before_if_lbrace = borrow.get(index + 0x01..index_at_if_lbrace).unwrap();
let exp_vec_tokens = Vec::from(items_before_if_lbrace);
let expression_node = Parser::default()
        .program_as_non_root()
        .parse_from_vec(exp_vec_tokens)?;

In Line 1 we index into the bucket using the current index and the index of the LBRACE token. In Line 2 we then create a new bucket and call the parse_from_vec function of the Parser object which parses the tokens and returns an appropriate expression object. Next, we retrieve the index of the RBRACE token closing the IF block. We then extract all the tokens with the block, construct a new bucket and call the parse_from_vec function of the `Parser.

let index_at_if_rbrace = Self::traverse_forward_until(
            tok.clone(),
            index,
            bucket.clone(),
            IdentifierKind::RBRACE,
        )?;

        let items_after_lbrace = borrow
            .get(index_at_if_lbrace + 0x01..index_at_if_rbrace)
            .unwrap();
        let if_block_vec_tokens = Vec::from(items_after_lbrace);
        let if_block_node = Parser::default()
            .program_as_non_root()
            .parse_from_vec(if_block_vec_tokens)?;

We check for the presence of an ELSE token and do the same thing we have done above. Finally, we wrap all objects and return the corresponding node object.

let node = Node {
    identifier_kind: Some(IdentifierKind::IF),
    right_child: Some(Right(Box::new(expression_node))),
    alternate: alternate_node,
    block_children: Some(Vec::from([if_block_node])),
    ..Default::default()
};

Ok((Objects::TyNode(node), end_index))

Overview of parsing functions expressions

A function in Karis begins with the keyword fn then followed by a LPAREN and optionally arguments. Therefore parsing a function will follow the following steps:

Register parsing procedure methods

We start by registering the appropriate parsing functions similar to what we have done for the previous literals and expressions.

fn add_function_declaration(&mut self) {
    let obj = ParserType {
        nud_fn: Some(Self::parse_function_definition),
        led_fn: None,
        binding_power: Some(70),
    };
    self.register(IdentifierKind::FUNCTION, obj);
}

We continue with now introducing the parse_function_definition which will accomplish the actual parsing.

pub(crate) fn parse_function_definition(tok: Token,index: usize,bucket: Rc<RefCell<Vec<Token>>>) -> Result<(Objects, usize), errors::KarisError> {
    todo("add implementation here")
}

Like defensive engineers which we are, we first validate that the function definition is correct before we proceed to parse.

let borrow = bucket.borrow();
let next_index = index + 0x01;
let next_token = borrow.get(next_index);
if next_token.is_none() {
    return Err(errors::KarisError {
        error_type: errors::KarisErrorType::InvalidSyntax,
        message: format!(
            "[INVALID SYNTAX] Syntax not correct. Expected something after `{}. Ln {} Col {}  '",
            tok.literal,tok.line_number,tok.column_number
        ),
    });
}

In Line 1 to Line 3 we get a copy of the bucket and increment the current index by one then use the new index to retrieve the next token. We then validate whether there is actually a next token in the bucket. If the validation fails we return an error. This validation will catch incorrect syntax like:

let echo @string = fn

The next validation is to ensure that the next token is an LPAREN token.

let next_token = next_token.unwrap();
if next_token.token_type != IdentifierKind::LPAREN {
    return Err(errors::KarisError {
        error_type: errors::KarisErrorType::InvalidSyntax,
        message: format!(
            "[INVALID SYNTAX] Syntax not correct. Expected `(` after `{}. Ln {} Col {}  '",
            tok.literal, tok.line_number, tok.column_number
        ),
    });
}

If the validation checks out, we proceed to call the helper function collect_function_definition_args. This function will recursively move the cursor to the right collecting and parsing arguments that may be enclosed between the LPAREN and RPAREN tokens. It will return the arguments as a vector and the position of the LBRACE token.

let (function_args, lbrace_index) = Self::collect_function_definition_args(
    next_index + 0x01,
    bucket.clone(),
    Vec::new(),
    0x00,
)?;

We add another validation this time checking whether the function has a body. A function without a body is invalid.

let lbrace = borrow.get(lbrace_index).unwrap();
    if lbrace.token_type != IdentifierKind::LBRACE {
        return Err(errors::KarisError {
            error_type: errors::KarisErrorType::InvalidSyntax,
            message: format!(
                "[INVALID SYNTAX] Syntax not correct. Expected left-brace. Found {}. Ln {} Col {}  '",
                lbrace.literal, lbrace.line_number, lbrace.column_number
        ),
    });
}

Once the validation succeeds, we they jump inside the block and again call the collect_function_definition_args helper function to collect and parse expressions in the body.

let (block_children, last_index) =  Self::collect_block_children(lbrace_index, bucket.clone(), Vec::new(), 0x00)?;

With that now we construct a node object representing the function definition passing in the appropriate arguments and children.

let node = Node {
    identifier_kind: Some(IdentifierKind::FUNCTION),
    func_params: Some(function_args),
    block_children: Some(block_children_filtered),
    ..Default::default()
};

Ok((Objects::TyNode(node), last_index))

We are done parsing functions.

Overview of parsing function call expressions

A call expression begins with what the lexer will identify as a VARIABLE. This is just a string then followed by LPAREN and RPAREN tokens. Optionally, a function call expression can have call parameters that are passed to the function. Here are some examples:

// Without parameter
add()

// Parameters as literal
add(1, 2, 3)

// Parameters as function calls
add(one(),two())

// Parameters as variables
add(x,y)

Therefore, parsing a function call expression involves the following steps:

Register parsing procedure methods

Similar to what we have so far, we first register the appropriate parsing methods.

fn add_call_declaration(&mut self) {
    let obj = ParserType {
            nud_fn: Some(Self::parse_function_call),
            led_fn: None,
            binding_power: Some(70),
    };
    self.register(IdentifierKind::CALLER, obj);
}

We now introduce the parse_function_call associated function which will handle the actual parsing.

pub(crate) fn parse_function_call(tok: Token,index: usize,bucket: Rc<RefCell<Vec<Token>>>) -> Result<(Objects, usize), errors::KarisError> {
    todo!("add implementation here")
}

As it has become a custom in this course, we first add a few validations to ensure that we working with the correct syntax. From an engineering perspective, such defensive programming is advisable to make our entire program somewhat error-proof.

let borrow = bucket.borrow();
let next_index = index + 0x01;
let next_token = borrow.get(next_index);
if next_token.is_none() {
    return Err(errors::KarisError {
        error_type: errors::KarisErrorType::InvalidSyntax,
        message: format!(
            "[INVALID SYNTAX] Syntax not correct. Expected something after `{}. Ln {} Col {}  '",
            tok.literal,tok.line_number,tok.column_number
        ),
    });
}

In Line 1 to Line 3 we retrieve the next token the check whether there is actually a next token. Afterwards, we examine the next token to see if it is an LPAREN token. Recall that from the parser’s perspective the current token is a VARIABLE token. The only way for it to be aware if it is handling a function call expression is if a ( is adjacent to the variable. Otherwise, the variable will treat like a variable only.

let next_token = next_token.unwrap();
if next_token.token_type != IdentifierKind::LPAREN {
    return Err(errors::KarisError {
        error_type: errors::KarisErrorType::InvalidSyntax,
        message: format!(
            "[INVALID SYNTAX] Syntax not correct. Expected `(` after `{}. Ln {} Col {}  '",
            tok.literal, tok.line_number, tok.column_number
        ),
    });
}

Next, we call the helper function collect_function_call_params which will move the cursor to the right until it reaches the closing RPAPEN while collecting and parsing any available parameters.

let (params, closing_paren_index) =
            Self::collect_function_call_params(next_index, bucket.clone(), Vec::new(), 0x00)?;

The actual implementation of the collect_function_call_params function is quite involving. See the full implementation at the end of the lesson for details. Once we have parsed all parameters, we then construct an appropriate node object representing the function call. We should however be aware that Karis has built-in functions that we need to handle as well. For now, however, we will leave a todo to remind us to handle such functions later in the course.

let node = if tok.token_type == IdentifierKind::PRINT || tok.token_type == IdentifierKind::FORMAT {
            todo!("add implementation here")
} else {
    Node {
        variable_name: Some(tok.literal),
        identifier_kind: Some(IdentifierKind::CALLER),
        call_params: Some(params),
        ..Default::default()
    }
};

Ok((Objects::TyNode(node), closing_paren_index))

That concludes our implementation for parsing functions calls.

Share

© Mwangi Kariuki 2019-2024