January 5, 2023

Crafting a compiler in Rust : Chapter 12

Have you ever wondered how programming languages like Python, Java or Ruby work? Do you have an interest in building your own language? If so, this series of posts is for you. We dive into building a novel programing language from first principles. Get ready to have your hands dirty. It’s going to be a fun ride. Access the final product here : https://github.com/daviddexter/karis-programming-language

Overview of arrays in Karis

Arrays in Karis are opened and closed using square brackets. Recall that we introduced square brackets earlier in the lexer with the sole purpose of supporting arrays. The square brackets enclose both the array item and its value. Below is an example of array syntax in Karis.

let numbers [ @int ] = [ 1, 2, 3, 4, 5 ];

Notice the spacing between the type and the values. This is intentional to make the syntax unique to Karis.

Lexer implementation

We begin by introducing a change in the lexer.

tokens::LSQUAREBRACE => Ok(tokens::Token::new(
                        tokens::IdentifierKind::LSQUAREBRACE,
                        ch_owned,
                        self.line_number,
                        self.position,
                    )),

                    tokens::RSQUAREBRACE => Ok(tokens::Token::new(
                        tokens::IdentifierKind::RSQUAREBRACE,
                        ch_owned,
                        self.line_number,
                        self.position,
                    )),

In this change, we are telling the lexer that when it encounters the [ or ] tokens, it should correctly identify them. That is all that we need to do. Pretty simple, right!. Let’s verify that the lexer can recognize the new tokens.

#[test]
fn should_read_array3() {
        let mut lx = Lexer::new(String::from(
            "
        let numbers [ @int ] = [ 1, 2, 3 ];
        ",
        ));

        assert_eq!(
            lx.read_tokens().unwrap().token_type,
            tokens::IdentifierKind::LET
        );

        /*Trimmed for readability*/
        assert_eq!(
            lx.read_tokens().unwrap().token_type,
            tokens::IdentifierKind::SEMICOLON
        );
}

When we run the tests, the passes easy!!

The parser implementation of an array is where things get intricate. It will involve the following steps:

  1. Introduce literal_typing_match function to match the type of array items
  2. Update the the typing_kind function to match against the LSQUAREBRACE token
  3. Introduce a NUD function for the LSQUAREBRACE token
  4. Update the token registry

Introduce literal_typing_match function

pub(crate) fn literal_typing_match(tok: &Token) -> IdentifierKind {
    match tok.token_type {
            IdentifierKind::INTLITERAL => IdentifierKind::INTTYPE,
            IdentifierKind::BOOLEANLITERAL | IdentifierKind::TRUE | IdentifierKind::FALSE => {
                IdentifierKind::BOOLEANTYPE
            }
            IdentifierKind::STRINGLITERAL => IdentifierKind::STRINGTYPE,
            _ => unreachable!("invalid token type"),
    }
}

The function logic is simple; given a token, we match the literal type and return the corresponding concrete typing information.

Update the the typing_kind function

The current implementation of the typing_kind function does not anticipate the occurrence of LSQUAREBRACE. Let’s change that.

pub(crate) fn typing_kind(tok: &Token) -> TypingKind {
    match tok.token_type {
            IdentifierKind::INTTYPE => TypingKind::Int,
            IdentifierKind::BOOLEANTYPE => TypingKind::Boolean,
            IdentifierKind::STRINGTYPE => TypingKind::String,
            IdentifierKind::LSQUAREBRACE => TypingKind::Array,
            _ => TypingKind::Unknown,
    }
}

Line 6 is the line that we introduce. As you can see, when the LSQUAREBRACE is encountered we return an typing of kind Array

Introduce a NUD function for the LSQUAREBRACE token

This is the most involving part of the change. Let’s begin by setting up the function.

pub(crate) fn parse_opening_array(tok: Token,index: usize,bucket: Rc<RefCell<Vec<Token>>>) -> Result<(Objects, usize), errors::KarisError> {
    let last_item_index = Self::traverse_forward_until(tok.clone(), index, bucket.clone(),IdentifierKind::RSQUAREBRACE, )?;
    let rsquare_index = last_item_index;
}

In Line 2, we traverse forward until we find the RSQUAREBRACE token. This indicates that we have the end of the array definition. In the event the token is not found, it means the syntax is incorrect.

let array_typing_info_index = index - 0x03;
let array_typing_info = borrow.get(array_typing_info_index).unwrap(); // safe to unwrap here

let items = borrow.get(index + 0x01..rsquare_index).unwrap();
let items = items.iter().filter(|item| item.token_type != IdentifierKind::COMMA).collect::<Vec<&Token>>();

In Line 1 we retrieve the typing information of the array. This is type that items or the arrat should be. In Line 4 we retrieve the items individual items of the array then filter to remove any commas between the items.

let mut array_items: Vec<Objects> = Vec::new();
for item in items.iter() {
    let item = *item;
    if Self::literal_typing_match(item) != array_typing_info.token_type {
        return Err(errors::KarisError {
                    error_type: errors::KarisErrorType::InvalidSyntax,
                    message: format!(
                        "[INVALID SYNTAX] Syntax not correct. Expected typing of {:?}. Ln {} Col {}  '",
                        array_typing_info.token_type ,item.line_number,item.column_number
                    ),
        });
    }
}

In Line 1 we instantiate an empty vector that will store the result of parsing each array item. We then iterate over the array items and check that they match the array typing information as seen in Line 3. If the typing information doesn’t match, we return an error.

match item.token_type {
    IdentifierKind::INTLITERAL => {
                    let int = Self::parse_int_literal(item.clone(), index, bucket.clone())?;
                    array_items.push(int.0);
    }
    IdentifierKind::BOOLEANLITERAL | IdentifierKind::TRUE | IdentifierKind::FALSE => {
                    let int = Self::parse_boolean_literal(item.clone(), index, bucket.clone())?;
                    array_items.push(int.0);
    }
    IdentifierKind::STRINGLITERAL => {
                    let obj = StringValue {
                        value: Some(item.literal.clone()),
                    };
                    let literal = LiteralObjects::ObjStringValue(obj);

                    let node = Node {
                        identifier_kind: Some(IdentifierKind::STRINGLITERAL),
                        left_child: Some(Left(literal)),
                        ..Default::default()
                    };
                    let obj_type = Objects::TyNode(node);

                    array_items.push(obj_type);
    }

    _ => {
                    return Err(errors::KarisError {
                         error_type: errors::KarisErrorType::InvalidSyntax,
                         message: format!(
                             "[INVALID SYNTAX] Syntax not correct. Expected something after `{}. Ln {} Col {}  '",
                             tok.literal,tok.line_number,tok.column_number
                         ),
                     });
    }
}

When the typing check succeeds, we parse each token item and insert it into the array_items defined earlier.

array_items = array_items.iter().filter(|item| !item.is_ty_unknown()).cloned().collect::<Vec<Objects>>();
let node = Node {
    identifier_kind: Some(IdentifierKind::ARRAY),
    array_type: Some(Self::typing_kind(array_typing_info)),
    block_children: Some(array_items),
    ..Default::default()
};
Ok((Objects::TyNode(node), rsquare_index + 0x01))

In some cases, parsing the array items return an UNKNOWN token kind. We, therefore, filter the array_items vector to remove an occurrence of UNKNOWN. We then construct a node object with the array items as block children and ARRAY as the identifier kind. Finally, we move the cursor forward by one step so that we move the token after the RSQUAREBRACE.

Update the token registry

The final step is to update the token registry. We introduce a function that will call parse_opening_array

fn add_opening_array(&mut self) {
    let obj = ParserType {
            nud_fn: Some(Self::parse_opening_array),
            led_fn: None,
            binding_power: Some(0x00),
    };
    self.register(IdentifierKind::LSQUAREBRACE, obj);
}

We see that the binding power is set to zero. This will give it lower precedence over other tokens. With the function defined, we add it to the logic of get_token_registry so that it can be called whenever an LSQUAREBRACE is encountered.

Verification test

Just to make sure the parser works properly on arrays, let’s add a simple test:

#[test]
fn should_parse_array() {
    let lx = Lexer::new(String::from("let items [ @int ] = [ 1, 2, 3 ];"));
    let mut parser = Parser::new(lx);
    let res = parser.parse(Some("should_parse_array.json"));
    assert!(res.is_ok());
}

Let’s begin by studying the AST of an array type. Consider the following snippet:

let numbers [ @int ] = [ 1,2,3,4,5 ];

It will produce the following AST tree.

{
  "TyProgram": {
    "body": [
      {
        "TyNode": {
          "variable_name": null,
          "return_type": null,
          "identifier_kind": "ASSIGN",
          "alternate": null,
          "block_children": null,
          "left_child": {
            "TyNode": {
              "variable_name": "numbers",
              "return_type": "Array",
              "identifier_kind": "LET",
              "alternate": null,
              "block_children": null
            }
          },
          "right_child": {
            "TyNode": {
              "variable_name": null,
              "return_type": null,
              "identifier_kind": "ARRAY",
              "alternate": null,
              "block_children": [
                {
                  "TyNode": {
                    "variable_name": null,
                    "return_type": null,
                    "identifier_kind": "INTLITERAL",
                    "alternate": null,
                    "block_children": null,
                    "left_child": {
                      "ObjintegerValue": {
                        "value": 1
                      }
                    }
                  }
                },
                {
                  "TyNode": {
                    "variable_name": null,
                    "return_type": null,
                    "identifier_kind": "INTLITERAL",
                    "alternate": null,
                    "block_children": null,
                    "left_child": {
                      "ObjintegerValue": {
                        "value": 2
                      }
                    }
                  }
                },
                {
                  "TyNode": {
                    "variable_name": null,
                    "return_type": null,
                    "identifier_kind": "INTLITERAL",
                    "alternate": null,
                    "block_children": null,
                    "left_child": {
                      "ObjintegerValue": {
                        "value": 3
                      }
                    }
                  }
                },
                {
                  "TyNode": {
                    "variable_name": null,
                    "return_type": null,
                    "identifier_kind": "INTLITERAL",
                    "alternate": null,
                    "block_children": null,
                    "left_child": {
                      "ObjintegerValue": {
                        "value": 4
                      }
                    }
                  }
                },
                {
                  "TyNode": {
                    "variable_name": null,
                    "return_type": null,
                    "identifier_kind": "INTLITERAL",
                    "alternate": null,
                    "block_children": null,
                    "left_child": {
                      "ObjintegerValue": {
                        "value": 5
                      }
                    }
                  }
                }
              ]
            }
          }
        }
      }
    ],
    "non_root": false
  }
}

As you can see, the array node has children. These children represent items that belong to the array object. Therefore, evaluating the array itself involves iterating over the children and evaluating them individually.

Evaluation Implementation

The implementation logic is quite simple. Let’s see it in action.

IdentifierKind::ARRAY => {
    let mut result = Vec::new();
    for child in self.block_children.as_ref().unwrap() {
        if let Ok(res) = child.eval(scope.clone(), global.clone()) {
            result.push(res);
        }
    }
    Ok(EvaluationObject::Array(result))
}

In Line 2 we instantiate an empty vector. This vector will store the result after evaluating every item of the array. In Line 8 we return the result of the evaluation back to the caller. We are done. Let’s write a test to validate the implementation.

#[test]
fn should_evaluate_array() {
    let lx = Lexer::new(String::from(
            "
        let numbers [ @int ] = [ 1,2,3,4,5 ];

        ",
    ));

    let global_binding_resolver = hashbrown::HashMap::new();
    let mut parser = Parser::new(lx);
    let res = parser.parse(Some("should_evaluate_array.json"));
    let mut evaluator = Evaluator::new(parser);
    evaluator.repl_evaluate_program(Rc::new(RefCell::new(global_binding_resolver)));
    assert!(res.is_ok());
}
Share

© Mwangi Kariuki 2019-2024