November 2, 2022

Crafting a compiler in Rust: Chapter 6

Have you ever wondered how programming languages like Python, Java or Ruby work? Do you have an interest in building your own language? If so, this series of posts is for you. We dive into building a novel programming language from first principles. Get ready to get your hands dirty. It’s going to be a fun ride. Access the final product here: https://github.com/daviddexter/karis-programming-language

Introduction to parsers

A critical step in language interpretation is recognizing the kinds of units a statement is composed of. To recognize a unit means two things:

1 - Distinguish a unit construct from other constructs allowed by the language
2 - Identify elements and substructures in the statement

For instance, if we identify a statement as an assignment, we can distinguish the units on the left and right of the assignment operator =. In principle, we intend to take a sequence of unit constructs and build a data structure representing the source text. The parser is the tool we will use to achieve that goal. The data structure can be a parse tree, an abstract syntax tree or any other hierarchical structure. A parser works in the company of a lexer, which splits the source text into identifiable unit constructs.

Let’s illustrate this with an example. Consider a JSON string in javascript:

const input = '{"language": "karis", "version": "v0.0.1" }';
const output = JSON.parse(input);

The input is just a string. When we pass it to the parser behind the JSON.parse function, we get back a JavaScript object. The parser has converted the input to a data structure that can be used for some other purposes.

> output
{language: 'karis', version: 'v0.0.1'}


> output.language
'karis'

> output.version
'v0.0.1'

That is exactly what we want to achieve with Karis.

Parser data structure

In most language parsers, the most common data structure used is an abstract syntax tree (AST). Like any tree, an AST is made up of nodes, and these nodes represent the constructs identified. Depending on the implementation details of the parser, constructs such as semicolons, newlines, whitespace, comments, braces and parentheses may be present in the AST.
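To make the idea concrete, here is a minimal sketch of an AST for arithmetic expressions in Rust. The `Expr` and `eval` names are illustrative choices for this post, not part of the Karis implementation:

```rust
// A minimal AST sketch for arithmetic expressions. `Expr` and `eval`
// are illustrative names, not the Karis implementation.
#[derive(Debug)]
enum Expr {
    Int(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

// Walk the tree recursively and compute its value.
fn eval(e: &Expr) -> i64 {
    match e {
        Expr::Int(n) => *n,
        Expr::Add(l, r) => eval(l) + eval(r),
        Expr::Mul(l, r) => eval(l) * eval(r),
    }
}

fn main() {
    // The AST for `1 + 2 * 3`: multiplication sits deeper in the tree,
    // so it is evaluated first.
    let tree = Expr::Add(
        Box::new(Expr::Int(1)),
        Box::new(Expr::Mul(Box::new(Expr::Int(2)), Box::new(Expr::Int(3)))),
    );
    println!("{}", eval(&tree)); // prints 7
}
```

Notice that once the source text lives in a tree, questions like "which operation happens first?" are answered by the tree's shape rather than by re-scanning the text.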

Parsers are a well-understood problem in computer science. As such, the common advice when facing a parsing problem is to use a parser generator. A parser generator is a tool that, when fed a formal description of a language, produces the corresponding parser. The generated parser can then produce a syntax tree when fed some source code. Prominent examples of parser generators are bison, yacc and ANTLR.

At their core, parser generators are similar, differing in the format of input they accept and the format of output they produce. The majority, if not all, parser generators use a context-free grammar (CFG) as the format of their input.

Simple context-free grammar

Let’s write a simple grammar definition for a language of arithmetic expressions. We will focus only on the addition and multiplication operations. First, we write the definition in pseudocode as below:

  1. An integer is an expression

  2. If exp1 and exp2 are expressions, so are :

    • exp1 + exp2
    • exp1 * exp2
    • (exp1 + exp2)
    • (exp1 * exp2)

The corresponding context-free grammar would look something like:

exp --> INTLITERAL
exp --> exp ADD exp
exp --> exp MULTIPLY exp
exp --> LPAREN exp ADD exp RPAREN
exp --> LPAREN exp MULTIPLY exp RPAREN

From the definition above, we can deduce that the language has:

  1. Five unit tokens, namely INTLITERAL, ADD, MULTIPLY, LPAREN and RPAREN.
  2. Five rules, each of the form: exp --> ?
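These five tokens translate directly into code. The sketch below shows one way to represent them as a Rust enum, together with a small tokenizer for this toy language; the names mirror the grammar and are illustrative, not the Karis lexer:

```rust
// The five unit tokens of the toy grammar, as a Rust enum.
// Illustrative sketch; not the Karis lexer.
#[derive(Debug, PartialEq)]
enum Token {
    IntLiteral(i64),
    Add,
    Multiply,
    LParen,
    RParen,
}

// A minimal tokenizer: walk the characters and emit tokens.
fn tokenize(src: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut chars = src.chars().peekable();
    while let Some(&c) = chars.peek() {
        match c {
            ' ' => { chars.next(); } // skip whitespace
            '+' => { chars.next(); tokens.push(Token::Add); }
            '*' => { chars.next(); tokens.push(Token::Multiply); }
            '(' => { chars.next(); tokens.push(Token::LParen); }
            ')' => { chars.next(); tokens.push(Token::RParen); }
            '0'..='9' => {
                // accumulate consecutive digits into one INTLITERAL
                let mut n: i64 = 0;
                while let Some(d) = chars.peek().and_then(|c| c.to_digit(10)) {
                    n = n * 10 + d as i64;
                    chars.next();
                }
                tokens.push(Token::IntLiteral(n));
            }
            _ => panic!("unexpected character: {c}"),
        }
    }
    tokens
}

fn main() {
    println!("{:?}", tokenize("(1 + 2) * 3"));
}
```

A parser for this grammar would consume the `Vec<Token>` produced here and apply the five rules to build a tree.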

Why not use a parser generator for Karis

Writing a robust parser is a rather involved task that requires an understanding of established techniques. It is even more complicated if the syntax of the language we are writing the parser for is unusual. While it may be the right choice to use a parser generator, our intention is to learn. Getting a feel for how parsers are built and understanding the nitty-gritty details will give us a deeper appreciation of programming language implementation.

There are multiple strategies to parse a language, depending on the complexity of the language. As with every decision in engineering, there are trade-offs to make; specifically, between simplicity and power. Typically, languages like C++ require less efficient but more powerful parsing strategies. Our choice boils down to two strategies: bottom-up parsing or top-down parsing.

Bottom-up parsing

Bottom-up parsing is a technique for analyzing the structure of a program written in a given language. It involves starting with the individual tokens that make up the program, such as keywords, operators, and identifiers, and combining them into larger structures, such as expressions, statements, and functions. This is in contrast to top-down parsing, which starts with the complete program and breaks it down into smaller units.

A bottom-up parser processes the code from left to right, and at each step, it attempts to find a sequence of tokens that can be combined into a higher-level structure. When a match is found, the tokens are replaced with the higher-level structure, and the process continues until the complete program has been parsed.

Consider the following example of grammar:

S -> cAd
A -> a

W = cad

S represents the start symbol, while W represents the input string that we intend to parse. When the input string is fed to a bottom-up parser, it produces the following parse tree:

       S
     / | \
    c  A  d
       |
       a

One of the benefits of bottom-up parsing is that it allows for practical error recovery. If the parser encounters an unexpected token, it can discard input until it reaches a point where parsing can safely resume. This makes the parser resilient to syntax errors and allows it to report helpful error messages to the programmer.

Another advantage of bottom-up parsing is that it can be more efficient than top-down parsing. This is because it avoids the need to perform backtracking, which can be time-consuming. Additionally, bottom-up parsers can be implemented using a stack-based algorithm, which can be more efficient than other parsing algorithms.
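The stack-based idea can be seen in a toy shift-reduce loop for the grammar above: shift tokens onto a stack, and reduce whenever the top of the stack matches the right-hand side of a rule. This is an illustrative sketch, not the Karis parser:

```rust
// A toy shift-reduce (bottom-up) recognizer for the grammar
//   S -> cAd
//   A -> a
// Illustrative sketch; real LR parsers drive reductions from tables.
fn bottom_up_recognize(input: &str) -> bool {
    let mut stack: Vec<char> = Vec::new();
    for c in input.chars() {
        stack.push(c); // shift the next input symbol
        // reduce A -> a whenever 'a' is on top of the stack
        if stack.last() == Some(&'a') {
            stack.pop();
            stack.push('A');
        }
        // reduce S -> cAd whenever the top of the stack matches c A d
        if stack.ends_with(&['c', 'A', 'd']) {
            stack.truncate(stack.len() - 3);
            stack.push('S');
        }
    }
    // success means the whole input collapsed to the start symbol
    stack == ['S']
}

fn main() {
    println!("{}", bottom_up_recognize("cad")); // prints true
}
```

The recognizer accepts exactly the string cad, the only sentence this grammar derives, and rejects anything else.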

Top-down parsing

Top-down parsing is a technique for analyzing the structure of a program written in a given language. It involves starting with the complete program and breaking it down into smaller units, such as expressions, statements, and functions. This is in contrast to bottom-up parsing, which starts with the individual tokens that make up the program and combines them into larger structures.

A top-down parser processes the code from left to right, and at each step, it attempts to find the next smallest unit that matches a production rule in the grammar of the language. When a match is found, the code is replaced with the non-terminal symbol associated with the production rule, and the process continues until the complete program has been parsed.

Consider the following example of grammar:

S -> aABe
A -> bc
B -> d

W = abcde

S represents the start symbol, while W represents the input string that we intend to parse. When the input string is fed to a top-down parser, it produces the following parse tree:

         S
      /  |  \  \
     a   A   B  e
        / \  |
       b   c d
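A top-down parser for this grammar can be written directly, with one function per nonterminal that consumes input left to right. The sketch below is illustrative, not the Karis parser; it recognizes the string abcde, the only sentence this grammar derives:

```rust
// A hand-written top-down (recursive descent) recognizer for
//   S -> aABe
//   A -> bc
//   B -> d
// Illustrative sketch; not the Karis parser.
struct Recognizer<'a> {
    input: &'a [u8],
    pos: usize,
}

impl<'a> Recognizer<'a> {
    // consume one expected terminal, reporting success or failure
    fn expect(&mut self, c: u8) -> bool {
        if self.input.get(self.pos) == Some(&c) {
            self.pos += 1;
            true
        } else {
            false
        }
    }

    // S -> a A B e
    fn s(&mut self) -> bool {
        self.expect(b'a') && self.a() && self.b() && self.expect(b'e')
    }

    // A -> b c
    fn a(&mut self) -> bool {
        self.expect(b'b') && self.expect(b'c')
    }

    // B -> d
    fn b(&mut self) -> bool {
        self.expect(b'd')
    }
}

fn recognize(s: &str) -> bool {
    let mut r = Recognizer { input: s.as_bytes(), pos: 0 };
    // the whole input must be consumed, not just a prefix
    r.s() && r.pos == r.input.len()
}

fn main() {
    println!("{}", recognize("abcde")); // prints true
}
```

Each function mirrors one production rule, which is why top-down parsers are often praised for readability: the code is the grammar.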

One of the advantages of top-down parsing is that it is easier to understand and implement than bottom-up parsing. Because it starts with the complete program and works its way down, top-down parsing is typically simpler and more intuitive than bottom-up parsing. Additionally, top-down parsers can support predictive parsing, which can make the parsing process more efficient and easier to understand.

However, top-down parsing has some disadvantages compared to bottom-up parsing. For example, it can be more difficult to handle left-recursive grammars with top-down parsers, and it may require more backtracking to recover from syntax errors. Additionally, bottom-up parsers can be more efficient because they avoid the need for backtracking.

Why choose Top-down parsing for Karis

There are several advantages to using top-down parsing over bottom-up parsing in the context of programming languages. Some of these advantages include:

  1. Top-down parsers are easier to understand and implement. Because they start with the complete program and break it down into smaller units, top-down parsers are typically simpler to understand and implement than bottom-up parsers.

  2. Left-recursive grammars can be rewritten for top-down parsers. A left-recursive grammar is one in which a non-terminal symbol appears as the leftmost symbol in one of its own production rules. Naive top-down parsers loop forever on such grammars, but standard grammar rewrites, or precedence-based techniques such as the Pratt parsing we adopt below, eliminate the problem, so left recursion does not force us to abandon the top-down approach.

  3. Top-down parsers can support predictive parsing. Predictive parsing is a type of parsing algorithm that can determine which production rule to use for a given non-terminal symbol without the need for backtracking. This can make the parsing process more efficient and easier to understand. Top-down parsers are well-suited to predictive parsing, while bottom-up parsers are not.

  4. Top-down parsers can support early error detection. Because top-down parsers start with the complete program and work their way down, they can detect syntax errors early on in the parsing process. This can allow for more helpful error messages and a better user experience.

Overview of operator precedence

Operator precedence is an important concept in programming language parsers. It determines the order in which operations are performed in expressions that have multiple operators. This is necessary because, without a defined order of operations, it would be difficult for a parser to accurately interpret and evaluate expressions.

In most programming languages, operators are grouped into different levels or “precedence levels.” Operators within the same precedence level are typically evaluated from left to right, while operators in higher precedence levels are evaluated before operators in lower precedence levels. For example, in the expression:

 1 + 2 * 3

The multiplication operation (which is in a higher precedence level than the addition operation) is performed first, resulting in a final value of 7.

There are a few different ways that operator precedence can be implemented in a programming language. One common approach is to use a “precedence table” that defines the precedence levels of each operator. The table may be built into the language itself, or it may be defined by the developer who is writing the parser.
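As a sketch of the first approach, a precedence table can be as simple as a lookup function. The operators and numeric levels below are illustrative choices; higher numbers bind tighter:

```rust
// A precedence table sketched as a lookup function.
// Levels are illustrative; a higher number binds tighter.
fn precedence(op: char) -> u8 {
    match op {
        '+' | '-' => 1,
        '*' | '/' => 2,
        _ => 0, // unknown symbols get the lowest precedence
    }
}

fn main() {
    // In `1 + 2 * 3`, `*` outranks `+`, so it is applied first.
    assert!(precedence('*') > precedence('+'));
    println!("+: {}, *: {}", precedence('+'), precedence('*')); // prints +: 1, *: 2
}
```

A parser consults this table whenever it sees an operator, to decide whether to keep extending the current subexpression or hand control back to the caller.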

Another way to implement operator precedence is through the use of parentheses. Parentheses can be used to explicitly specify the order in which operations should be performed, overriding the default precedence levels. For example, in the expression:

1 + (2 * 3)

The multiplication operation is performed first because it is enclosed in parentheses, resulting in a final value of 7.

It’s worth noting that not all programming languages have the same operator precedence levels. This can lead to confusion for developers who are accustomed to one language but are working with another. It’s important to familiarize yourself with the operator precedence rules of the language you’re using to avoid mistakes and ensure that your code is properly parsed and evaluated.

In conclusion, operator precedence is a crucial concept in programming language parsers. It determines the order in which operations are performed in expressions with multiple operators, and is typically implemented through the use of precedence tables or parentheses. Understanding operator precedence is essential for writing correct and accurate code in any programming language.

Overview of Pratt parser

A Pratt parser is a type of recursive descent parser that uses a technique called top-down operator precedence parsing. This technique allows the parser to easily handle expressions that have a mix of different operators with different precedence levels. The technique is also more flexible and efficient than traditional recursive descent parsers, which can be difficult to extend and maintain. The parser is named after its inventor, Vaughan Pratt, who first described the algorithm in a paper published in 1973.

How Pratt Parser works

In a programming language, an expression is a sequence of symbols that represents a value or operation. For example, the expression:

1 + 2

represents the addition of the values 1 and 2, and the expression:

x * y + z

represents the sum of the product of x and y and the value of z.

Expressions can be nested, so an expression can contain other expressions as operands. For example, the expression:

(x + y) * z

contains the expression x + y as an operand. This can create ambiguity in how the expression should be evaluated because the order in which the operations are performed can affect the final result.

To resolve this ambiguity, programming languages define a set of rules called operator precedence rules. These rules specify the order in which operations should be performed within an expression. For example, in the expression:

x * y + z

the multiplication operation (*) has higher precedence than the addition operation (+), so the multiplication should be performed before the addition.

A Pratt parser evaluates expressions according to their operator precedence rules. The technique involves two main steps: a prefix step, where the token at the start of a (sub)expression is parsed, and an infix step, where operators are repeatedly consumed for as long as their precedence is high enough.

The key advantage of using a Pratt parser is that it allows for a more natural and intuitive way of expressing the grammar of a programming language. Unlike plain recursive descent parsers, which can require complicated code for handling operator precedence and associativity, a Pratt parser encodes the precedence and associativity of each operator directly into the parser through handlers known as led (left denotation) and nud (null denotation) functions.

The led function is used to parse the infix operators in an expression, while the nud function is used to parse the prefix operators. For example, consider the following arithmetic expression:

3 + 4 * 5

In this expression, the “+” and “*” operators are infix operators, while the numbers 3, 4 and 5 sit in prefix position and are handled by nud functions. Using a Pratt parser, we can encode the grammar for this expression as follows:

expression = nud:number, led:+, nud:number

The nud function is used to parse the numbers, while the led function is used to parse the operators. A led function receives the expression parsed so far as its left-hand side and then parses the expression’s right-hand side. In this case, the + operator takes 3 as its left operand and the result of 4 * 5 as its right operand, because * binds tighter, and produces the addition.

Once the prefix parse phase is complete, the Pratt parser uses the precedence information to build the parse tree for the input. This is done by maintaining a stack of operators and operands, repeatedly popping operators and operands from the stack and combining them into subtrees until the entire input has been parsed.

For example, the parse tree for the above expression would be as follows:

        +
       / \
      3   *
         / \
        4   5

Overall, the Pratt parser is a powerful and versatile tool for the implementation of programming languages. Its combination of recursive descent and operator precedence parsing allows it to handle a wide range of language constructs, and its simplicity and efficiency make it a popular choice among language designers.
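To make the algorithm concrete, here is a minimal Pratt-style evaluator for + and * over single-digit integers. The function names and the numeric precedence levels are illustrative choices for this sketch, not the Karis implementation:

```rust
// A minimal Pratt-style evaluator for '+' and '*' over single-digit
// integers. Illustrative sketch; not the Karis parser.
fn parse_expr(tokens: &[char], pos: &mut usize, min_prec: u8) -> i64 {
    // nud step: a prefix position must hold an integer literal
    let mut left = tokens[*pos].to_digit(10).expect("expected a digit") as i64;
    *pos += 1;

    // led step: keep consuming operators that bind at least as tightly
    // as the current minimum precedence
    while *pos < tokens.len() {
        let op = tokens[*pos];
        let prec: u8 = match op {
            '+' => 1,
            '*' => 2,
            _ => break,
        };
        if prec < min_prec {
            break; // hand control back to a caller with lower precedence
        }
        *pos += 1;
        // recurse for the right-hand side at a higher minimum precedence,
        // which makes both operators left-associative
        let right = parse_expr(tokens, pos, prec + 1);
        left = match op {
            '+' => left + right,
            '*' => left * right,
            _ => unreachable!(),
        };
    }
    left
}

fn evaluate(src: &str) -> i64 {
    let tokens: Vec<char> = src.chars().filter(|c| !c.is_whitespace()).collect();
    parse_expr(&tokens, &mut 0, 0)
}

fn main() {
    println!("{}", evaluate("3 + 4 * 5")); // prints 23: * binds tighter than +
}
```

The single `min_prec` parameter is the whole trick: each recursive call only accepts operators strong enough to belong to its subexpression, which reproduces the precedence table without any per-operator grammar rules.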

Overview of how the parser works

Since we are the users of the parser, the question to ask ourselves is how we would want to consume it. The most natural starting point is the console. We expect to point the console program at a file in our local file system, state a parsing intent and press enter for the intent to be fulfilled. As simple as that. Let’s demonstrate our expectations with an actual implementation.

cp -r programs/karis-programming-language work/ && cd work/karis-programming-language && cargo install --path console/ && clear

Type the following command:

karis rppl

We see our Read-parse-print-loop program ready to accept commands. What we want is to pass the file path of a source file. The flag -p (or --filepath in full) allows us to do exactly that. Type the following command:

karis rppl -p testdata/01/test01.kr

It prints the abstract syntax tree of the source code found in the file testdata/01/test01.kr.

Setting up the read-parse-print-loop

To set up the loop, we first register the rppl subcommand in the console program’s command definition with the following piece of code:

.subcommand(
    Command::new("rppl")
        .about("Read-Parse-Print-Loop parsers program and prints on stdout")
        .arg_required_else_help(true)
        .arg(arg!(-p --filepath <PATH>).required(false))
        .arg(
            Arg::new("inspect")
                .short('i')
                .long("inspect")
                .action(ArgAction::SetTrue)
                .required(false),
    ),
)

Line 1 adds our new subcommand to the console program. Without this addition, the subcommand would not be available for use. Line 2 to Line 11 contain command-specific initialization details. The most important are the name of the command, which we specify in Line 2, and its arguments, which we specify in Line 5 and Line 6. All arguments are marked as not required. This is a design choice: when no arguments are supplied, the program prints a help menu showing how to use it. The inspect argument is a utility flag that will allow us to generate a graphical representation of the abstract syntax tree.

Next, we need a mechanism to intercept a call from the rppl command. We do this by adding a match block for the subcommand. We add the following code:

Some(("rppl", sub_matches)) => {
    let file_path = sub_matches.get_one::<String>("filepath");
    let inspect = sub_matches.get_one::<bool>("inspect").unwrap();

    if let Some(file_path) = file_path {
        return parser_from_file(file_path, inspect);
    }

    Ok(())
}
The filepath argument points to the source file that contains the source code to parse. Next, we add the parser_from_file function:
fn parser_from_file(file: &str, inspect: &bool) -> Result<(), KarisError> {
    let path = Path::new(file);
    let path_str = path.to_str().expect("failed to get file path");
    let file = std::fs::read_to_string(path_str)?;
    if file.is_empty() {
        println!("Nothing to parse \n");
    } else {
        let lx = lex::Lexer::new(file);
        let mut parser = Parser::new(lx);
        let res = parser.parse(Some("program_tree.json"))?;
        let inspect = *inspect;
        if inspect {
            res.inspect_and_print()?
        } else {
            println!("{:?}", res);
        }
    }
    Ok(())
}

In the first few lines we open and read the contents of the source file. If the file does not contain any content, the parser does nothing. It is important to note that if the source file does not exist, an error will be thrown as shown below:

Error: KarisError { error_type: IO, message: "No such file or directory (os error 2)" }
Notice that parse receives the name of an output file, program_tree.json, to which the resulting syntax tree will be appended. At this point, the parser does not exist. Let’s change that. We will create a new crate called parser, add it to our project workspace and introduce the parser object:

#[derive(Debug, Clone, Default)]
pub struct Parser {
    pub lexer: Lexer,
    pub bucket: Rc<RefCell<Vec<Token>>>,
    pub program: Program,
}

impl Parser {
    pub fn new(lexer: Lexer) -> Self {
        Self {
            lexer,
            bucket: Rc::new(RefCell::new(Vec::new())),
            program: Program::default(),
        }
    }

    pub fn parse(&mut self, json_name: Option<&str>) -> Result<Objects, errors::KarisError> {
        // add all tokens to a bucket cache. This will be used later when doing the actual parsing
        loop {
            let token = self.lexer.generate()?;
            let bucket_clone = self.bucket.clone();
            let mut token_bucket = bucket_clone.borrow_mut();
            if token.token_type == IdentifierKind::EOF {
                break;
            } else {
                token_bucket.push(token.clone());
            }
        }

        self.parse_program(json_name)
    }
}

In Line 2 to Line 6 we introduce the parser object that the console program will instantiate. As we can see in Line 3, we inject the lexer that will be responsible for generating tokens from the source code. The bucket in Line 4 will hold each token that the lexer produces. Notice that the bucket is an Rc. In Rust, each piece of data has exactly one owner at a time. While building an abstract syntax tree, however, a node may effectively have multiple “owners” because of multiple references, and that is not permitted by plain ownership. Rc allows us to opt into multiple ownership, and the inner RefCell gives us the ability to mutate the data it holds.
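The Rc plus RefCell pattern can be seen in isolation in the sketch below. The fill_bucket helper is hypothetical, purely for illustration, and not part of the Karis codebase:

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Demonstrates shared, mutable ownership of a token bucket, the same
// pattern the Parser uses for its `bucket` field. `fill_bucket` is a
// hypothetical helper for illustration only.
fn fill_bucket() -> Rc<RefCell<Vec<String>>> {
    let bucket: Rc<RefCell<Vec<String>>> = Rc::new(RefCell::new(Vec::new()));
    // Cloning the Rc copies the pointer and bumps a reference count;
    // both handles refer to the same Vec.
    let handle = Rc::clone(&bucket);

    handle.borrow_mut().push("LET".to_string()); // mutate via the clone
    bucket.borrow_mut().push("INTLITERAL".to_string()); // mutate via the original

    bucket
}

fn main() {
    let bucket = fill_bucket();
    // both pushes landed in the single shared Vec
    println!("{:?}", bucket.borrow()); // prints ["LET", "INTLITERAL"]
}
```

borrow_mut enforces Rust’s borrowing rules at runtime instead of compile time, which is the price paid for this flexibility: overlapping mutable borrows panic rather than fail to compile.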

Setting up the abstract syntax tree

Conceptually, the abstract syntax tree is a tree-like data structure, much like a binary tree or a B-tree. At the top we have the root node and, as we go down and out, we have leaf nodes. We therefore need to define an object or a set of objects to represent the nodes of the tree. First, we introduce a Program which will be the root of the tree.

pub struct Program {
    pub body: Vec<Objects>,
    pub non_root: bool,
}

impl Program {
    pub fn add_object(&mut self, object: Objects) {
        self.body.push(object)
    }

    pub fn count(&self) -> usize {
        self.body.len()
    }

    pub fn toggle_root(&mut self) {
        self.non_root = true;
    }
}

In Line 2 we see that the program is a collection of other objects that have been parsed. The Objects type thus represents every node in our syntax tree. Let’s add it.

#[derive(Debug, EnumAsInner, Clone, Serialize)]
pub enum Objects {
    TyProgram(Program),
    TyNode(Node),
    TyConsumable,
    TyUnknown,
}

In our case, Objects is an enum which can be either a Program, Node, Consumable, or Unknown. We encounter each of these options as we build the parser. We need a way to tell what type of object we are dealing with. To solve this, we’ll declare a trait that each object needs to implement.

/// Declaration: an object must be able to tell what it is.
/// `which` returns what the object knows about itself.
pub trait Declaration {
    // returns the type of the current declaration object
    fn which(&self) -> DeclarationType;
}

#[derive(Debug, PartialEq, Eq, Clone, Serialize)]
pub enum DeclarationType {
    Unknown,
    Program,
    Node,
    Consumable,
}

impl Default for DeclarationType {
    fn default() -> Self {
        DeclarationType::Unknown
    }
}

Next, we need to declare a Node type. This type will be agnostic in usage and declaration meaning it can be used to represent any kind of token. For example:

let num @int = 10;

In the above, let num @int will be treated as its own node. This node will have an identifier kind of LET, a variable_name of num and a return_type of INTTYPE. The = assignment will be its own node, with an identifier kind of ASSIGN, a left_child pointing to the node on the left-hand side, and a right_child pointing to a node representing the literal 10. Finally, the literal 10 will be its own node, with an identifier kind of INTLITERAL and a left_child of type LiteralObjects. The tree of the expression will be of the form:

                 Node(=)
                /       \
           Node(LET)     Node(10)

Here is the implementation:

#[derive(Debug, Default, Clone)]
pub struct Node {
    pub variable_name: Option<String>,

    pub return_type: Option<TypingKind>,

    pub identifier_kind: Option<IdentifierKind>,

    pub array_type: Option<TypingKind>,

    /// type of `Node`
    pub left_child: Option<Either<LiteralObjects, Box<Objects>>>,

    /// the RHS can either be a literal or a node
    pub right_child: Option<Either<LiteralObjects, Box<Objects>>>,

    /// this is used to populate the params of a function definition.
    /// These params can be of different types
    /// If the function has any other return type other than `@unit`, the return type will be evaluated
    /// to match of the definition
    pub func_params: Option<Vec<Either<LiteralObjects, Objects>>>,

    /// this is used for call expressions which can take the form of
    ///       func(1, 2, 3)
    ///       func(1,a,b) where a and b are variables of the same type
    pub call_params: Option<Vec<Either<LiteralObjects, Objects>>>,

    /// alternate will be populated with an `else` part of a conditional expression
    pub alternate: Option<Box<Objects>>,

    /// this will be populated for the `fn` block, `if` block, `else` block, `main` block and
    /// items af an array e.g [ 1 , 2 , 3 ]
    pub block_children: Option<Vec<Objects>>,
}

impl Declaration for Node {
    fn which(&self) -> DeclarationType {
        DeclarationType::Node
    }
}

With that, we have the building blocks to begin parsing the source code.

Expressions vs statements

Karis is primarily an expression language. This means that a syntax form produces a value when evaluated. Typically, expressions can nest other expressions. At evaluation time, each expression is evaluated independently and the results are combined to produce the final value. Let’s examine this further with concrete examples. Consider the following piece of Karis code:

let age @int = 5;

In the above, the variable age is assigned the value 5. This goes against the core premise of an expression, which is to produce a value. We can rewrite the above in a different fashion as below:

let age @int = fn(){
    return 5;
};

In the above, we assign the same variable age to a function that produces the value 5. This version conforms to what we expect an expression to be. In popular programming languages, the first version is typically referred to as a statement. Statements don’t evaluate anything because the actual value has already been declared. The line between expressions and statements is blurry. In the context of Karis, we choose to eliminate this distinction by treating every combination as an expression. This notion is not radically new. Functional languages have embraced it for years. As such, we can say that Karis is influenced by the principles of functional programming languages.

Examination of expressions

Expressions in Karis come in different varieties. Karis has expressions involving infix operators:

100 + 10
100 - 20
100 * 2
100 / 5

In addition, expressions can contain prefix operators:

!true
!false
-100

Below are other varieties of expressions in Karis.

foo == bar
foo != bar                   Comparison expressions
foo < bar
foo > bar

10 * (4 + 3)                 Group expressions
((100 + 34) * 79) * 27

max(2, 3)
sub(add(5, 10), add(2, 5))   Call expressions
min(50, add(3, (7 * 5)))

let age @int = 5;
let price @int = fn(){       Binding expressions
    return 50;
};

© Mwangi Kariuki 2019-2024