Have you ever wondered how programming languages like Python, Java, or Ruby work? Do you have an interest in building your own language? If so, this series of posts is for you. We dive into building a novel programming language from first principles. Get ready to get your hands dirty. It’s going to be a fun ride. Access the final product here: https://github.com/daviddexter/karis-programming-language
Expression function

We start by introducing an associated function of the `Parser` object called `expression`. In Rust, an associated function is not attached to an instance of an object; it is synonymous with a static function in object-oriented languages. We use an associated function because we want to call `expression` recursively without instantiating the `Parser` object.
pub fn expression(rbp: usize, index: usize, bucket: Rc<RefCell<Vec<Token>>>) -> Result<(Objects, usize), errors::KarisError> {
    todo!("implement this")
}
The `bucket` argument holds all the tokens generated by the lexer. Consider the source text below:
let add @int = fn(x @int, y @int){ return x + y; };
Inside the bucket vector, we expect to find the following tokens:
[
Token { token_type: LET, literal: "let", line_number: 1, column_number: 2 },
Token { token_type: VARIABLE, literal: "add", line_number: 1, column_number: 6 },
Token { token_type: INTTYPE, literal: "@int", line_number: 1, column_number: 11 },
Token { token_type: ASSIGN, literal: "=", line_number: 1, column_number: 13 },
Token { token_type: FUNCTION, literal: "fn", line_number: 1, column_number: 16 },
Token { token_type: LPAREN, literal: "(", line_number: 1, column_number: 17 },
Token { token_type: VARIABLE, literal: "x", line_number: 1, column_number: 18 },
Token { token_type: INTTYPE, literal: "@int", line_number: 1, column_number: 23 },
Token { token_type: COMMA, literal: ",", line_number: 1, column_number: 24 },
Token { token_type: VARIABLE, literal: "y", line_number: 1, column_number: 26 },
Token { token_type: INTTYPE, literal: "@int", line_number: 1, column_number: 31 },
Token { token_type: RPAREN, literal: ")", line_number: 1, column_number: 32 },
Token { token_type: LBRACE, literal: "{", line_number: 1, column_number: 33 },
Token { token_type: RETURN, literal: "return", line_number: 1, column_number: 41 },
Token { token_type: VARIABLE, literal: "x", line_number: 1, column_number: 43 },
Token { token_type: PLUS, literal: "+", line_number: 1, column_number: 45 },
Token { token_type: VARIABLE, literal: "y", line_number: 1, column_number: 47 },
Token { token_type: SEMICOLON, literal: ";", line_number: 1, column_number: 48 },
Token { token_type: RBRACE, literal: "}", line_number: 1, column_number: 50 },
Token { token_type: SEMICOLON, literal: ";", line_number: 1, column_number: 51 },
Token { token_type: EOS, literal: "", line_number: 1, column_number: 52 }
]
Every time we call the `expression` function, we pass the index of the token that we intend to parse. The `rbp` argument is the binding power of the token to the right of the one being parsed. Using the token bucket above, if we parse the first token, which is `LET`, we then pass the `rbp` of the next token, which is `VARIABLE`. The actual value of `rbp` does not matter; it’s up to us as language designers to assign each token an appropriate `rbp` value. Let’s now add some logic inside the function.
let borrow = bucket.borrow();
let token = &borrow[index];

if token.token_type == IdentifierKind::EOS {
    return Ok((Objects::TyConsumable, index));
}
We first extract the token in focus and check whether it is of type `EOS`. We do this first because the `EOS` token has no meaning in the abstract syntax tree. Let’s take a step back to understand the core of the Pratt parser. Instead of using a dictionary of grammar rules, the Pratt parser allows us to inject syntax logic as part of the parsing process. Each token belongs to one of the following categories:
- Tokens that run to the right with no left context. An example is the unary minus in `-1`. Such tokens are denoted as Null-Denotation, or NUD for short.
- Tokens that operate from left to right with a left context. An example is an infix operator: `1 + 2`. Such tokens are denoted as Left-Denotation, or LED for short.
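To make the distinction concrete, here is a toy sketch (illustrative names, not the Karis parser’s) of how the same `-` token maps to two different constructions depending on its class:

```rust
// A toy sketch showing how one token carries two parsing behaviors.
#[derive(Debug, PartialEq)]
enum Expr {
    Num(i64),
    Neg(Box<Expr>),            // built by the NUD of `-`: no left context, as in `-1`
    Sub(Box<Expr>, Box<Expr>), // built by the LED of `-`: left context present, as in `2 - 1`
}

// NUD: nothing has been parsed to the left, so `-` consumes only its right operand.
fn minus_nud(right: Expr) -> Expr {
    Expr::Neg(Box::new(right))
}

// LED: the parser hands us the already-parsed left-hand side.
fn minus_led(left: Expr, right: Expr) -> Expr {
    Expr::Sub(Box::new(left), Box::new(right))
}
```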
This means that for each kind of token, we need to indicate beforehand whether it is NUD or LED. We achieve this with a `HashMap`. We choose a hash map because it allows us to access values using a token as a key, and since we have a finite set of tokens, performance is not a concern.
Let’s first introduce function types for the NUD and LED operations.
type NudParserOp =
    fn(Token, usize, Rc<RefCell<Vec<Token>>>) -> Result<(Objects, usize), errors::KarisError>;
type LedParserOp =
    fn(Objects, usize, Rc<RefCell<Vec<Token>>>) -> Result<(Objects, usize), errors::KarisError>;
Next, we define a `ParserType` type that specifies a unique `NudParserOp` and/or `LedParserOp` and a `binding_power` for a specific token. This type will be used internally by the parser to initiate the appropriate parsing operation.
#[derive(Debug, Default, Clone)]
pub struct ParserType {
    pub nud_fn: Option<NudParserOp>,
    pub led_fn: Option<LedParserOp>,
    pub binding_power: Option<usize>,
}
We now introduce a `TokenRegistry`, a hash map from a token identifier to its corresponding `ParserType`.
#[derive(Debug, Default, Clone)]
pub struct TokenRegistry {
    pub registry: HashMap<IdentifierKind, ParserType>,
}

impl TokenRegistry {
    pub fn new() -> TokenRegistry {
        let mut tg = TokenRegistry::default();
        tg.get_token_registry();
        tg
    }

    fn get_token_registry(&mut self) {
        self.consumable(IdentifierKind::SEMICOLON);
        self.consumable(IdentifierKind::EOS);
        self.consumable(IdentifierKind::EOF);
        /* trimmed for readability */
    }

    fn register(&mut self, symbol: IdentifierKind, obj: ParserType) {
        if let Some(val) = self.registry.get_mut(&symbol) {
            if let Some(f) = obj.nud_fn {
                val.nud_fn = Some(f);
            }
            if let Some(f) = obj.led_fn {
                val.led_fn = Some(f);
            }
            if let Some(bp) = obj.binding_power {
                val.binding_power = Some(bp);
            }
        } else {
            self.registry.insert(symbol, obj);
        }
    }
}
When constructing the registry, we initialize it with default values and then call the `get_token_registry` function, which sets up the `ParserType` for each token.
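The merge behavior of `register` is worth calling out: a token’s `nud` and `led` can be registered in separate calls without clobbering each other. Below is a minimal sketch of that behavior, using `&str` stand-ins for the real function pointers:

```rust
use std::collections::HashMap;

// Simplified stand-in for the real ParserType: strings instead of fn pointers.
#[derive(Default, Clone)]
struct ParserType {
    nud_fn: Option<&'static str>,
    led_fn: Option<&'static str>,
}

// Mirrors the merge logic: update only the fields present in `obj`,
// insert a fresh entry if the symbol is not yet registered.
fn register(registry: &mut HashMap<&'static str, ParserType>, symbol: &'static str, obj: ParserType) {
    if let Some(val) = registry.get_mut(symbol) {
        if obj.nud_fn.is_some() {
            val.nud_fn = obj.nud_fn;
        }
        if obj.led_fn.is_some() {
            val.led_fn = obj.led_fn;
        }
    } else {
        registry.insert(symbol, obj);
    }
}
```

Registering the NUD first and the LED later leaves both in place, which is exactly what a token like `-` (both prefix and infix) needs.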
With the registry ready, we can now return to complete the `expression` function. We add a closure that takes a token, fetches its `ParserType`, and returns it.
let parser_type_fn = |s: IdentifierKind| -> Result<ParserType, errors::KarisError> {
    let rg = TokenRegistry::new();
    let parser_type = rg.retrieve_from_registry(s);
    if parser_type.is_none() {
        return Err(errors::KarisError {
            error_type: errors::KarisErrorType::UnknownToken,
            message: format!(
                "[UNKNOWN TOKEN] Token {:?} Ln {} Col {}",
                token.literal, token.line_number, token.column_number
            ),
        });
    }
    let pt = parser_type.unwrap().clone();
    Ok(pt)
};
Once we have the appropriate `ParserType`, we call the token’s `nud` and `led` functions.
let pt0 = parser_type_fn(token.token_type)?;

let (mut left, mut worked_on_index) = match pt0.nud_fn {
    Some(func) => {
        let res = func(token.clone(), index, bucket.clone())?;
        (res.0, res.1)
    }
    None => (Objects::TyUnknown, 0x00),
};
We first match on whether the current token has a `nud` function (retrieved via `parser_type_fn`). If present, we call it and use its result as the left-hand side expression; otherwise we fall back to an unknown object. With the left-hand side at hand, we now move to the `led` function.
if let Some(next_token) = bucket.borrow().get(worked_on_index + 0x01) {
    let pt1 = parser_type_fn(next_token.token_type)?;
    if let Some(bp) = pt1.binding_power {
        if rbp < bp {
            if let Some(func) = pt1.led_fn {
                let idx = worked_on_index + 0x01;
                let res = func(left, idx, bucket.clone())?;
                left = res.0;
                worked_on_index = res.1;
            } else {
                return Err(errors::KarisError {
                    error_type: errors::KarisErrorType::InvalidSyntax,
                    message: format!(
                        "[INVALID SYNTAX] Not an infix; Token {:?} Ln {} Col {}",
                        token.literal, token.line_number, token.column_number
                    ),
                });
            }
        }
    } else {
        return Err(errors::KarisError {
            error_type: errors::KarisErrorType::MalformedProgram,
            message: format!(
                "[MISSING BINDING POWER] ; Token {:?} Ln {} Col {}",
                token.literal, token.line_number, token.column_number
            ),
        });
    }
}
In Line 1 we use the index of the current token to access the next token from the bucket, and in Line 2 we extract its `ParserType`. In Line 4 we check whether `rbp` is less than the binding power of the next token. If the check passes, we check for the presence of its `led` function and then call it. For complex expressions, this process repeats recursively until the entire expression is parsed. With that, we have the core functionality of our parser.
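The whole loop can be condensed into a self-contained sketch that evaluates a toy stream of numbers and `+`/`*` operators. The names and binding-power values here are illustrative, not the Karis parser’s:

```rust
// Toy token stream: numbers, two infix operators, and an end marker.
#[derive(Clone, Copy)]
enum Tok {
    Num(i64),
    Plus,
    Star,
    End,
}

// Binding power: `*` binds tighter than `+`; `End` stops the loop.
fn bp(t: Tok) -> usize {
    match t {
        Tok::Plus => 10,
        Tok::Star => 20,
        _ => 0,
    }
}

// Returns the evaluated expression and the index one past the tokens consumed,
// mirroring the `(Objects, usize)` pair returned by the real `expression`.
fn expression(rbp: usize, mut index: usize, toks: &[Tok]) -> (i64, usize) {
    // The "nud" step: a number stands for itself.
    let mut left = match toks[index] {
        Tok::Num(n) => n,
        _ => panic!("expected a number"),
    };
    index += 1;
    // The "led" step: while the next operator binds tighter than rbp, recurse.
    while rbp < bp(toks[index]) {
        let op = toks[index];
        let (right, next) = expression(bp(op), index + 1, toks);
        left = match op {
            Tok::Plus => left + right,
            Tok::Star => left * right,
            _ => unreachable!(),
        };
        index = next;
    }
    (left, index)
}
```

Running this over the tokens of `1 + 2 * 3` yields `7`, because the `rbp < bp` comparison forces `*` to be parsed as part of the right-hand side of `+`.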
Register LET token parsing procedure methods
We begin with the `TokenRegistry`. We define a method that registers the `nud` and `led` procedures for the `LET` token.
fn add_let_binding(&mut self) {
    let obj = ParserType {
        nud_fn: Some(Self::parse_let_expressions),
        led_fn: None,
        binding_power: Some(10),
    };
    self.register(IdentifierKind::LET, obj);
}
In Line 2 we instantiate a `ParserType` object, passing in the corresponding values. Let’s first describe how the `LET` token will be parsed. Recall that the `nud` function handles situations where there is no left context, while the `led` function is called when there is a left context. Consider a `let` binding in Karis:
let age @int = 10;
The token in focus is `let`, which has nothing defined before it. In this case, we say that `let` has no left context. Therefore, parsing the `LET` token involves reading the entire statement from left to right until the semicolon is encountered. Thus, we only need to specify the `nud` function, which we do in Line 3. In Line 5 we give the token a binding power of `10`; the actual value of the binding power is arbitrary. In the last line, we add the `ParserType` object for the `LET` token to the registry.
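Since binding-power values are arbitrary, only their relative order matters: every parsing decision is a comparison of the form `rbp < bp`, so scaling the whole table changes nothing. A tiny sketch with hypothetical values:

```rust
// Every parsing decision reduces to this comparison.
fn decision(rbp: usize, bp: usize) -> bool {
    rbp < bp
}

// Scale an entire binding-power table by a constant factor.
fn scale(values: &[usize], factor: usize) -> Vec<usize> {
    values.iter().map(|v| v * factor).collect()
}
```

Because `decision` only compares, a table of `[10, 30, 40]` and a table of `[50, 150, 200]` produce identical parses.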
Implementing the LET parser
As shown above, the core of parsing `LET` resides in the `parse_let_expressions` function. The base principle for parsing a `LET` expression goes as follows:
- We move the cursor to the right until an `ASSIGN (=)` token is encountered.
- We then assert that the variable name token is present in its designated position, otherwise we return an error.
- Next, we assert that the typing token is present in its designated position, otherwise we return an error.
- We then collect all the tokens to the left of `=` and construct a node.
- Finally, we return the node and the current cursor position minus one (`index - 0x01`).
pub(crate) fn parse_let_expressions(tok: Token, index: usize, bucket: Rc<RefCell<Vec<Token>>>) -> Result<(Objects, usize), errors::KarisError> {
    if tok.token_type != IdentifierKind::ASSIGN {
        let new_index = index + 0x1;
        let next_token = bucket.borrow()[new_index].clone();
        return Self::parse_let_expressions(next_token, new_index, bucket.clone());
    }
    /* trimmed for readability; continued below */
}
In Line 2 to Line 5 we move along the expression recursively until we reach the `ASSIGN` token. The `ASSIGN` token signals that the tokens on the left contain a binding variable and typing information, while the tokens on the right contain either a value or another expression. With the `ASSIGN` token found, we now dissect the bucket to find where the variable name and typing information are. Consider the following example:
let age @int = 10;
From the above, we can see that the `LET` expression structure is deterministic, and we can accurately determine the indices of the tokens left of the `ASSIGN`:

- `LET` is index 0
- `age` is index 1
- `@int` is index 2
With this information, we can now add some additional logic to check and verify the presence of both variable name and typing information.
// parser validation : check if variable name has been specified
let variable_name_token = borrow.get(index_of_let_ident + 0x01);
if variable_name_token.is_none() {
    return Err(errors::KarisError {
        error_type: errors::KarisErrorType::MissingVariableName,
        message: format!(
            "[MISSING VARIABLE NAME] Expected to find `<variable name>` ; Token {:?} Ln {} Col {}",
            tok.literal, tok.line_number, tok.column_number
        ),
    });
}

// parser validation : check if typing information has been provided
let typing_token = borrow.get(index_of_let_ident + 0x02);
if typing_token.is_none() {
    return Err(errors::KarisError {
        error_type: errors::KarisErrorType::MissingTypeInfo,
        message: format!(
            "[MISSING TYPE INFO] Expected to find either `@int | @string | @bool | @unit` ; Token {:?} Ln {} Col {}",
            tok.literal, tok.line_number, tok.column_number
        ),
    });
}
As mentioned earlier, the right-hand side of the `ASSIGN` token can be either a value or another expression. For now, we will restrict ourselves to handling only values.
let typing_token = typing_token.unwrap();
if typing_token.token_type == IdentifierKind::LSQUAREBRACE {
    /* trimmed for readability */
} else {
    let node = Node {
        identifier_kind: Some(IdentifierKind::LET),
        variable_name: Some(variable_name_token.unwrap().literal.clone()),
        return_type: Some(Self::typing_kind(typing_token)),
        ..Default::default()
    };
    // return index that points to the `typing` not the `Assign`
    Ok((Objects::TyNode(node), index - 0x1))
}
In Line 5 to Line 10 we construct a `Node` object for the `LET` token and return it back up the stack to build our abstract syntax tree. This is only true for values. For expressions, the logic involves calling the `expression` function, which recursively builds nodes for each part of the expression.
With that, our parser is ready to parse `LET` expressions.
Register RETURN token parsing procedure methods
We begin with the `TokenRegistry`. We define a method that registers the `nud` and `led` procedures for the `RETURN` token.
fn add_return(&mut self) {
    let obj = ParserType {
        nud_fn: Some(Self::parse_return),
        led_fn: None,
        binding_power: Some(0x00),
    };
    self.register(IdentifierKind::RETURN, obj);
}
Because `RETURN` does not have any left context, we only define its `nud` function. In Line 5 we set its binding power to zero (0), which indicates that the token has the least precedence.
Implementing the RETURN parser
Let’s consider the following simple return statements:
return true;
return 10;
return add(1,2);
From the above, we can see that each return statement terminates with a `SEMICOLON`. What is returned to the caller can be either a value or an expression, as seen in Line 3. We can now deduce how the actual parser will work:
- First, we move the cursor to point to the next token after the `RETURN` token and parse it.
- We then use the returned index to move to the last token and assert whether it’s a `SEMICOLON`. If a `SEMICOLON` is absent, we return a parser error. Otherwise, we return the `RETURN` node as part of the abstract syntax tree.
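These two steps can be sketched over plain string tokens; `parse_return_stmt` below is a hypothetical stand-in for the real implementation:

```rust
// A simplified sketch: take what follows `return`, then insist on a trailing `;`.
fn parse_return_stmt(tokens: &[&str]) -> Result<String, String> {
    // Step 1: the token right after `return` is the returned value (taken as-is here).
    let value = tokens.get(1).ok_or("expected something after `return`")?;
    // Step 2: the token after the parsed value must be a semicolon.
    match tokens.get(2) {
        Some(&";") => Ok(value.to_string()),
        _ => Err("expected tail `;`".into()),
    }
}
```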
Let’s see the implementation in action.
pub(crate) fn parse_return(tok: Token, index: usize, bucket: Rc<RefCell<Vec<Token>>>) -> Result<(Objects, usize), errors::KarisError> {
    let (obj, index) = Parser::expression(0, index + 0x01, bucket.clone())?;
    /* trimmed for readability; continued below */
}
The current index in Line 1 points to the `RETURN` token. We therefore increase the index by one and call the `expression` function on the new index, as seen in Line 2. The result is a node object and the index of the last token parsed. The next step is to check for the presence of a `SEMICOLON`.
let borrow = bucket.borrow();

let next_index = index + 0x01;
let next_token = borrow.get(next_index);

if next_token.is_none() {
    return Err(errors::KarisError {
        error_type: errors::KarisErrorType::InvalidSyntax,
        message: format!(
            "[INVALID SYNTAX] Syntax not correct. Expected something after `{}`. Ln {} Col {}",
            tok.literal, tok.line_number, tok.column_number
        ),
    });
}
let next_token = next_token.unwrap();

if next_token.token_type != IdentifierKind::SEMICOLON {
    return Err(errors::KarisError {
        error_type: errors::KarisErrorType::InvalidSyntax,
        message: format!(
            "[INVALID SYNTAX] Syntax not correct. Expected tail `;` after `{}`. Ln {} Col {}",
            tok.literal, tok.line_number, tok.column_number
        ),
    });
}
In Line 1 we get a handle on our bucket of tokens. Using the index returned by the `expression` call, we increase the index again and try to access the token at that position, as seen in Line 3 and Line 4. A valid syntax should have a token at that position; otherwise, we return an error, as seen in Line 6. If a token is found, we can confidently extract its value in Line 15 and then check whether it is a `SEMICOLON`. We expect a `SEMICOLON` to be present; otherwise, our syntax is incorrect and we return an error, as seen in Line 17. If everything checks out, we build a return node that will be appended to our abstract syntax tree.
let node = Node {
    identifier_kind: Some(IdentifierKind::RETURN),
    right_child: Some(Right(Box::new(obj))),
    ..Default::default()
};

// move the index to the adjacent `rbrace` token
Ok((Objects::TyNode(node), index + 0x01))
It is important to note that we move the cursor to point to the token after the semicolon. According to our syntax rules, this token will be the right-hand closing brace, or `RBRACE` for short. We do this so that the next run of the `expression` function begins at the right position.
To correctly parse a string literal, we first need to understand how a string literal is framed in the Karis language. Consider the following example:
let name @string = "Karis";
We can see that the string begins after the `ASSIGN` token. This means that we need a handler for tokens that come after the `ASSIGN` token. Note that these tokens can be literals or other complex expressions, such as functions. We will, however, limit ourselves to handling strings.
Right-hand side tokens parsing
We begin by instantiating a `ParserType` object inside the `add_assign` function.
fn add_assign(&mut self) {
    let obj = ParserType {
        nud_fn: None,
        led_fn: Some(Self::parse_assign_operator),
        binding_power: Some(30),
    };
    self.register(IdentifierKind::ASSIGN, obj);
}
The most vital part is Line 4, which injects a LED function. The `ASSIGN` token has both a left context and a right context. Its resulting node object needs to include both sides, which may resemble the tree below:
      ASSIGN(=)
       /    \
      /      \
LET(name)  "karis"
In Line 5 we set its binding power to a significantly higher value because we want it to always sit at the center of the resulting node object. We now proceed to introduce the `parse_assign_operator` function.
pub(crate) fn parse_assign_operator(left: Objects, token_index: usize, bucket: Rc<RefCell<Vec<Token>>>) -> Result<(Objects, usize), errors::KarisError> {
    todo!()
}
Notice in Line 1 that the first argument the function expects is an `Objects`, which represents the left-hand side of the assignment. In all scenarios, the left-hand side will always be a `LET` binding. Recall that we will build an appropriate node object, hence we need access to the left-hand side node object. Next, we add a few validation checks to make sure the input has the correct syntax.
let (current_token, next_token) = Self::preconditions(
    token_index,
    bucket.clone(),
    vec![
        IdentifierKind::EOS,
        IdentifierKind::EOF,
        IdentifierKind::SEMICOLON,
        IdentifierKind::LET,
        IdentifierKind::GT,
        IdentifierKind::LT,
        IdentifierKind::GTOREQ,
        IdentifierKind::LTOREQ,
        IdentifierKind::EQ,
        IdentifierKind::SLASH,
        IdentifierKind::MODULUS,
        IdentifierKind::MAIN,
        IdentifierKind::END,
    ],
)?;
The main idea here is that if the next token after the `ASSIGN` token is not what the parser expects, we return an error. Let’s demonstrate such erroneous syntax with a few examples:
| Error | Snippet |
|---|---|
| Incomplete assignment | `let name @string =` |
| Invalid program | `=` |
| Invalid keyword assignment | `let x @string = let;`<br>`let x @string = @main;`<br>`let x @string = @end;` |
| Invalid assignment | `let name @string = ;`<br>`let a @int = >;`<br>`let b @int = <;`<br>`let c @int = >=;`<br>`let d @int = <=;`<br>`let e @int = =;`<br>`let e @int = /;`<br>`let e @int = %;` |
The next thing we do is check whether the token two steps forward is a `SEMICOLON`. A `SEMICOLON` there indicates that we are dealing with a literal and not a complex expression.
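As a sketch (assuming the index points at the `ASSIGN` token), the lookahead amounts to:

```rust
// A `;` two positions after `=` means the right-hand side is a single literal.
fn rhs_is_literal(tokens: &[&str], assign_index: usize) -> bool {
    matches!(tokens.get(assign_index + 2), Some(&";"))
}
```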
let borrow = bucket.borrow();
let two_step_token = borrow.get(token_index + 0x02);
if let Some(two_step_token) = two_step_token {
    if two_step_token.token_type == IdentifierKind::SEMICOLON {
        todo!()
    }
}
In Line 1 we borrow the bucket containing all the tokens. In Line 2 we move the cursor two steps forward by increasing the current index by two, then retrieve the token at the new index. In Line 4 we check whether the found token is a `SEMICOLON`. Inside the block, we now add logic to check whether the next token is a `STRINGLITERAL` and, if so, return its appropriate node object.
if next_token.token_type == IdentifierKind::STRINGLITERAL {
    let obj = StringValue { value: Some(next_token.literal) };
    let literal = LiteralObjects::ObjStringValue(obj);

    let node = Node {
        identifier_kind: Some(current_token.token_type),
        left_child: Some(Right(Box::new(left))),
        right_child: Some(Left(literal)),
        ..Default::default()
    };

    // move the cursor to the end. This will finish the recursive call stack and return to the caller
    return Ok((Objects::TyNode(node), token_index + 0x02));
}
With that, we can confidently parse string literals.