Have you ever wondered how programming languages like Python, Java or Ruby work? Do you have an interest in building your own language? If so, this series of posts is for you. We dive into building a novel programing language from first principles. Get ready to have your hands dirty. It’s going to be a fun ride. Access the final product here : https://github.com/daviddexter/karis-programming-language
Overview of parsing conditional expressions
Parsing conditional expressions is slightly different from the expressions we have encountered so far. Let’s examine how a conditional expression looks like:
if 5 > 3 {
return true;
}else {
return false;
}
In Line 1 we see the expression begins with the IF
token followed by a condition statement then breaks into a block with the LBRACE
token. In Line 3 the alternate condition begins with the ELSE
token and continues to break into a block. Therefore the steps to parse a conditional expression begins with breaking each part of the expression into it’s on its own distinct “program” and then merging them to form a node object. In the Program
type, there is a non_root
field which we will specify for every expression part.
Register parsing procedure methods
We introduce three parsing procedures, one for the LBRACE
token, one for the IF
token and the other for the ELSE
token.
fn add_if_statement(&mut self) {
let obj = ParserType {
nud_fn: Some(Self::parse_if_else_expressions),
led_fn: None,
binding_power: Some(10),
};
self.register(IdentifierKind::IF, obj);
}
fn add_else_statement(&mut self) {
let obj = ParserType {
nud_fn: Some(Self::parse_if_else_expressions),
led_fn: None,
binding_power: Some(10),
};
self.register(IdentifierKind::ELSE, obj);
}
fn add_left_brace_statement(&mut self) {
let obj = ParserType {
nud_fn: Some(Self::parse_block),
led_fn: None,
binding_power: Some(10),
};
self.register(IdentifierKind::LBRACE, obj);
}
IF
and ELSE
token
Both tokens use the same parsing function parse_if_else_expressions
. Let’s add it.
pub(crate) fn parse_if_else_expressions(tok: Token,index: usize, bucket: Rc<RefCell<Vec<Token>>>) -> Result<(Objects, usize), errors::KarisError> {
todo!("add implementation here");
}
The first step is to check whether the current token is either an IF
or ELSE
. Any other token would be considered invalid.
let borrow = bucket.borrow();
if borrow.get(0x0).is_none() {
return Err(errors::KarisError {
error_type: errors::KarisErrorType::MissingConditionalIndentifier,
message: format!(
"[MISSING CONDITIONAL IDENTIFIER] Expected to find `if` or `else` ; Token {:?} Ln {} Col {}",
tok.literal, tok.line_number, tok.column_number
),
});
}
Next, we add another validation of checking whether there is an LBRACE
token. We do this by moving the cursor forward until we find the token. In the event the token is absent we return an error indicating the source text has an invalid syntax.
let mut end_index: usize;
#[allow(clippy::redundant_clone)]
let index_at_if_lbrace = Self::traverse_forward_until(
tok.clone(),
index,
bucket.clone(),
IdentifierKind::LBRACE,
)?;
We then extract the tokens between the IF
token and LBRACE
and parse them in isolation.
let items_before_if_lbrace = borrow.get(index + 0x01..index_at_if_lbrace).unwrap();
let exp_vec_tokens = Vec::from(items_before_if_lbrace);
let expression_node = Parser::default()
.program_as_non_root()
.parse_from_vec(exp_vec_tokens)?;
In Line 1 we index into the bucket using the current index and the index of the LBRACE
token. In Line 2 we then create a new bucket and call the parse_from_vec
function of the Parser
object which parses the tokens and returns an appropriate expression object. Next, we retrieve the index of the RBRACE
token closing the IF
block. We then extract all the tokens with the block, construct a new bucket and call the parse_from_vec
function of the `Parser.
let index_at_if_rbrace = Self::traverse_forward_until(
tok.clone(),
index,
bucket.clone(),
IdentifierKind::RBRACE,
)?;
let items_after_lbrace = borrow
.get(index_at_if_lbrace + 0x01..index_at_if_rbrace)
.unwrap();
let if_block_vec_tokens = Vec::from(items_after_lbrace);
let if_block_node = Parser::default()
.program_as_non_root()
.parse_from_vec(if_block_vec_tokens)?;
We check for the presence of an ELSE
token and do the same thing we have done above. Finally, we wrap all objects and return the corresponding node object.
let node = Node {
identifier_kind: Some(IdentifierKind::IF),
right_child: Some(Right(Box::new(expression_node))),
alternate: alternate_node,
block_children: Some(Vec::from([if_block_node])),
..Default::default()
};
Ok((Objects::TyNode(node), end_index))
Overview of parsing functions expressions
A function in Karis begins with the keyword fn
then followed by a LPAREN
and optionally arguments.
Therefore parsing a function will follow the following steps:
- Identify the
fn
keyword to validate it’s a function - Collect any arguments that may be bounded within the
LPAREN
andRPAREN
tokens - Create a node object representing the function definition inserting the arguments
Register parsing procedure methods
We start by registering the appropriate parsing functions similar to what we have done for the previous literals and expressions.
fn add_function_declaration(&mut self) {
let obj = ParserType {
nud_fn: Some(Self::parse_function_definition),
led_fn: None,
binding_power: Some(70),
};
self.register(IdentifierKind::FUNCTION, obj);
}
We continue with now introducing the parse_function_definition
which will accomplish the actual parsing.
pub(crate) fn parse_function_definition(tok: Token,index: usize,bucket: Rc<RefCell<Vec<Token>>>) -> Result<(Objects, usize), errors::KarisError> {
todo("add implementation here")
}
Like defensive engineers which we are, we first validate that the function definition is correct before we proceed to parse.
let borrow = bucket.borrow();
let next_index = index + 0x01;
let next_token = borrow.get(next_index);
if next_token.is_none() {
return Err(errors::KarisError {
error_type: errors::KarisErrorType::InvalidSyntax,
message: format!(
"[INVALID SYNTAX] Syntax not correct. Expected something after `{}. Ln {} Col {} '",
tok.literal,tok.line_number,tok.column_number
),
});
}
In Line 1 to Line 3 we get a copy of the bucket and increment the current index by one then use the new index to retrieve the next token. We then validate whether there is actually a next token in the bucket. If the validation fails we return an error. This validation will catch incorrect syntax like:
let echo @string = fn
The next validation is to ensure that the next token is an LPAREN
token.
let next_token = next_token.unwrap();
if next_token.token_type != IdentifierKind::LPAREN {
return Err(errors::KarisError {
error_type: errors::KarisErrorType::InvalidSyntax,
message: format!(
"[INVALID SYNTAX] Syntax not correct. Expected `(` after `{}. Ln {} Col {} '",
tok.literal, tok.line_number, tok.column_number
),
});
}
If the validation checks out, we proceed to call the helper function collect_function_definition_args
. This function will recursively move the cursor to the right collecting and parsing arguments that may be enclosed between the LPAREN
and RPAREN
tokens. It will return the arguments as a vector and the position of the LBRACE
token.
let (function_args, lbrace_index) = Self::collect_function_definition_args(
next_index + 0x01,
bucket.clone(),
Vec::new(),
0x00,
)?;
We add another validation this time checking whether the function has a body. A function without a body is invalid.
let lbrace = borrow.get(lbrace_index).unwrap();
if lbrace.token_type != IdentifierKind::LBRACE {
return Err(errors::KarisError {
error_type: errors::KarisErrorType::InvalidSyntax,
message: format!(
"[INVALID SYNTAX] Syntax not correct. Expected left-brace. Found {}. Ln {} Col {} '",
lbrace.literal, lbrace.line_number, lbrace.column_number
),
});
}
Once the validation succeeds, we they jump inside the block and again call the collect_function_definition_args
helper function to collect and parse expressions in the body.
let (block_children, last_index) = Self::collect_block_children(lbrace_index, bucket.clone(), Vec::new(), 0x00)?;
With that now we construct a node object representing the function definition passing in the appropriate arguments and children.
let node = Node {
identifier_kind: Some(IdentifierKind::FUNCTION),
func_params: Some(function_args),
block_children: Some(block_children_filtered),
..Default::default()
};
Ok((Objects::TyNode(node), last_index))
We are done parsing functions.
Overview of parsing function call expressions
A call expression begins with what the lexer will identify as a VARIABLE
. This is just a string then followed by LPAREN
and RPAREN
tokens.
Optionally, a function call expression can have call parameters that are passed to the function. Here are some examples:
// Without parameter
add()
// Parameters as literal
add(1, 2, 3)
// Parameters as function calls
add(one(),two())
// Parameters as variables
add(x,y)
Therefore, parsing a function call expression involves the following steps:
- Identifying the beginning of the function call expression
- Collect and parse any arguments that may be present between the
LPAREN
andRPAREN
tokens - Create a node object representing the call expression definition inserting the provided parameters
Register parsing procedure methods
Similar to what we have so far, we first register the appropriate parsing methods.
fn add_call_declaration(&mut self) {
let obj = ParserType {
nud_fn: Some(Self::parse_function_call),
led_fn: None,
binding_power: Some(70),
};
self.register(IdentifierKind::CALLER, obj);
}
We now introduce the parse_function_call
associated function which will handle the actual parsing.
pub(crate) fn parse_function_call(tok: Token,index: usize,bucket: Rc<RefCell<Vec<Token>>>) -> Result<(Objects, usize), errors::KarisError> {
todo!("add implementation here")
}
As it has become a custom in this course, we first add a few validations to ensure that we working with the correct syntax. From an engineering perspective, such defensive programming is advisable to make our entire program somewhat error-proof.
let borrow = bucket.borrow();
let next_index = index + 0x01;
let next_token = borrow.get(next_index);
if next_token.is_none() {
return Err(errors::KarisError {
error_type: errors::KarisErrorType::InvalidSyntax,
message: format!(
"[INVALID SYNTAX] Syntax not correct. Expected something after `{}. Ln {} Col {} '",
tok.literal,tok.line_number,tok.column_number
),
});
}
In Line 1 to Line 3 we retrieve the next token the check whether there is actually a next token. Afterwards, we examine the next token to see if it is an LPAREN
token. Recall that from the parser’s perspective the current token is a VARIABLE
token. The only way for it to be aware if it is handling a function call expression is if a (
is adjacent to the variable. Otherwise, the variable will treat like a variable only.
let next_token = next_token.unwrap();
if next_token.token_type != IdentifierKind::LPAREN {
return Err(errors::KarisError {
error_type: errors::KarisErrorType::InvalidSyntax,
message: format!(
"[INVALID SYNTAX] Syntax not correct. Expected `(` after `{}. Ln {} Col {} '",
tok.literal, tok.line_number, tok.column_number
),
});
}
Next, we call the helper function collect_function_call_params
which will move the cursor to the right until it reaches the closing RPAPEN
while collecting and parsing any available parameters.
let (params, closing_paren_index) =
Self::collect_function_call_params(next_index, bucket.clone(), Vec::new(), 0x00)?;
The actual implementation of the collect_function_call_params
function is quite involving. See the full implementation at the end of the lesson for details.
Once we have parsed all parameters, we then construct an appropriate node object representing the function call. We should however be aware that Karis has built-in functions that we need to handle as well. For now, however, we will leave a todo
to remind us to handle such functions later in the course.
let node = if tok.token_type == IdentifierKind::PRINT || tok.token_type == IdentifierKind::FORMAT {
todo!("add implementation here")
} else {
Node {
variable_name: Some(tok.literal),
identifier_kind: Some(IdentifierKind::CALLER),
call_params: Some(params),
..Default::default()
}
};
Ok((Objects::TyNode(node), closing_paren_index))
That concludes our implementation for parsing functions calls.