Have you ever wondered how programming languages like Python, Java or Ruby work? Do you have an interest in building your own language? If so, this series of posts is for you. We dive into building a novel programming language from first principles. Get ready to get your hands dirty. It’s going to be a fun ride. Access the final product here: https://github.com/daviddexter/karis-programming-language
Overview of character tokenization
The term “character” is relatively broad. 1 is a character. So is & or #. For our purpose in this lesson, “character” will refer to any non-alphanumeric character; we do not consider numbers and alphabets in our endeavour. Karis accepts quite a decent number of characters, except for ~.
One-piece characters
These characters are the easiest to tokenize. We call them one-piece characters because they are composed of only one character. An example would be ; or (. Nowhere in Karis syntax do we expect to have ;; or ((. Even visually, it looks fundamentally wrong.
COMMA token
We represent a comma as COMMA in the lexer. Let’s see how we will achieve it. In the lexer.rs file, we will expand the new_token function to contain the tokenization logic. In the current implementation, after removing whitespaces, we move to the match expression, where we evaluate the presence of a character. There are two possible cases, presence or absence, indicated by the Some and None match arms respectively, as shown below.
fn new_token(&mut self) -> Result<tokens::Token, errors::KarisError> {
    if self.read_position == 0x0 {
        self.read_char();
    }
    self.skip_whitespace_and_string_escape();
    match &self.ch {
        Some(ch) => {}
        None => {}
    }
}
Let’s handle the absent (None) case first since it’s relatively straightforward. We earlier defined an unknown_token_error() function. We’ll put it to use here. Add this line in the None block:
self.unknown_token_error(" ")
That’s all that we need to do. Let’s now move to the Some block.
let ch_owned = ch.clone();
let tok = match ch.as_str() {
    tokens::COMMA => Ok(tokens::Token::new(
        tokens::IdentifierKind::COMMA,
        ch_owned,
        self.line_number,
        self.position,
    )),
    // more match arms will be added here
};
self.move_current_position_and_read();
tok
In Line 1, we retrieve the current character in scope. Since the “character” is of type String, we can’t just assign it. In Rust, String does not implement the Copy trait. This trait allows a value to be “copied” around in memory; essentially, copy creates duplicates in memory. Types that implement this trait are primitive types like u8, u16, i32, bool and &str. String types are heap allocated and lack native copy support. As such, we must explicitly state that we intend to copy by calling its clone method.
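To make the difference concrete, here is a small standalone illustration, not part of the lexer itself:

let a: i32 = 10;
let b = a; // i32 implements Copy, so a is still usable after the assignment
println!("{} {}", a, b);

let s = String::from("karis");
let t = s.clone(); // String is heap allocated, so we must clone explicitly
println!("{} {}", s, t);

Without the clone call, ownership of the String would move and the original binding would no longer be usable.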
In Line 2, we match the current ch against a set of patterns. The first pattern, on Line 3, is tokens::COMMA. If the pattern matches, we return a token object whose identifier is of kind tokens::IdentifierKind::COMMA. See, that was straightforward. The move_current_position_and_read function defined earlier is called last to move the cursor forward so that we can process the next character.
Similar one-piece character patterns
Moving on, let’s add similar pattern checks and return corresponding tokens.
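For instance, the arms for a semicolon and parentheses might look like the sketch below, mirroring the COMMA arm above. It assumes the tokens module exposes SEMICOLON, LPAREN and RPAREN constants together with matching IdentifierKind variants:

// a sketch of additional one-piece arms; the exact constant and
// IdentifierKind names are assumptions
tokens::SEMICOLON => Ok(tokens::Token::new(
    tokens::IdentifierKind::SEMICOLON,
    ch_owned,
    self.line_number,
    self.position,
)),
tokens::LPAREN => Ok(tokens::Token::new(
    tokens::IdentifierKind::LPAREN,
    ch_owned,
    self.line_number,
    self.position,
)),
tokens::RPAREN => Ok(tokens::Token::new(
    tokens::IdentifierKind::RPAREN,
    ch_owned,
    self.line_number,
    self.position,
)),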
Two-piece characters
Two-piece characters come in combinations of two. Consider a token signifying greater-than or equal to. Such a token is represented by >=, which is a combination of > and =.
ASSIGN or EQUAL token

| Character | Meaning |
| --- | --- |
| = | Assign |
| == | Equal |
Before we introduce the implementation, let’s first outline how the logic will work.
We want to match the current ch against an ASSIGN pattern. This match will signal that the lexer has encountered a = symbol. Next, we check if the next character in the input text matches the pattern of ASSIGN. If a match is found, we move the cursor forward by one step and then report that the lexer has encountered a == symbol. Otherwise, we register that the lexer has located a = character only.
Let’s demonstrate it in code.
tokens::ASSIGN => match self.forward_is_any_token(vec![tokens::ASSIGN]) {
    Some(val) => {
        if val {
            self.move_current_position_and_read();
            Ok(tokens::Token::new(
                tokens::IdentifierKind::EQ,
                ch_owned,
                self.line_number,
                self.position,
            ))
        } else {
            Ok(tokens::Token::new(
                tokens::IdentifierKind::ASSIGN,
                ch_owned,
                self.line_number,
                self.position,
            ))
        }
    }
    None => Ok(tokens::Token::new(
        tokens::IdentifierKind::ASSIGN,
        ch_owned,
        self.line_number,
        self.position,
    )),
},
In Line 1, we utilize the forward_is_any_token helper function to check whether the next token matches ASSIGN. We expect the process to return a value or nothing. In Line 2, we are sure a result of the function is present. We check whether a match was indeed found in Line 3 and then return an EQ token. Otherwise, in Line 12, we produce an ASSIGN token.
OR token

| Character | Meaning |
| --- | --- |
| \| | Logical OR |
| \|\| | OR |
Before we introduce the implementation, let’s first outline how the logic will work.
We want to match the current ch against a PIPE pattern. This match will signal that the lexer has encountered a | symbol. Next, we check if the next character in the input text matches the pattern of PIPE. If we find a match, we move the cursor forward by one step and then report that the lexer has encountered a || symbol. Otherwise, we register that the lexer has discovered a | character only.
Let’s demonstrate it in code.
tokens::PIPE => match self.forward_is_any_token(vec![tokens::PIPE]) {
    Some(val) => {
        if val {
            self.move_current_position_and_read();
            let lit = format!("{:?}{:?}", tokens::PIPE, tokens::PIPE);
            Ok(tokens::Token::new(
                tokens::IdentifierKind::OR,
                lit,
                self.line_number,
                self.position,
            ))
        } else {
            Ok(tokens::Token::new(
                tokens::IdentifierKind::LOR,
                ch_owned,
                self.line_number,
                self.position,
            ))
        }
    }
    None => Ok(tokens::Token::new(
        tokens::IdentifierKind::LOR,
        ch_owned,
        self.line_number,
        self.position,
    )),
},
In Line 1, we utilize the forward_is_any_token helper function to check whether the next token matches PIPE. We expect the process to return a value or nothing. In Line 2, we are sure a result of the function is present. We check whether a match was indeed found in Line 3 and then return an OR token. Otherwise, in the else branch, we return a LOR token.
AND token

| Character | Meaning |
| --- | --- |
| & | Logical AND |
| && | AND |
Tokenization of the & character borrows its implementation from the tokenization of either = or |.
The code snippet below demonstrates that.
tokens::AMPERSAND => match self.forward_is_any_token(vec![tokens::AMPERSAND]) {
    Some(val) => {
        if val {
            self.move_current_position_and_read();
            let lit = format!("{:?}{:?}", tokens::AMPERSAND, tokens::AMPERSAND);
            Ok(tokens::Token::new(
                tokens::IdentifierKind::AND,
                lit,
                self.line_number,
                self.position,
            ))
        } else {
            Ok(tokens::Token::new(
                tokens::IdentifierKind::LAND,
                ch_owned,
                self.line_number,
                self.position,
            ))
        }
    }
    None => Ok(tokens::Token::new(
        tokens::IdentifierKind::LAND,
        ch_owned,
        self.line_number,
        self.position,
    )),
},
In Line 1, we utilize the forward_is_any_token helper function to check whether the next token matches AMPERSAND. We expect the process to return a value or nothing. In Line 2, we are sure a result of the function is present. We check whether a match was indeed found in Line 3 and then return an AND token. Otherwise, in the else branch, we return a LAND token.
NOTEQ or BANG token
We represent the NOTEQ token using a combination of ! and =, while the BANG token is represented by a single !. As we will see later, ! is eventually evaluated to mean negate. Let’s see how we tokenize it in code.
tokens::BANG => match self.forward_is_any_token(vec![tokens::ASSIGN]) {
    Some(val) => {
        if val {
            self.move_current_position_and_read();
            Ok(tokens::Token::new(
                tokens::IdentifierKind::NOTEQ,
                ch_owned,
                self.line_number,
                self.position,
            ))
        } else {
            Ok(tokens::Token::new(
                tokens::IdentifierKind::BANG,
                ch_owned,
                self.line_number,
                self.position,
            ))
        }
    }
    None => Ok(tokens::Token::new(
        tokens::IdentifierKind::BANG,
        ch_owned,
        self.line_number,
        self.position,
    )),
},
In Line 1, we retrieve the next character and compare it against the ASSIGN token. At this point, we already know that the current character is a !. If the match is affirmative, we move the cursor forward as shown on Line 4, and then we return a NOTEQ token. Otherwise, we produce a BANG token.
Tokenizing the LTOREQ, LT, GTOREQ and GT tokens follows the same logic pattern of asserting the next character against the ASSIGN token and returning the corresponding token, as sketched below.
Anatomy of tests in Rust
Rust comes with a powerful test runner that allows us to get to testing without any external dependencies.
At its simplest, a test is a normal function decorated with a #[test]
attribute. Let’s explore further
with a concrete example.
In the lexer.rs
, we will add the snippet below:
#[cfg(test)]
mod lexer_tests {
    use super::*;
    #[test]
    fn should_test_something() {
        // test body
    }
}
In Line 2, we introduce a new module, lexer_tests. The module’s name can be anything we need it to be. What is important is the #[cfg(test)] attribute on Line 1. The attribute tells the compiler that our lexer_tests module holds tests and should only be compiled when running tests; hence it will not be part of the final binary after compilation. In Line 3, we “import” everything from the parent namespace into the module’s namespace. This importation will allow us to access definitions in the parent namespace. In Line 5, the function should_test_something has an attribute above it. This attribute does only one thing: it identifies the function as a test.
Tests implementation
Let’s now add an actual test case to verify that our lexer tokenizes characters.
Verify character tokenization
We will begin by writing a test that asserts whether our current implementation can read any character we pass
to it and return the corresponding token. Inside the lexer_tests
module, add this test:
#[test]
fn should_read_char() {
    let mut lx = Lexer::new(String::from(",;(){}=+-!*/<>"));
}
In Line 3, we instantiate the Lexer with a string input containing characters. The expectation is that the lexer will be able to pinpoint what each character represents. As such, we need to add an assertion. By examining the input string, we expect the first token to be a COMMA. Let’s prove that.
assert_eq!(
    lx.new_token().unwrap().token_type,
    tokens::IdentifierKind::COMMA
);
Wonderful. It works. Let’s trigger a failure and see how it behaves. Amend the expectation from tokens::IdentifierKind::COMMA to tokens::IdentifierKind::ASSIGN, then run the test again. It fails, exactly as we expected. Let’s now move on and add more assertions; each character will have its own assertion, as sketched below. We then have a complete test verifying that our lexer can read characters and return their corresponding token kinds.
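For reference, the follow-up assertions might look like the sketch below. The IdentifierKind variant names beyond those already shown (for example SEMICOLON, LPAREN and RPAREN) are assumptions based on the tokens module:

// each call to new_token consumes the next character in ",;(){}=+-!*/<>"
assert_eq!(
    lx.new_token().unwrap().token_type,
    tokens::IdentifierKind::SEMICOLON
);
assert_eq!(
    lx.new_token().unwrap().token_type,
    tokens::IdentifierKind::LPAREN
);
assert_eq!(
    lx.new_token().unwrap().token_type,
    tokens::IdentifierKind::RPAREN
);
/* and so on for the remaining characters in the input */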
Verify one-piece and two-piece tokenization
We will feed the lexer an input string containing both one-piece and two-piece characters. The expectation is that it should be able to tell whether a character is one-piece or two-piece. For instance, >= should not return two tokens, GT and ASSIGN. Instead, it should return a single token of the kind GTOREQ.
#[test]
fn should_read_equal_token0() {
    let mut lx0 = Lexer::new(String::from("; == && || & |"));
    /* similar assertions */
}
The overall logic structure of this test is quite similar to the one described above.
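For reference, the assertions in the body might look like the sketch below; the IdentifierKind name used for the semicolon is an assumption:

assert_eq!(
    lx0.new_token().unwrap().token_type,
    tokens::IdentifierKind::SEMICOLON
);
assert_eq!(
    lx0.new_token().unwrap().token_type,
    tokens::IdentifierKind::EQ
);
assert_eq!(
    lx0.new_token().unwrap().token_type,
    tokens::IdentifierKind::AND
);
/* similar assertions for OR, LAND and LOR */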
$ cargo test --package lexer --lib -- lexer::lexer_tests::should_read_equal_token0
Fantastic, it works like a charm.
Similar tokenization verifications
Let’s expand the test suite with relatively similar test cases.
Test for EQ token
We begin with the EQ token.
#[test]
fn should_read_equal_token1() {
    let mut lx = Lexer::new(String::from("=="));
    assert_eq!(
        lx.new_token().unwrap().token_type,
        tokens::IdentifierKind::EQ
    );
}
In this test, given an input string that contains ==, we expect the lexer to identify its character set as an EQ token.
Test an assembly of characters
In the following test case, we feed the lexer a long sequence of characters. This test should validate that the lexer walks along the entire text length and identifies tokens. If the test succeeds, we can rest assured that the lexer can handle input texts of any size.
#[test]
fn should_read_equal_token4() {
    let mut lx = Lexer::new(String::from("==+>=!()=<>*<=-/~ -"));
    assert_eq!(
        lx.new_token().unwrap().token_type,
        tokens::IdentifierKind::EQ
    );
    /* trimmed for readability */
    assert!(lx.new_token().is_err());
    assert_eq!(
        lx.new_token().unwrap().token_type,
        tokens::IdentifierKind::MINUS
    );
}
There is one peculiar thing about the test case above. Near the tail end of the input string, we see a rather strange character, ~. The character is not part of our token set. The lexer should therefore return an error when it encounters it, and that is what the assertion on Line 9 verifies. We have now demonstrated that our lexer can:

1. Identify any legal character and deduce its token.
2. Identify illegal characters.
3. Handle both one-piece and two-piece characters and deduce their respective tokens.
Overview of typings tokenization
Karis is a statically typed language. The compiler that converts programs written in Karis needs to know what “kind” of data a variable has. This typing information is explicitly declared using syntax markers that the language provides, that is @int, @string, @bool, and @unit. Here, we will implement mechanisms to tokenize such typing information. We will also include tokenization of @main and @end since they share almost the same tokenization logic flow. Let’s examine the structure of typing information.
@int
@string
@bool
Each has a prefix of @ followed by an alphabetic sequence of characters. With this insight at hand, we now know that to tokenize a “typing”, we first need to match its prefix and then its tail sequence.
Implementation
We define a new function, extract_token_from_alphabet(). As its name indicates, this function will be responsible for extracting the token from a sequence of alphabetic characters, essentially any character that is not a symbol, except for @.
fn extract_token_from_alphabet(&mut self) -> Result<tokens::Token, errors::KarisError> {
    // implementation here
}
Inside the function, let’s add a match
and run a pattern match on the current character under observation.
match &self.ch {
    Some(current_literal) => {}
    None => self.unknown_token_error(tokens::NULL),
}
Rightly so, if we fail to match the current character, we return a NULL token. Before the match block, we will introduce a reusable closure that looks ahead in the input string and returns the corresponding character.
// peek forward
let next_char_fn = |begin: usize, end: usize, default| self.input.get(begin..end).unwrap_or(default);
Inside the Some arm, we call the closure to retrieve the next character from the input string.
let next_char = next_char_fn(
    self.position + 0x1,
    self.read_position + 0x1,
    tokens::SEMICOLON,
);
Now that we have the current character and a means to get the next character, we can add some checks to assert either of the following:

- If the current character is @, are the following four characters main?
- If the current character is @, are the following three characters end?
- If the next character is @, does the following sequence match either int, string, bool or unit?
Let’s handle the first check.
if current_literal == tokens::AT && next_char_fn(self.position, self.read_position + 0x4, tokens::NULL) == tokens::MAIN
{
    match self.input.get(self.identifier_start_read_position as usize..self.read_position + 0x4)
    {
        Some(ident) => {
            self.identifier_start_read_position = -0x1;
            self.position += 0x4;
            self.read_position += 0x4;
            let ident_owned = String::from(ident);
            Ok(tokens::Token::new(
                tokens::IdentifierKind::MAIN,
                ident_owned,
                self.line_number,
                self.position,
            ))
        }
        None => self.unknown_token_error(tokens::NULL),
    }
}
In Line 1, we check whether the current character is of kind AT and whether the sequence of four characters after it is of kind MAIN. When the condition holds, we are sure that we have encountered a @main pattern. In Lines 7 and 8, we reset the cursor position by moving four steps forward. Finally, in Line 11, we return a token of identifier kind MAIN. For END, the steps are similar except for the beginning, where we retrieve a sequence of three characters instead of four.
if current_literal == tokens::AT && next_char_fn(self.position, self.read_position + 0x3, tokens::NULL) == tokens::END
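Building on that condition, the body of the END branch might look like the sketch below, mirroring the MAIN branch with three-character offsets; the IdentifierKind::END variant name is an assumption:

{
    match self.input.get(self.identifier_start_read_position as usize..self.read_position + 0x3)
    {
        Some(ident) => {
            // reset the identifier marker and move the cursor three steps forward
            self.identifier_start_read_position = -0x1;
            self.position += 0x3;
            self.read_position += 0x3;
            let ident_owned = String::from(ident);
            Ok(tokens::Token::new(
                tokens::IdentifierKind::END,
                ident_owned,
                self.line_number,
                self.position,
            ))
        }
        None => self.unknown_token_error(tokens::NULL),
    }
}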
Good. We are making progress. The next check is a little bit involved. Once we assert that the current character is @
, we will retrieve the sequence of
characters between the identifier_start_read_position
and read_position
. These characters are then checked to assert if they
match @int
, @string
, @bool
or @unit
.
if next_char == tokens::AT {
    match self.input.get(self.identifier_start_read_position as usize..self.read_position)
    {
        Some(ident) => {
            let ident_owned = String::from(ident);
            let tok = match ident {
                tokens::INT => Ok(tokens::Token::new(
                    tokens::IdentifierKind::INTTYPE,
                    String::from(ident),
                    self.line_number,
                    self.position,
                )),
                /* trimmed for readability */
                _ => todo!("todo"),
            };
            self.identifier_start_read_position = -0x1;
            tok
        }
        None => self.unknown_token_error(tokens::NULL),
    }
}
Notice that we have introduced another set of todos. While both will be retired at some point, the empty todo is interesting. Rust does not allow an if expression whose value is used without a corresponding else, hence we are forced to add an empty one as a placeholder. With extract_token_from_alphabet in place, we will inject it into the main logic flow under the new_token function. At the tail of our current new_token function, we have a todo that we will now retire and replace with:
{
    if self.identifier_start_read_position == -0x1 {
        self.identifier_start_read_position = self.position as isize;
    }
    self.extract_token_from_alphabet()
}
We will expand our test suite to verify tokenization of typing information works correctly.
Verify @int
Let’s begin by adding a simple test.
#[test]
fn should_read_equal_int_typing() {
    let mut lx = Lexer::new(String::from("@int"));
    assert_eq!(
        lx.read_tokens().unwrap().token_type,
        tokens::IdentifierKind::INTTYPE
    );
}
As you can see, very simple.
$ cargo test lexer::lexer_tests::should_read_equal_int_typing
Well, that was unexpected. The test fails. It seems our implementation reaches the todo!("empty todo")
block. Why is that?
In Line 412, we check the next_char instead of the current one. This check makes no sense. Let’s go back and have a look at a snippet of valid Karis syntax.
let echo @unit = fn(name @string){
    print("Name #name")
};
From what we have learned about the lexer, it walks down the source code line by line until it finishes. Our implementation, however, does not satisfy that requirement. If we feed it the snippet above, it will only run on the first line. What we need is to add multi-line support. Why is multi-line support necessary? In practice, we write source code in text files and expect the programming language to read the source code from those files. The anatomy of a text file is that it contains entries along each line, with “lines” separated by a \n ASCII character. In our case, Karis’s lexer does not have such an ability since it does not know how to handle line breaks. Let’s change that. In Line 412, we amend the else if condition to match the snippet below.
else if is_space(next_char.as_bytes().first().unwrap())
    || next_char == tokens::COMMA
    || next_char == tokens::LBRACE
    || next_char == tokens::RPAREN
    || next_char == tokens::AT
    || next_char == tokens::SEMICOLON
    || next_char == tokens::LPAREN
Here, we check whether the next character is a whitespace, a comma, a left brace, a right parenthesis, an at symbol, a semicolon, or a left parenthesis.
The reasoning for this change is that a line can end with either a left brace {, a semicolon ;, or a right brace }. Likewise, a new line begins with whitespace or alphanumeric characters. In Line 2, we handle a somewhat subtle edge case of entity separation with a comma. This proves significant when we tokenize function arguments.
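The is_space call used in the condition above refers to a small helper in the lexer. A minimal sketch of it, assuming it only needs to recognize the ASCII space byte, could look like this:

// hypothetical sketch of the helper; the real implementation may also
// treat tabs and newlines as whitespace
fn is_space(chr: &u8) -> bool {
    *chr == b' '
}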
The next change is retiring todo!("empty todo") with a proper implementation. The lexer will need to run through the entire input in some fashion, either iteratively or recursively. Using an iterator will not fit our use case, so we will employ recursion: extract_token_from_alphabet will call itself recursively, moving the cursor along the input string.
self.move_current_position_and_read();
self.extract_token_from_alphabet();
With those changes, we are now ready to run our tests. Try rerunning the test.
$ cargo test lexer::lexer_tests::should_read_equal_int_typing
Nice!!! It passes as expected.
Finalize typing tokenization verification
Testing for the rest of typing information will follow the same pattern. See the full implementation below.
$ cargo test lexer::lexer_tests
We have now verified that typings tokenization is working correctly. On to the next one.
Overview of Karis keywords
Karis has a limited set of keywords.
| Keyword | IdentifierKind |
| --- | --- |
| let | LET |
| true | TRUE |
| false | FALSE |
| if | IF |
| else | ELSE |
| return | RETURN |
fn
, which describes a function, is also a keyword. However, in this lesson, we will limit our scope to the keywords above.
Tokenization implementation
We pick from where we left off, which looks like below.
let tok = match ident {
    tokens::INT => Ok(tokens::Token::new(
        tokens::IdentifierKind::INTTYPE,
        String::from(ident),
        self.line_number,
        self.position,
    )),
    /* trimmed for readability */
    _ => todo!("todo"),
};
We focus on retiring the todo on Line 9 with valid match cases for each keyword. Tokenizing keywords follows the same logic: first, we match the identifier string against the semantics of a particular token; if the match succeeds, we return the token representation. The thing to pay attention to is the identifier string, which is a slice of the input string starting at the identifier_start_read_position position and ending at the read_position.
Recall: identifier_start_read_position indicates the starting point when an identifier is being read.
Recall: read_position points to the next character in the input string after the current character.
We visualize the input string as an array and slice it to retrieve the items contained between two bounds. In this case, what we get is a &str. We then match this &str against keyword identifiers. Let’s demonstrate this.
match self.input.get(self.identifier_start_read_position as usize..self.read_position) {
    Some(ident) => {
        let tok = match ident {
            tokens::LET => Ok(tokens::Token::new(
                tokens::IdentifierKind::LET,
                String::from(ident),
                self.line_number,
                self.position,
            )),
            //** add more match patterns here **/
        };
        tok
    }
    None => self.unknown_token_error(tokens::NULL),
}
In Line 1, we slice the input string, retrieving entries between the specified bounds. In Line 3, we assign the match expression to a variable tok whose value will be evaluated inside the match body. In Line 4, we check if the ident matches LET and then return a result. That’s all that is involved in keyword tokenization. We need to add similar match patterns for the rest of the keywords.
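For reference, the remaining arms might look like the sketch below, assuming the tokens module exposes TRUE, FALSE, IF, ELSE and RETURN constants for the keyword strings:

// a sketch of the remaining keyword arms; the tokens constant names are assumptions
tokens::TRUE => Ok(tokens::Token::new(
    tokens::IdentifierKind::TRUE,
    String::from(ident),
    self.line_number,
    self.position,
)),
tokens::FALSE => Ok(tokens::Token::new(
    tokens::IdentifierKind::FALSE,
    String::from(ident),
    self.line_number,
    self.position,
)),
/* IF, ELSE and RETURN follow the same shape */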
Overview of the verification procedure
The manner to verify keyword tokenization is similar for each keyword. Let’s add a simple test to demonstrate.
#[test]
fn should_read_let_keyword() {
    let mut lx = Lexer::new(String::from("let"));
    assert_eq!(
        lx.read_tokens().unwrap().token_type,
        tokens::IdentifierKind::LET
    );
}
As you can see, the test is somewhat self-explanatory. We initialize the lexer with an input text containing “let”, then call read_tokens and assert its result. Quite easy, right? We can now run it.
$ cargo test lexer::lexer_tests::should_read_let_keyword
It works.
Edge cases verification
Let’s switch gears a little bit. We now restructure the input text to mimic a programming language syntax. We will feed the lexer input that looks like this:
if {
return;
}else{
return;
}
The intent is that we want to be sure that our lexer works with any input of any structure and correctly produces tokens for keywords. Consider this as stress testing of the lexer. Let’s add the test.
#[test]
fn should_read_if_else_return_keywords() {
    let mut lx = Lexer::new(String::from("
    if {
        return;
    }else{
        return;
    }
    "));
    assert_eq!(
        lx.read_tokens().unwrap().token_type,
        tokens::IdentifierKind::IF
    );
    /* trimmed for readability */
    assert_eq!(
        lx.read_tokens().unwrap().token_type,
        tokens::IdentifierKind::RBRACE
    );
}
The lexer does not particularly care which language the source text is written in; as long as the input contains characters and words it expects, it will always work. Let’s demonstrate this with a concrete example. Consider the Golang program below.
package main

import "fmt"

func main() {
    if 7%2 == 0 {
        fmt.Println("7 is even")
    } else {
        fmt.Println("7 is odd")
    }
    if 8%4 == 0 {
        fmt.Println("8 is divisible by 4")
    }
    if num := 9; num < 0 {
        fmt.Println(num, "is negative")
    } else if num < 10 {
        fmt.Println(num, "has 1 digit")
    } else {
        fmt.Println(num, "has multiple digits")
    }
}
We can see there are a lot of differences in syntax compared to Karis. However, there are similarities in how conditional statements and logical conditionals are formed. When we pass this snippet to our lexer, it will only succeed in identifying what it knows or assumes to be correct. For instance, package is interpreted as a VARIABLE. The same goes for main and import. On the other hand, "fmt" is interpreted as a string literal.
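As a quick illustration of that behaviour, a test along the lines below could assert it; the test name is hypothetical and it relies on the VARIABLE identifier kind mentioned above:

#[test]
fn should_tokenize_unknown_words_as_variables() {
    // the lexer has no notion of Go keywords, so ordinary words fall back to VARIABLE
    let mut lx = Lexer::new(String::from("package main"));
    assert_eq!(
        lx.read_tokens().unwrap().token_type,
        tokens::IdentifierKind::VARIABLE
    );
    assert_eq!(
        lx.read_tokens().unwrap().token_type,
        tokens::IdentifierKind::VARIABLE
    );
}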