Have you ever wondered how programming languages like Python, Java or Ruby work? Do you have an interest in building your own language? If so, this series of posts is for you. We dive into building a novel programming language from first principles. Get ready to get your hands dirty. It's going to be a fun ride. Access the final product here: https://github.com/daviddexter/karis-programming-language
Integer literals
Karis represents integers using a simple dialect. It limits its scope to numbers without decimal points. An entry like 100.00 is an invalid value in Karis. Let's add a mechanism to tokenize integers in the lexer.rs file. Earlier, we introduced the is_integers_only helper function. The function takes a character or sequence of characters and asserts whether it is an integer or not. Therefore, we inject it into lexer.rs by adding the code below.
if is_integers_only(ident) {
    Ok(tokens::Token::new(
        tokens::IdentifierKind::INTLITERAL,
        ident_owned,
        self.line_number,
        self.position,
    ))
}
The condition asserts the presence of an integer. Let's demonstrate its internal working visually. Consider a value like 2290.
The lexer will work on each element of the value.
2 2 9 0
For each element, it checks whether it is a digit. If the check holds for every element, the entire sequence is treated as an integer.
Consider a value like 22B3.
2:digit 2:digit B:not digit 3:digit
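For reference, a digit-only check like this is only a few lines of Rust. Below is a minimal sketch of what such a helper might look like; the actual is_integers_only in the series may be implemented differently.

// Sketch: returns true only when every character in the slice is an ASCII digit.
fn is_integers_only(ident: &str) -> bool {
    // An empty sequence is not a valid integer literal.
    !ident.is_empty() && ident.chars().all(|c| c.is_ascii_digit())
}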
Looking back at the tokenization snippet we added to lexer.rs, Line 1 shows that is_integers_only takes an ident argument of type &str, yet in Line 4 the token is built from a String. In Rust, &str and String are different. A &str is a borrowed reference that can be passed from one function to another so long as its "lifetime" is guaranteed. A String, however, is owned and must have exactly one owner at a time. We therefore need to convert ident to an owned variant.
We do this by creating an owned copy of ident with String::from.
let ident_owned = String::from(ident);
In Line 4, we use ident_owned, the owned copy we just created. That's it. Integer literals tokenization is done.
Variable names and string literals
A string is interesting. Let’s examine the code below.
let name @string = "Karis";
Excluding let, @string, = and ;, both name and "Karis" are represented as strings. In our context, however, one is a variable, while the other is a literal. Let's look at another example.
let add @int = fn(x @int, y @int){
    return x + y;
};

add(1,1);
add in Line 5 is a function call. From a textual perspective, it is just a string. The key to identifying it as a function call is the adjacent parentheses. A function call can have no arguments, a single argument, or multiple arguments.
add();
add(1,1);
add(1,2,3,4,5,);
The above are all valid calls to a function. With all of this in context, we have an interesting set of problems to address. We need a way to distinguish:
- String literal
- Function call
- Variable names
String literal
We'll begin with string literals. A string literal, in essence, starts and ends with a quotation mark. This signals where a string literal begins and ends. We'll therefore introduce a condition that checks for the presence of an opening quotation mark. We'll amend extract_token_from_alphabet.
if is_quotation_mark(current_literal) {
    let string_literal_ending = self.traverse_to_closing_quotations(self.position + 0x1);
    let string_literal = self
        .input
        .get(self.position + 0x1..string_literal_ending)
        .unwrap();
    self.identifier_start_read_position = -0x1;
    self.position = string_literal_ending;
    self.read_position = string_literal_ending + 0x1;
    Ok(tokens::Token::new(
        tokens::IdentifierKind::STRINGLITERAL,
        String::from(string_literal),
        self.line_number,
        self.position,
    ))
}
In Line 1, we provide is_quotation_mark with the current literal in focus and check whether it is a quote mark. If true, the next thing we need, in Line 2, is to recursively move along the length of the input string until we find a closing quote mark. This is the most crucial bit of our logic. The recursive call returns the index of where the closing quote mark is located. Consider the example below.
":0 a:1 l:2 e:3 x:4 ":5
Once we have the index of the closing quote mark, we slice the input string to retrieve the literal between the quote marks, as seen in Line 4. In Line 7, we reset identifier_start_read_position, meaning the next recursive call will not assume it is working on an alphanumeric entry. Finally, we move the cursor past the closing quote mark and return the string literal.
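To make the traversal concrete, here is a minimal sketch of how such a recursive helper could be written. It assumes self.input behaves like a String, that is_quotation_mark accepts a one-character slice, and that a closing quote always exists; the actual implementation in the series may differ.

// Sketch: walk forward one character at a time until the closing quote mark is found,
// then return its index.
fn traverse_to_closing_quotations(&self, begin: usize) -> usize {
    // Look at the single character at the current index.
    let current = self.input.get(begin..begin + 0x1).unwrap();
    if is_quotation_mark(current) {
        // Found the closing quote mark; return its index.
        begin
    } else {
        // Not a quote mark yet; recursively step one character forward.
        self.traverse_to_closing_quotations(begin + 0x1)
    }
}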
Function call and variable names
Handling function calls and variable names uses the same logic. We expand the previous if statement by adding a check for alphanumeric characters. Luckily, we already have what we need in place. We added the is_alphanumeric_only() helper function. It works similarly to is_integers_only, the only difference being that it checks for both digits and alphabetic characters.
Consider a value like 22B3.
2:digit 2:digit B:alphabet 3:digit
For the above, is_alphanumeric_only will return true.
else if is_alphanumeric_only(ident) {
    /* add more checks */
}
Inside the body of this else if branch, we will add a condition that checks the next character. Why? The only difference between a function call and a variable is that a function call has an opening parenthesis immediately following it.
if next_char == tokens::LPAREN {
    Ok(tokens::Token::new(
        tokens::IdentifierKind::CALLER,
        ident_owned,
        self.line_number,
        self.position,
    ))
} else {
    Ok(tokens::Token::new(
        tokens::IdentifierKind::VARIABLE,
        ident_owned,
        self.line_number,
        self.position,
    ))
}
In Line 1, we check whether the next character in the input string is an LPAREN. If so, we return the new token as a CALLER; otherwise, we return it as a VARIABLE.
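As a quick illustration, a hypothetical test (not one from the series) for the caller case could look like this:

#[test]
fn should_tokenize_caller() {
    // "add" is immediately followed by "(", so it is a function call, not a variable.
    let mut lx = Lexer::new(String::from("add(1,1);"));
    assert_eq!(
        lx.read_tokens().unwrap().token_type,
        tokens::IdentifierKind::CALLER
    );
}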
We can now handle valid Karis syntax. The expectation is that, given a source text, the lexer will recognize:
- LET binding
- Typing information, for instance, @int, @string
- Operators, for instance, +, -, *
- Variable names
- Integer literals
- String literals
We will introduce more test cases that will validate our expectations.
#[test]
fn should_read_equal_token5a() {
    let mut lx = Lexer::new(String::from("let a = 1 + 2;"));
}
We define a test named should_read_equal_token5a. In Line 3, we bootstrap the lexer and use let a = 1 + 2; as its input. By examining the input, we can intuitively deduce what the lexer should return on each call of read_tokens.
let a = 1 + 2 ;
The assertion statement is pretty straightforward. We call the read_tokens method and assert that the returned token matches what we expect.
assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::LET);
In Line 1, we match the result against LET. Let's do another similar test. In place of the + operator, we'll switch to the - operator.
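Before moving on, the full assertion sequence for the first test might look like the sketch below. LET, VARIABLE, INTLITERAL and SEMICOLON appear elsewhere in this series; ASSIGN and PLUS are assumed names standing in for whatever the actual variants for = and + are called.

// Sketch: one assertion per token in "let a = 1 + 2;", in the order the lexer emits them.
assert_eq!(lx.read_tokens().unwrap().token_type, tokens::IdentifierKind::LET);
assert_eq!(lx.read_tokens().unwrap().token_type, tokens::IdentifierKind::VARIABLE);
assert_eq!(lx.read_tokens().unwrap().token_type, tokens::IdentifierKind::ASSIGN);
assert_eq!(lx.read_tokens().unwrap().token_type, tokens::IdentifierKind::INTLITERAL);
assert_eq!(lx.read_tokens().unwrap().token_type, tokens::IdentifierKind::PLUS);
assert_eq!(lx.read_tokens().unwrap().token_type, tokens::IdentifierKind::INTLITERAL);
assert_eq!(lx.read_tokens().unwrap().token_type, tokens::IdentifierKind::SEMICOLON);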
#[test]
fn should_read_equal_token5b() {
    let mut lx = Lexer::new(String::from("let a = 1 - 2;"));
    /* trimmed for readability */
}
The lexer is expected to identify the - symbol and return the appropriate token. We are now sure that the current lexer implementation can correctly identify and tokenize:
- A LET binding
- Variable names
- Integer literals
- The addition operator
- The subtraction operator
Verify string literals tokenization
Next, we will assert string literals. We introduce another test with an input of type string.
#[test]
fn should_read_equal_token7() {
    let mut lx = Lexer::new(String::from("let name = \"alice\"; "));
}
Much is shared with the previous test. In this test, however, we expect the lexer to correctly identify the "alice" input as a string, not as a variable or anything else.
#[test]
fn should_read_equal_token7() {
    let mut lx = Lexer::new(String::from("let name = \"alice\"; "));
    /* trimmed for readability */
    assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::STRINGLITERAL);
    /* trimmed for readability */
}
Line 5 is the core of the test: it asserts the presence of a string literal. If this assertion fails, we must examine our implementation and find the problem.
We have one particular edge case that we need to verify: string interpolation. Consider a string like "My name is #name". In this case, #name will be replaced by an actual value. Such interpolation will be executed in built-in functions like print and format. Let's now add a test to verify string interpolation.
#[test]
fn should_read_equal_token8() {
    let mut lx = Lexer::new(String::from("print(\"Name #name\");"));
    assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::PRINT);
    assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::LPAREN);
    assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::STRINGLITERAL);
    assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::RPAREN);
    assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::SEMICOLON);
}
In Line 3, we give the lexer an input of the form print(\"Name #name\");. We can deduce that the first token to be identified is PRINT. The interesting part is the Name #name section. We expect the lexer to return a string literal; otherwise, an error will occur because the # character is not alphabetic. We can now claim that our lexer tokenizes literals.
Overview of function tokenization
A function definition in Karis begins with a binding to the fn keyword. Like other programming languages, a function in Karis accepts arguments for its internal logic. Every function in Karis must return something of a particular type. This return type is annotated on the binding of the function. For example, a function that adds two integers should return the sum of type @int. Karis provides the @unit type if a function has nothing to return. See the example below:
let printer @unit = fn(name @string){
    // ^ means we discard the result of the function
    print("Name #name")
};
In the above, the printer function will print the name to the standard output. However, since Karis demands an explicit return type for each function, the @unit type satisfies the requirement. We have two categories of functions to handle: built-in and user-defined functions.
Format built-in function
Currently, we have one built-in function, print. Let's add another, format. This built-in function will evaluate a literal string containing placeholders and replace the placeholders with corresponding values. The format function will be called with the following syntax:
let echo @string = fn(name @string){
    format("#name")
};
Tokenizing this function is quite simple. All that is needed is to assert that the current sequence of characters matches format and then return.
tokens::FORMAT => Ok(tokens::Token::new(
    tokens::IdentifierKind::FORMAT,
    String::from(ident),
    self.line_number,
    self.position,
)),
Custom functions
As mentioned above, the fn keyword marks the beginning of a function definition. The task of tokenizing therefore boils down to this: when the current character in focus is f, we check whether the following character is n. We will define a closure inside the body of the extract_token_from_alphabet function.
let is_func_identifier_fn = |next_char: &str| {
    let current_char = self
        .input
        .get(self.position..self.read_position)
        .unwrap()
        .as_bytes()
        .first()
        .unwrap();

    let next_two_step_char = next_char_fn(self.position + 0x2, self.read_position + 0x2, tokens::NULL);

    (*current_char == 0x46 || *current_char == 0x66)
        && (*next_char.as_bytes().first().unwrap() == 0x6e
            || *next_char.as_bytes().first().unwrap() == 0x4e)
        && (next_two_step_char == tokens::LPAREN)
};
A lot is going on here. Let's simplify it. The is_func_identifier_fn closure receives an argument, next_char. We expect next_char to be an n character. In Line 2, we retrieve the current character. In this context, we expect the current character to be an f character. In Line 10, we retrieve the character two steps forward from the current position. In this context, we expect that character to be an LPAREN. Line 12 to Line 15 is where the "magic" happens: we compare the Unicode representations of the characters we have just retrieved. Unicode is a universal standard for encoding character symbols.
| Character | Unicode in Hexadecimal |
| f | 66 |
| F | 46 |
| n | 6E |
| N | 4E |
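These values are easy to confirm directly in Rust:

// 'f' and 'n' in lowercase and uppercase, as hexadecimal byte values.
assert_eq!(b'f', 0x66);
assert_eq!(b'F', 0x46);
assert_eq!(b'n', 0x6E);
assert_eq!(b'N', 0x4E);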
Line 12 holds the first condition, where we check whether we have a lowercase f or an uppercase F. Lines 13 and 14 hold the second condition, where we check whether we have a lowercase n or an uppercase N. Line 15 is the final condition, where we check for the presence of a left parenthesis. If all three conditions are true, we have encountered a custom function definition. To finish, we introduce the closure into our main logic flow.
else if is_func_identifier_fn(next_char) {
    let ident = self.input.get(self.identifier_start_read_position as usize..self.read_position + 1).unwrap();
    if ident == tokens::FUNCTION {
        self.identifier_start_read_position = -0x1;
        self.position += 1;
        self.read_position += 1;
        Ok(tokens::Token::new(
            tokens::IdentifierKind::FUNCTION,
            String::from(ident),
            self.line_number,
            self.position,
        ))
    } else {
        self.unknown_token_error(ident)
    }
}
With the above, when is_func_identifier_fn asserts the beginning of a function, we return a token of kind FUNCTION; otherwise, we return an error. It should be noted that the error branch is there to satisfy Rust; in practice it will not run.
Generic function tokenization verification
Let's continue to expand our lexer_tests test suite. The first test will verify the syntax below.
fn() {}
After tokenization, we want the lexer to return a FUNCTION, an LPAREN and an RPAREN, ending with an LBRACE and an RBRACE. That is the expectation.
#[test]
fn should_read_func_0() {
    let mut lx = Lexer::new(String::from("fn() {} "));
    /* trimmed for readability */
}
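The trimmed body is just that expectation spelled out as assertions. It might look like the sketch below, assuming the brace kinds are indeed named LBRACE and RBRACE as above.

// Sketch: one assertion per expected token for the input "fn() {} ".
assert_eq!(lx.read_tokens().unwrap().token_type, tokens::IdentifierKind::FUNCTION);
assert_eq!(lx.read_tokens().unwrap().token_type, tokens::IdentifierKind::LPAREN);
assert_eq!(lx.read_tokens().unwrap().token_type, tokens::IdentifierKind::RPAREN);
assert_eq!(lx.read_tokens().unwrap().token_type, tokens::IdentifierKind::LBRACE);
assert_eq!(lx.read_tokens().unwrap().token_type, tokens::IdentifierKind::RBRACE);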
When we run the test, it passes just as we expect. The test validates that the lexer can correctly identify the building blocks of a proper Karis function.
Edge case verification
Let's shake this up a little. We'll define another test that uses func() {} as its input. Clearly, from the syntax, we can tell that func is not a valid function keyword in Karis. At first glance, we may assume that the lexer throws an error. But not quite. Here is the reason why. Once a function is declared, we intend to "call" it somewhere in the source text. An intent to call a function has a suffix of (), similar to how other programming languages call functions. In this case, func is identified as a CALLER and not a FUNCTION. Let's demonstrate this.
#[test]
fn should_read_func_1() {
    let mut lx = Lexer::new(String::from("func() {} "));
    assert_eq!(
        lx.read_tokens().unwrap().token_type,
        tokens::IdentifierKind::FUNCTION
    );
}
The test fails. When we look into the test logs, we see that the lexer produced a CALLER while we expected a FUNCTION. In this scenario, our assertion is incorrect.
failures:
---- lexer::lexer_tests::should_read_func_1 stdout ----
thread 'lexer::lexer_tests::should_read_func_1' panicked at 'assertion failed: `(left == right)`
left: `CALLER`,
right: `FUNCTION`', lexer/src/lexer.rs:1363:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
failures:
lexer::lexer_tests::should_read_func_1
test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 32 filtered out; finished in 0.00s
Let's now revert and complete the test so that our expected assertion matches what the lexer actually returns. We'll then add one more test that receives input matching the correct syntax.
Function with arguments tokenization verification
Let’s now verify that the lexer can tokenize the syntax below.
let add @int = fn(x @int, y @int) { return x + y; }
We see that the function add takes in two arguments, x and y, each of type @int, then returns the summation of the integers. The expectation is that the lexer will tokenize the function and its arguments.
let add @int = fn ( x @int , y @int ) { return x + y ; }
The test follows the same pattern as all other tests we have written.
#[test]
fn should_read_func_3() {
    let mut lx = Lexer::new(String::from("let add @int = fn(x @int, y @int) { return x + y; }"));
    /* trimmed for readability */
    assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::FUNCTION);
    assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::LPAREN);
    assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::VARIABLE);
    assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::INTTYPE);
    assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::COMMA);
    assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::VARIABLE);
    assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::INTTYPE);
    assert_eq!(lx.read_tokens().unwrap().token_type,tokens::IdentifierKind::RPAREN);
    /* trimmed for readability */
}
Line 5 to Line 12 are the focal points of the test. Each assertion validates an individual building block of the function, in the sequence the lexer expects them to be ordered.
Non-existent tokens
Our lexer can handle the tokens that we have specified. But what if we pass it syntax that contains a token we have not defined? Ideally, we should not proceed with tokenization. We have two options:
- Throw the error and quit the program entirely
- Pass the error to the caller
For our purposes, we will choose option two. This is because the lexer is part of a larger program. We want each part of the program to propagate the error to the caller instead of crashing. This mechanism will allow us to handle the error correctly and give better error messages. Rust already gives us the means to propagate such errors. In the definition of read_tokens, we return Result<tokens::Token, errors::KarisError>. The caller can then use Rust's ? operator to propagate the error.
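For illustration, a hypothetical caller (the function name here is made up) could bubble the error up like this:

// Sketch: any error from read_tokens is handed straight to this function's own caller via `?`.
fn next_token_kind(lx: &mut Lexer) -> Result<tokens::IdentifierKind, errors::KarisError> {
    let token = lx.read_tokens()?;
    Ok(token.token_type)
}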
We will see more of this in action. Let's add a test to tokenize ~. We expect the lexer to return an error.
#[test]
fn should_return_err() {
    let mut lx = Lexer::new(String::from("~"));
    assert!(lx.read_tokens().is_err());
    // ^ we want an error
}
cargo test lexer::lexer_tests::should_return_err
It works as expected. Let’s add another one for %.
#[test]
fn should_return_err_2() {
    let mut lx = Lexer::new(String::from("%"));
    assert!(lx.read_tokens().is_err());
    // ^ we want an error
}
We expect an error similar to the previous test. However, is that the correct behaviour? No. In computer arithmetic, % is interpreted as modulus. Given operands on either side, it returns the remainder. At the minimum, Karis ought to support this behaviour. However, our tokens definition is missing the % entry. Let's add it. We'll begin in tokens.rs by adding:
pub const MODULUS: &str = "%";
Next, in the IdentifierKind enum, we add:
MODULUS, // "%"
Finally, we introduce a check for the new token in lexer.rs.
tokens::MODULUS => Ok(tokens::Token::new(
    tokens::IdentifierKind::MODULUS,
    ch_owned,
    self.line_number,
    self.position,
)),
This time, when we run the test, it will fail because we expect an error while the lexer successfully tokenizes %. Let's modify the test to suit its true intent and run it.
#[test]
fn should_read_modulus() {
    let mut lx = Lexer::new(String::from("%"));
    assert!(lx.read_tokens().is_ok());
}
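Going a step further, a hypothetical test for a full modulus expression might look like the sketch below; INTLITERAL and MODULUS are the kinds introduced earlier.

#[test]
fn should_read_modulus_expression() {
    // "10 % 3;" should yield INTLITERAL, MODULUS, INTLITERAL in sequence.
    let mut lx = Lexer::new(String::from("10 % 3;"));
    assert_eq!(lx.read_tokens().unwrap().token_type, tokens::IdentifierKind::INTLITERAL);
    assert_eq!(lx.read_tokens().unwrap().token_type, tokens::IdentifierKind::MODULUS);
    assert_eq!(lx.read_tokens().unwrap().token_type, tokens::IdentifierKind::INTLITERAL);
}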
Multi-line syntax
We'll move on to verify that the lexer can correctly work on multi-line input. As with other programming languages, we write programs in text files with the expectation that the language interpreter or compiler will consume the text files and run the program. This is the same for Karis. It is, therefore, prudent to have a lexer that consumes characters served from the file system.
Generic multi-line syntax
We'll start with something simple: a syntax with only variable declarations.
let sum @int = 1 + 2;
let person_age @int = 25;
let is_active @bool = true;
let is_live @bool = false;
We see that there is a lot of whitespace. That's OK. It's the very essence of supporting multi-line syntax. In our lexer, whitespace is just a separator. If the lexer is implemented correctly, it should tokenize the above syntax without a hiccup. Add the test below:
#[test]
fn should_read_multiline0() {
let mut lx = Lexer::new(String::from(
"
let sum @int = 1 + 2;
let person_age @int = 25;
let is_active @bool = true;
let is_live @bool = false;
",
));
/* trimmed for readability */
}
Multi-line syntax: function definition and call
We'll now verify the lexer with input that resembles a program that may be written in Karis. It's simple. We define a function add that takes two arguments of type @int and returns their summation. Then we call the function. The syntax will look like the one below:
let add @int = fn(x @int, y @int){
    return x + y;
};
add(1,3);
Similar to the previous attempt, there is a lot of whitespace. Add the test below.
#[test]
fn should_read_multiline1() {
let mut lx = Lexer::new(String::from(
"
let add @int = fn(x @int, y @int){
return x + y;
};
add(1,3);
",
));
/* trimmed for readability */
}
Multi-line syntax: function definition with conditional statements
Next is a relatively similar function with conditional statements.
#[test]
fn should_read_multiline2() {
let mut lx = Lexer::new(String::from(
"
let greater @int = fn(x @int, y @int) {
if x > y {
return x;
}
return y;
}; ",
));
/* trimmed for readability */
}
Multi-line syntax: program entry point
Here we need to verify that a program's entry point will be tokenized correctly. Recall that Karis programs are executable by default. Therefore, they need an entry point that collects all definitions within it before the execution process begins. A sample program is embedded in the test below. We expect the lexer to correctly know where a program begins and ends and to recognize the identifiable entities within it. Let's add a test for it.
#[test]
fn should_read_multiline_main() {
let mut lx = Lexer::new(String::from(
"
@main fn(){
let x @int = 5;
let name @string = \"Karis\";
}@end;
",
));
/* trimmed for readability */
}
We are now done with all tests. At this point, we can confidently state that the lexer is complete. But nothing is ever really done in software development. At some point, we’ll revisit the lexer. Not to change it but to extend it.
The console crate setup
We introduce a new program called console. The name is intentional, as it will contain APIs to interact with the compiler. The end-user will be able to issue commands to console to run some workflow in the compilation process. The console program will be a new crate in our workspace. We create it using the cargo tool that comes with Rust.
cargo new console
The command creates the new crate. The output indicates that we need to add the newly created crate to the current workspace; otherwise, it won't be able to reuse the Cargo.lock file in the root of the workspace. To add console to the workspace, we amend the Cargo.toml in the root of the workspace by appending console to the members list.
[workspace]
members = ["lexer","errors", "console"]
While the name console fits our purpose by making it easy to think about what we are creating, what we need is a name that the end user can associate with the Karis programming language and invoke at any time. Let's demonstrate this notion with the Python programming language. The language comes with a command, python, that can be typed into the terminal to start interacting with the language. That is what we need for Karis. To accomplish this, we need to change the Cargo.toml inside the console directory. Let's open the file and add the following entry:
[[bin]]
name = "karis"
path = "src/main.rs"
The entry tells the Rust compiler that the binary of the console program should be addressed as karis.
Let's verify that the change works correctly by running the following command:
cargo install --path . && karis
The command compiles, installs and runs the program. The output Hello, world! indicates that the program can now be referenced with the name we desire.
Let's proceed to introduce dependencies for the console crate. Open Cargo.toml in the console directory and add the following entries under the dependencies section:
clap = { version = "3.2.12", features = ["cargo"] }
lexer = { path = "../lexer" }
errors = { path = "../errors" }
We have already encountered the last two entries. The first entry, clap, allows us to develop a command-line application, which is exactly what we need. For our current use case, we will build a Read-Lexer-Print-Loop application. As the name indicates, the application will receive input, perform lexing operations on it and print the results to the terminal.
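The snippet below assumes the relevant clap items are already in scope. With clap 3.2, the imports would look roughly like this:

// Bring in the builder types and the arg! macro from clap 3.2.
use clap::{arg, Arg, ArgAction, Command};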
let matches = Command::new(KARIS_WELCOME_MESSAGE)
    .version("v0.1.0")
    .propagate_version(true)
    .subcommand_required(true)
    .arg_required_else_help(true)
    .subcommand(
        Command::new("rlpl")
            .about("Read-Lexer-Print-Loop extracts tokens and prints on stdout")
            .arg_required_else_help(true)
            .arg(
                Arg::new("interactive")
                    .short('i')
                    .long("interactive")
                    .action(ArgAction::SetTrue)
                    .required(false),
            )
            .arg(arg!(-p --filepath <PATH>).required(false)),
    )
    .get_matches();
In Line 1, we bootstrap the application by instantiating the Command type from clap. The application gives us the flexibility to add custom attributes. In Line 2, we state the version of the application as v0.1.0. In Line 6, we introduce an instance of a subcommand that the application will receive from the user. In this instance, the subcommand is called rlpl. We want the subcommand to receive arguments that will direct how its operation runs. In Line 10 to Line 17, we indicate that the subcommand should expect either the interactive or the filepath argument.
The command below demonstrates how the program may be run:
karis rlpl --interactive --filepath <path/to/karis/program/file>
Next, we will add an intercept implementation that will listen to sub-commands and execute them appropriately.
match matches.subcommand() {
    Some(("rlpl", sub_matches)) => {
        let interactive = sub_matches.get_one::<bool>("interactive");
        let file_path = sub_matches.get_one::<String>("filepath");

        if let Some(i) = interactive {
            if *i {
                return lexer_interactive();
            }
        }

        if let Some(file_path) = file_path {
            return lexer_from_file(file_path);
        }

        Ok(())
    }
    _ => {
        println!("Nothing to do");
        Ok(())
    }
}
From Line 6 to Line 10, when the interactive argument is provided, we invoke the lexer_interactive function. When we say "interactive", we mean the user who calls the subcommand can invoke more than one command in a single session. The equivalent is true when the filepath argument is provided, as seen from Line 12 to Line 14. Let's go further into the interactive function.
fn lexer_interactive() -> Result<(), KarisError> {
    println!("{}\n", KARIS_INTERACTIVE_MESSAGE);
    println!("Copy-Paste or type your Karis program after the prompt\n",);
    println!("Type :exit to close the console\n",);

    let mut input = String::new();

    loop {
        print!("{} ", PROMPT);
        io::stdin().read_line(&mut input)?;

        let text = input.trim();
        if text.is_empty() {
            println!("Nothing to scan \n");
        } else {
            if text == EXIT {
                println!("Closing interactive console. Catch you later :) \n");
                process::exit(0)
            }
            let mut lx = lex::Lexer::new(String::from(text));
            lx.generate_and_print();
        }

        input.clear();
    }
}
The block from Line 6 to Line 25 is the core of the implementation. The loop runs infinitely, reading input from the standard input and then passing it to the lexer on Line 20. The lexer_from_file function works in a similar manner, with the only exception that it reads from a file in the filesystem, as sketched below.
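A minimal sketch of lexer_from_file, assuming it simply reads the whole file into a string and hands it to the same lexer; the actual implementation may differ.

fn lexer_from_file(file_path: &str) -> Result<(), KarisError> {
    // Read the entire Karis source file into memory.
    // Like the interactive version, this assumes KarisError can be created from an I/O error.
    let content = std::fs::read_to_string(file_path)?;

    // Hand the source text to the lexer, which tokenizes and prints to stdout.
    let mut lx = lex::Lexer::new(content);
    lx.generate_and_print();
    Ok(())
}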
Let's now build and run the rlpl. Run the following command:
cargo run --bin=karis -- rlpl -p testdata/00/test00.kr
Perfect. It generates and prints tokens as expected.