To build a programming language
by Peter Simard · 07/06/2008 (6:19 pm) · 4 comments
This is not a typical game development blog. I recently took the past few days to try to design a programming language from scratch. Why would I do this? Mostly to see if I could.
How a language actually works has always fascinated me. Going from symbols and words, into computer logic is a somewhat magical step. I wanted to see if I could create a programming language from scratch with nothing but intuition to guide me.
After about 3 days of tinkering, I have a working prototype aptly named "SimardBasic". It is written in C#. This is not a compiled language, but an interpreted language. C# is used to parse the raw code, and generate the necessary structures to run a program. It is also used for low level arithmitic and user interface.
Parsing the code
In order to get the code into a usable format, the code is broken down into individual statements. An example statement might be:
By placing all the statements into an array, they can be iterated though as the program runs.
Statements are then parsed into their individual syntax components. Using a hierarchy of objects to represent the types of syntax. All syntax derive from the base Syntax class. Some example derived syntax classes:
Once you have the entire program parsed into statements (which are in turn parsed into individual syntax) it is a matter of looping though them and applying the proper operation to them.
RegEx
Regular Expressions are an extremely powerful tool. Without them, parsing the raw code would be factors more difficult. Take for example this expression:
This is used to parse the code for any numerical digits. Without using regular expressions, you would be forced to examine each character to ensure it is valid. You will also notice there are true/false strings inside of there. Internally true and false are converted into 1 and 0 respectively. The ending (?!\w) will allow numbers to end in a symbol or space, but not a letter (for example, 100+ works, but 100q will fail).
For more information on regular expressions, I highly recommend picking up Mastering Regular Expressions
Syntax evaluation
This was one of the trickier components to write. Statements consist of a keyword (or variable), followed by some syntax that must be evaluated. The syntax can come in any form, and in the end, must end up with a single value. For example:
The evaluator must take ($var + (2 * 8)) @ "widgits\n" and perform the necessary operations, in the correct order to obtain a single value, which can be used by the statement parser (in this case, to be assigned to a variable).
Internal Functions
Adding internal functions is easy using the C# Invoke() function. Here is an example internal function:
Missing Functionality
This is very much a bare bones language at the moment. Some core functionality that is missing:
- Arrays
- For() statement
- User created functions
- Automatic order of operations (have to use parenthesis right now)
- Only basic stdin/out
Because this was a learning project, I am releasing the full source code. I don't know how useful any of it will be, but feel free to use it any way you wish.
My next side project will be doing some research on the "proper" way to construct a language. I suspect there are several core assumptions that are incorrect.
Source
www.crownsofpower.com/SimardBasic.zip
Now back to game design!
How a language actually works has always fascinated me. Going from symbols and words, into computer logic is a somewhat magical step. I wanted to see if I could create a programming language from scratch with nothing but intuition to guide me.
After about 3 days of tinkering, I have a working prototype aptly named "SimardBasic". It is written in C#. This is not a compiled language, but an interpreted language. C# is used to parse the raw code, and generate the necessary structures to run a program. It is also used for low level arithmitic and user interface.
Parsing the code
In order to get the code into a usable format, the code is broken down into individual statements. An example statement might be:
$var = 10; or print "The var is " @ $var;
By placing all the statements into an array, they can be iterated though as the program runs.
Statements are then parsed into their individual syntax components. Using a hierarchy of objects to represent the types of syntax. All syntax derive from the base Syntax class. Some example derived syntax classes:
Syntax_String - Contains a string of chars encapsulated by double quotes.
Syntax_Variable - A variable. Each Syntax_Variable shares the same memory space for the value of the variable.
Syntax_Symbol - Operaters such as ==, >, @, +
Syntax_ControlBrace - Left and right brackets { }. Controls code depth
Syntax_Parenthesis - Used for grouping syntaxOnce you have the entire program parsed into statements (which are in turn parsed into individual syntax) it is a matter of looping though them and applying the proper operation to them.
RegEx
Regular Expressions are an extremely powerful tool. Without them, parsing the raw code would be factors more difficult. Take for example this expression:
"(\d+.\d+|\d+(?!\.)|true|false)(?!\w)"
This is used to parse the code for any numerical digits. Without using regular expressions, you would be forced to examine each character to ensure it is valid. You will also notice there are true/false strings inside of there. Internally true and false are converted into 1 and 0 respectively. The ending (?!\w) will allow numbers to end in a symbol or space, but not a letter (for example, 100+ works, but 100q will fail).
For more information on regular expressions, I highly recommend picking up Mastering Regular Expressions
Syntax evaluation
This was one of the trickier components to write. Statements consist of a keyword (or variable), followed by some syntax that must be evaluated. The syntax can come in any form, and in the end, must end up with a single value. For example:
$foo = ($bar + (2 * 8)) @ "widgits\n";
The evaluator must take ($var + (2 * 8)) @ "widgits\n" and perform the necessary operations, in the correct order to obtain a single value, which can be used by the statement parser (in this case, to be assigned to a variable).
Internal Functions
Adding internal functions is easy using the C# Invoke() function. Here is an example internal function:
// WHILE statement
public void spWHILE(Statement statement)
{
// Evaluate the syntax inside the while parenthesis
string evalResult = statement.evalRemainingSyntax();
if (!isTrue(evalResult))
{
// Eval was false, ignore the next statement (usually a control brace)
ignoreNextStatement = true;
}
else
{
// Eval was true, tell the interpreter where to return to
loopReturnStatement[currentControlDepth] = statement;
}
}Missing Functionality
This is very much a bare bones language at the moment. Some core functionality that is missing:
- Arrays
- For() statement
- User created functions
- Automatic order of operations (have to use parenthesis right now)
- Only basic stdin/out
Because this was a learning project, I am releasing the full source code. I don't know how useful any of it will be, but feel free to use it any way you wish.
My next side project will be doing some research on the "proper" way to construct a language. I suspect there are several core assumptions that are incorrect.
Source
www.crownsofpower.com/SimardBasic.zip
Now back to game design!
#2
07/06/2008 (9:35 pm)
IBM's articles on lex/yacc are quite good.
#3
I recommend some parsing tool that uses GLR or backtracking LALR, but don't know of one for C#. Otherwise, ANTLR is definitely a good choice for this target language.
Regular expressions really won't do for language parsing since they can't maintain context. It's impossible, for example, to parse something like 'if' expressions with regexps.
For digging deeper into this, I really recommend reading the standard text "Compilers - Principles, Techniques, and Tools" by Aho et al. It's a *really* good textbook and covers all the fundamentals.
07/07/2008 (11:15 am)
Hmm, Lex+YACC *used* to be the proper choice. The art of parsing has advanced a lot and today there are much more powerful and easier to use tools around. No need to separate tools for lexical analysis and syntax analysis anymore, for example.I recommend some parsing tool that uses GLR or backtracking LALR, but don't know of one for C#. Otherwise, ANTLR is definitely a good choice for this target language.
Regular expressions really won't do for language parsing since they can't maintain context. It's impossible, for example, to parse something like 'if' expressions with regexps.
For digging deeper into this, I really recommend reading the standard text "Compilers - Principles, Techniques, and Tools" by Aho et al. It's a *really* good textbook and covers all the fundamentals.
#4
More about it: http://en.wikipedia.org/wiki/GOLD_(parser)
Official page: http://www.devincook.com/goldparser/
Biased comparison between gold parser and other options: http://www.devincook.com/goldparser/about/comparison-parsers.htm
We are using it on a workflow product, used by large companies (a bank, a mining company, telecoms). Unfortunatelly, we still feel like there should be better products out there to do this kind of work more easily, but until now gold parser proved to be the best option...
I would like to hear otherwise since it has no graphical editor... ^^ It is all defined in text form and the 'debug' is done on the parsing tree itself, not always providing enough information to solve common problems...
07/08/2008 (4:37 pm)
In the company I work for we use "Gold Parser" (LALR), that is freeware. You basically only need to learn how to define commands in gold parser and to choose one of the supported languages (including C#) to make a new language on it.More about it: http://en.wikipedia.org/wiki/GOLD_(parser)
Official page: http://www.devincook.com/goldparser/
Biased comparison between gold parser and other options: http://www.devincook.com/goldparser/about/comparison-parsers.htm
We are using it on a workflow product, used by large companies (a bank, a mining company, telecoms). Unfortunatelly, we still feel like there should be better products out there to do this kind of work more easily, but until now gold parser proved to be the best option...
I would like to hear otherwise since it has no graphical editor... ^^ It is all defined in text form and the 'debug' is done on the parsing tree itself, not always providing enough information to solve common problems...

Torque Owner Gary "ChunkyKs" Briggs
Lex and Yacc are your friends, GIYF.
I cannot strongly enough suggest taking the time to learn Lex and Yacc [You'll find yourself using flex and bison pretty soon, just by the way]. They're pretty much the "right" way to do language stuff
Gary (-;