What is an Abstract Syntax Tree?
(Let me preface this by saying that syntax trees are hardly the only way to write a translator; but they’re broadly used because of their simplicity, versatility, and flexibility.)
“Abstract” as in abstraction. An abstract syntax tree (henceforth AST) is the manner in which we will be encoding every command of our source and object programs. Think of it as an internal representation of the semantics, derived from the original syntax, and encodable in the final syntax.
The biggest trick with an AST is to remember not to confuse it with its representation. Syntax and semantics are elemental, in the old-school black-magic Aleister Crowley sense. You can’t even talk about one without comparing it to the other. This is, in my personal analysis, where a lot of the works-like-magic notion about compilers comes from; but there’s nothing magical about it.
To understand this, think about what happens when you’re speaking with another human being. The art of communication is the act of transference of a thought from one mind to another, through some mutually external medium; be it speech, or writing, or hand gestures. When someone begins to speak, and you begin to listen, your brain (operationally) initially considers every single possible thing that the speaker is about to say. As they pronounce the phonemes and syllables and words of their sentence, the possibilities narrow, and different semantic clusters in your mental network go dark. Once they’ve finished their statement, given that it was well-formed, you have a reasonably precise notion of what it was that was in their mind, which they are trying to convey to you. You can duplicate it, within reason. If it was ill-formed, you might only have a vague, or even a holistically incorrect, notion.
This is the concept of an AST; the AST is the thought itself. It is a (reasonable) binary representation of the program data. I used to think of ASTs as being something like a sentence diagram; but this isn’t really the case, and it in fact held me back for some time. My script grammars were limited to only mild variations of the structure of the language they were written in. ASTs are not grammatical structures; they are more implicit than that. This is the greatest reason why you must not confuse what an AST is with what an AST might look like; it’s the difference between using it like one would a first language, and a second language.
I’m assuming that you have already read the chapter on building an efficient Scanner. I’m also encouraging you to keep the Glossary open in another tab, as I may accidentally jump ahead in my definitions in an early edition of this piece. (Promise, I’ll review it with a giant whiteboard and lots of draft-mechanics before I declare the series finished; you are always free to scream at me about it in the comments.)
So, as an example, let’s consider basic English. Actually, let’s consider Bad English, because I’m not willing to jot down every single rule for a language with a lexicon three times larger than any other western language. Bad English will have a limited set of rules, and will generate a set of only a few basic sentences. For further simplicity, they will all be declarative, much like a programming language.
The first step, before you even begin to piece together an AST, is to write down the rules of your language in a formal specification; that is, a specification with a distinct set of rules to its grammar, which often must be known well by the reader, and allows for little chance of misinterpretation. An informal specification is usually presented in natural language, like this article; it is understood by a much wider audience, but is subject to considerable misinterpretation in contrast.
I recommend Backus-Naur Form (BNF). BNF has a number of translators written for it, often involving meta-programming which generates usable source code. I won’t be using any here, but they do exist. The chief reason for writing it out is your own benefit, particularly as you are creating a new language.
Primary BNF looks something like this.
Sentence ::= Subject Verb Object .
Subject ::= I
Subject ::= a Noun
Subject ::= the Noun
Object ::= me
Object ::= you
Object ::= a Noun
Object ::= the Noun
Noun ::= dog
Noun ::= stick
Noun ::= hog
Verb ::= ate
Verb ::= caught
Verb ::= bought
Before I go further, I should discuss the concept of a terminal, and nonterminal, symbol. All of the words in the above definition are some form of symbol; a terminal symbol can be parsed no further. A nonterminal symbol can be broken down. Any program syntax consists of a finite set of nonterminal symbols, and a finite set of terminal symbols. In the example above, “Sentence” is a nonterminal symbol, as it can be inspected to retrieve a Subject, a Verb, and an Object. “I” and “dog” are terminal symbols, as they cannot be broken down further.
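As a rough illustration of that split, here is a minimal Java sketch that simply lists the two symbol sets for Bad English. The class and method names (`BadEnglishSymbols`, `isTerminal`, `isNonTerminal`) are my own placeholders for this example and are not part of the Parser we will build later.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: the two finite symbol sets of the Bad English grammar.
class BadEnglishSymbols {
    // Nonterminals: every name that appears on the left of "::=".
    static final Set<String> NONTERMINALS = new HashSet<>(Arrays.asList(
            "Sentence", "Subject", "Object", "Noun", "Verb"));

    // Terminals: the literal words (and the period) that cannot be broken down.
    static final Set<String> TERMINALS = new HashSet<>(Arrays.asList(
            "I", "a", "the", "me", "you", "dog", "stick", "hog",
            "ate", "caught", "bought", "."));

    static boolean isTerminal(String symbol) {
        return TERMINALS.contains(symbol);
    }

    static boolean isNonTerminal(String symbol) {
        return NONTERMINALS.contains(symbol);
    }

    public static void main(String[] args) {
        System.out.println(isNonTerminal("Sentence")); // true
        System.out.println(isTerminal("dog"));         // true
        System.out.println(isTerminal("Noun"));        // false
    }
}
```

Both sets are finite and they never overlap, which is the property the paragraph above relies on.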
Note that the BNF structure isn’t ideal yet. The “::=” is pronounced “may consist of”, and means as much. A sentence is a subject, followed by a verb, followed by an object, followed by a terminal period. (Subject-verb without object is a syntax error in Bad English.) It is thus represented as “Sentence ::= Subject Verb Object .” So what is a Subject? A subject may consist of “I”, may consist of “a Noun”, and may consist of “the Noun”. A Noun may consist of “dog”, “stick”, or “hog”.
A simple AST would be a Sentence tree, referencing a Subject tree, a Verb tree, and an Object tree, with an implicit period. You might already see how this could be parsed to create a sequence of generated functions, regular expressions, and switch-case statements, which would work as a parser and produce a Sentence tree. But remember how I said that the BNF was not ideal yet? While grammatically correct and certainly easier to form, the biggest issue with making this deterministic is the repeated definitions.
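To make the Sentence tree a little more concrete, here is a minimal hand-written sketch in Java. Everything in it is an assumption made for illustration: the class names (`BadEnglishParser`, `Subject`, `Verb`, `Obj`, `Sentence`), the choice to split input on whitespace with the period as its own token, and the use of `IllegalArgumentException` for syntax errors. It is not the Parser this series will end up with, just one plausible shape for it.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: one small tree class per nonterminal, plus a parser.
class BadEnglishParser {
    static final class Subject {
        final String article; // "a" or "the"; null when the subject is "I"
        final String word;    // "I" or a Noun
        Subject(String article, String word) { this.article = article; this.word = word; }
    }

    static final class Verb {
        final String word;
        Verb(String word) { this.word = word; }
    }

    // Named Obj to avoid clashing with java.lang.Object.
    static final class Obj {
        final String article; // "a" or "the"; null for "me" and "you"
        final String word;
        Obj(String article, String word) { this.article = article; this.word = word; }
    }

    // The root tree; the period is implicit and is not stored.
    static final class Sentence {
        final Subject subject;
        final Verb verb;
        final Obj object;
        Sentence(Subject subject, Verb verb, Obj object) {
            this.subject = subject;
            this.verb = verb;
            this.object = object;
        }
    }

    private static final List<String> NOUNS = Arrays.asList("dog", "stick", "hog");
    private static final List<String> VERBS = Arrays.asList("ate", "caught", "bought");

    private final String[] tokens;
    private int pos = 0;

    BadEnglishParser(String input) {
        // Assumption: tokens are separated by whitespace, the period included.
        this.tokens = input.trim().split("\\s+");
    }

    // Sentence ::= Subject Verb Object .
    Sentence parseSentence() {
        Subject subject = parseSubject();
        Verb verb = new Verb(expect(VERBS, "Verb"));
        Obj object = parseObject();
        expect(Arrays.asList("."), "'.'");
        if (pos != tokens.length) {
            throw new IllegalArgumentException("unexpected input after '.'");
        }
        return new Sentence(subject, verb, object);
    }

    private Subject parseSubject() {
        String token = next();
        switch (token) {
            case "I":
                return new Subject(null, token);
            case "a":
            case "the":
                return new Subject(token, expect(NOUNS, "Noun"));
            default:
                throw new IllegalArgumentException("expected Subject but found '" + token + "'");
        }
    }

    private Obj parseObject() {
        String token = next();
        switch (token) {
            case "me":
            case "you":
                return new Obj(null, token);
            case "a":
            case "the":
                return new Obj(token, expect(NOUNS, "Noun"));
            default:
                throw new IllegalArgumentException("expected Object but found '" + token + "'");
        }
    }

    private String next() {
        if (pos >= tokens.length) {
            throw new IllegalArgumentException("unexpected end of input");
        }
        return tokens[pos++];
    }

    private String expect(List<String> allowed, String what) {
        String token = next();
        if (!allowed.contains(token)) {
            throw new IllegalArgumentException("expected " + what + " but found '" + token + "'");
        }
        return token;
    }

    public static void main(String[] args) {
        Sentence s = new BadEnglishParser("the dog ate a stick .").parseSentence();
        System.out.println(s.subject.word + " / " + s.verb.word + " / " + s.object.word);
        // prints "dog / ate / stick"
    }
}
```

Notice that the resulting Sentence holds no string that looks like the input; it holds a Subject, a Verb, and an Object. A sentence with no object, such as "the dog ate .", is rejected with an exception, matching the rule that subject-verb without object is a syntax error.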
You need to take something like the four definitions for “Object” and condense them into one. In BNF syntax, this is usually done with a vertical bar “|”, meaning “or”. Condensing the nonterminal definitions, we have the much more palatable:
Sentence ::= Subject Verb Object .
Subject ::= a Noun | the Noun | I
Object ::= me | you | a Noun | the Noun
Noun ::= dog | stick | hog
Verb ::= ate | caught | bought
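The condensing step is mechanical enough to sketch in code. The following Java snippet groups repeated definitions by their left-hand side and joins the alternatives with a vertical bar. The names (`BnfCondenser`, `condense`, `render`) are hypothetical, and it assumes each rule is a plain string of the form "Name ::= body"; alternatives come out in the order the rules were given.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: merges repeated "X ::= ..." rules into "X ::= a | b | c".
class BnfCondenser {
    // Groups rule bodies by nonterminal name, preserving first-seen order.
    static Map<String, List<String>> condense(List<String> rules) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String rule : rules) {
            String[] parts = rule.split("::=", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("not a rule: " + rule);
            }
            grouped.computeIfAbsent(parts[0].trim(), k -> new ArrayList<>())
                   .add(parts[1].trim());
        }
        return grouped;
    }

    // Renders one condensed rule per line.
    static String render(Map<String, List<String>> grouped) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            sb.append(entry.getKey())
              .append(" ::= ")
              .append(String.join(" | ", entry.getValue()))
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> rules = Arrays.asList(
                "Noun ::= dog", "Noun ::= stick", "Noun ::= hog",
                "Verb ::= ate", "Verb ::= caught", "Verb ::= bought");
        System.out.print(render(condense(rules)));
        // prints:
        // Noun ::= dog | stick | hog
        // Verb ::= ate | caught | bought
    }
}
```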
Much nicer, yes? But still not ideal. BNF has been around for a while; it actually goes back as far as ALGOL 58, in 1959. John Backus, who devised it, was looking at Noam Chomsky’s work with context-free grammars. The first (arguable) regular expressions were actually developed in 1956 by Stephen Kleene, but they were pretty primitive. Much of the format of BNF served as a major influence on them.
The above works fine for such a simple language as Bad English, as read by a human being with our own translating and interpreting internal hardware. For a program, the simpler the better, remembering that the most important factor going into BNF is preservation of syntax. Consider this instead:
Sentence ::= Subject Verb Object .
Subject ::= (a | the) Noun | I
Object ::= (a | the) Noun | me | you
Noun ::= (d|h)og | stick
Verb ::= ate | (ca|bo)ught
This is perfectly valid, and even more concise. Remove the whitespace, and any one of them can easily be condensed into a working regular expression. More importantly, redundancies become very easy to catch on implementation.
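Here is one way that condensation into a regular expression might look in Java, using `java.util.regex.Pattern`. The class name `BadEnglishRegex` and the method `isSentence` are made up for this example. I am also assuming that words are separated by exactly one space and that the space before the final period is optional; the grammar itself says nothing about either.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: each rule of the condensed grammar becomes a regex fragment.
class BadEnglishRegex {
    // Noun ::= (d|h)og | stick
    static final String NOUN = "(?:(?:d|h)og|stick)";
    // Verb ::= ate | (ca|bo)ught
    static final String VERB = "(?:ate|(?:ca|bo)ught)";
    // Subject ::= (a | the) Noun | I
    static final String SUBJECT = "(?:(?:a|the) " + NOUN + "|I)";
    // Object ::= (a | the) Noun | me | you
    static final String OBJECT = "(?:(?:a|the) " + NOUN + "|me|you)";
    // Sentence ::= Subject Verb Object .
    static final Pattern SENTENCE =
            Pattern.compile(SUBJECT + " " + VERB + " " + OBJECT + " ?\\.");

    // True only when the whole text is a well-formed Bad English sentence.
    static boolean isSentence(String text) {
        return SENTENCE.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSentence("the hog caught a stick.")); // true
        System.out.println(isSentence("I bought you ."));          // true
        System.out.println(isSentence("the dog ate."));            // false
    }
}
```

A pattern like this can only answer yes or no. It recognizes a sentence, but it does not hand you a Sentence tree, which is why it is a check on the grammar rather than a parser.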
So How Do I Use It?
When you are constructing a language compiler or interpreter, especially from the ground up, always write these out first. Often, it will be exclusively for your own reference. Everything you’re about to do with the Parser module is encapsulated in that BNF-formatted data.
I have an abbreviated format, which I usually use, which basically has the type name, “::=”, and a regular expression for the sequence of elements that this type contains. There are ups and downs to using regular expressions directly; the biggest issue that I run into is that single-member may-consist-ofs (such as “Identifier ::= Numeric | Nominal”) can easily be implemented by simply having the classes for Numeric and Nominal extend a class for Identifier. Then it’s just a question of «expected class».isAssignableFrom(«provided class»). Ultimately, if you can comfortably implement it, then yes, you can do it. I’ll be showing you another way, which I feel is often a cleaner implementation.
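For the curious, the inheritance approach might look something like the sketch below. `Class.isAssignableFrom` is a real method on `java.lang.Class`; everything else (the `IdentifierDemo` wrapper, the empty `Identifier`, `Numeric`, `Nominal`, and `Operator` classes, and the `accepts` helper) is a placeholder I invented for the example.

```java
// Hypothetical sketch of "Identifier ::= Numeric | Nominal" done with inheritance.
class IdentifierDemo {
    static class Identifier {}
    static class Numeric extends Identifier {}
    static class Nominal extends Identifier {}
    static class Operator {} // unrelated type, for contrast

    // True when a value of the provided class may stand where the expected class is required.
    static boolean accepts(Class<?> expected, Class<?> provided) {
        return expected.isAssignableFrom(provided);
    }

    public static void main(String[] args) {
        System.out.println(accepts(Identifier.class, Numeric.class));  // true
        System.out.println(accepts(Identifier.class, Nominal.class));  // true
        System.out.println(accepts(Identifier.class, Operator.class)); // false
        System.out.println(accepts(Numeric.class, Identifier.class));  // false
    }
}
```

The last line is worth a second look: the check is directional, so an Identifier slot accepts a Numeric, but a Numeric slot does not accept an arbitrary Identifier.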
Does This Have Something to Do With Your Caveat On the Scanner?
Why yes. Yes it does.
It’s much more convenient for a Parser to simply handle ASTs. Having it handle items which may be either ASTs, or Strings, is extremely difficult in Java. Give me one more chapter to explain the details of “Terminal” and “NonTerminal” ASTs, and then I’ll show you what needs to happen to your Scanner.
I promise it isn’t big.
[To Be Completed]