Tag Archives: Noam Chomsky

Software Language Engineering: Analysis

(Early Edition)

So, at this point we have a clearly defined Backus-Naur form for our grammar, a working scanner for terminal tokens, and a working parser for non-terminal generation and the construction of the abstract syntax tree.

The biggest hurdles are over. However, they weren’t quite the last. One thing that must be done before any compiler or interpreter can be built is the construction of a utility for analysis. In this section, I’ll be describing the basic contract for a contextual analyzer. In the next section, I’ll be showing you some example code.


Some of the analysis was already done. The core concept to keep in mind, when building a build tool, is the fewer the passes, the quicker your tool will run. Basic syntax errors have already been detected; as they prevent the construction of the AST for our language. However, you might notice that there are a few obtuse things that you can still do.

Enter “3 + 4 = 2” into the algebra tool, and you’ll notice that it will gobble it down just fine; even though it is concretely incorrect. This is where the second phase of analysis comes in.

Can we sweep for these while we generate the abstract syntax tree? Does your processor have more than one core? Then we absolutely can. Even if you are using a single-core, the penalty would be rather mild. However, it is important to recognize that code analysis is the work of a separate module.

Types of Analysis

There are two major forms of analysis to worry about: syntactic analysis, and contextual analysis.

Syntactic Analysis

Syntactic analysis is almost always the responsibility of the parser. Contextual analysis depends on it. Syntactic analysis is the process that generates the phrase structure of the program; without it, further phases are obliquely impossible. It’s more commonly known as parsing, and if you’re following this tutorial in sequence, you’ve already done it. If not, there are four preceding chapters dedicated to explaining it in detail.

Generally, I recommend, on the establishment of a syntax error during syntactic analysis, simply skipping the node and checking for what might be next. This is not an issue for small programs, much less one-line programs; but for larger utilities and libraries it is vanishingly uncommon for the number of bugs to be limited to one. Often, knowledge of the effect on another, later, mistake is critical to the creation of a satisfactory solution,

As a side effect of continuing the scan, the error reporter may have a hundred additional syntax errors to report, even though they all reference the same mistake. This can explode exponentially. Accordingly, for a final edition of a builder, it is best to limit the number of reported errors before the program calls it quits. For Javac, the limit is a hundred errors, unless the -Xmaxerrs and -Xmaxwarns flags are set to a higher value.

On the completion of syntactic analysis, without error, we have a singular tree with a single root node, most commonly called Program. If syntactic analysis does not complete properly, it is still possible to proceed to contextual analysis, but no further, as erroneous code has an arbitrary interpretation. Computers require determinism.

Contextual Analysis

So, as of contextual analysis, we have a complete abstract syntax tree. The remaining question is, does the correctly formed code also conform to the controls of the language? Is a symbol used before declaration, when the language demands that it not be? Is a variable used outside of its proper scope? Are there duplicate declarations, without any rule for how to handle those declarations? The general rule is that if you cannot establish the analysis rule in BNF, then it is contextual.

After the contextual analyzer has completed its task, given that there are no show-stopping errors, it returns an AST as well. In this case, it is what’s known as a decorated syntax tree. Every user-defined symbol will maintain a node reference to its declaration in the AST, Every expression, for a language concerned about type, is demarcated with its result type.

You may remember, from the introduction to Backus-Naur Form, that it was designed for “context-free grammars”. The term “contextual analysis” more literally means analyzing extensions to the grammar that supersede the domain of BNF.

The best way to think of a proper decorated syntax tree is as an abstract syntax tree, with which any node can be taken at random and read from beginning to end, which forms a complete, definite, and concrete statement.

Procedure of Analysis

Like every class, we must begin with a concrete description of its contract. This includes its responsibilities, and the resources made available to it. Its responsibility, in broad summary, is to find every occurrence of a contextual unknown and link it to its definition. Resources include the code itself as an abstract syntax tree, and a concrete error reporter.

Every analysis tool, the parser included, must be initialized with an error reporter. It is not recommended to make the error reporting functionality ingrained to the class, as it is often best the same error reporter used by parser (your syntax analyzer), and functionally, it has a very different contract—one class, for one responsibility.

We again apply the visitor pattern, much as we do for syntax analysis. Is it possible to use the same visitor pattern for both syntax analysis and context analysis? Technically, yes, but it is discouraged, as syntactical analysis and contextual analysis are two separate contracts. It is possible to feed the incomplete abstract syntax tree to a waiting context analyzer, but this is a tactic more sophisticated than we are ready for at this juncture. I’ll probably return to it in the final section.

To my knowledge, there is not yet a BNF-equivalent for non-context-free grammars that can easily be used for context analysis. This is not to say that there are none; if you insist on following the same pattern that you did for syntax analysis, you may consider Noam Chomsky‘s formal grammar. It uses a lot of unconventional symbols, so you may also consider getting accustomed to using a compose key.

As formal grammars, unless you are working with a set of people who are fully informed on their usage, go well outside of the bounds of this tutorial, I suggest considering the depth of complexity of your contextual grammar before resorting to them. What you will definitely need is a clear and inarguable description of what these rules are, even if it is in plain English.

The context analyzer will also, for most languages, be creating an identification table as it works. Perhaps your target language does not use variables, and has no need for one; I am assuming that it does. It is also possible that your target language does not mind late definitions, as long as there are eventually definitions. It would not be the first. For my algebra solver, I am currently assuming that it does mind; but later on, perhaps I’ll reformat it so that it doesn’t. Subsequent definitions, or even a loosely related concept called “late binding”, It isn’t as hard to do as you might initially think.

Summary Abstractions

We’ll need an abstraction of the core context analyzer. While I chose to call the syntax analyzer “Parser”, a more common term, there is no equivalent that I am aware of for the context analyzer. Thus, we’ll call it “ContextAnalyzer”. I propose a single method in ContextAnalyzer, called check(AST ast). This will initiate the visitor pattern.

Once I complete the code, I’ll highlight it to you in the next lesson.


Tags: , , , , , , , , , , , , , ,

Software Language Engineering: The Abstract Syntax Tree (Early Edition)

What is an Abstract Syntax Tree?

(Let me preface this by saying that syntax trees are hardly the only way to write a translator; but they’re broadly used because of their simplicity, versatility, and flexibility.)

“Abstract” as in abstraction. An abstract syntax tree (henceforth AST) is the manner in which we will be encoding every command of our source and object programs. Think of it as an internal representation of the semantics, derived from the original syntax, and encodable in the final syntax.

The biggest trick with an AST is to remember not to confuse it for its representation. Syntax and semantics are elemental, in the old-school black-magic Aleister Crowley sense. You can’t even talk about one without comparing it to the other. This is, in my personal analysis, where a lot of the works-like-magic notion about compilers comes from; but there’s nothing magical about it.

To understand this, think about what happens when you’re speaking with another human being. The art of communication is the act of transference of a thought from one mind to another, through some mutually external medium; be it speech, or writing, or hand gestures. When someone begins to speak, and you begin to listen, your brain (operationally) initially considers every single possible thing that the speaker is about to say. As they pronounce the phonemes and syllables and words of their sentence, the possibilities narrow, and different semantic clusters in your mental network go dark. Once they’ve finished their statement, given that it was well-formed, you have a reasonably precise notion of what it was that was in their mind, which they are trying to convey to you. You can duplicate it, within reason. If it was ill-formed, you might only have a vague, or even a holistically incorrect, notion.

This is the concept of an AST; the AST is the thought itself. It is a (reasonable) binary representation of the program data. I used to think of ASTs as being something like a sentence diagram; but this isn’t really the case, and it in fact held me back for some time. My script grammars were limited to only mild variations of the structure of the language they were written in. ASTs are not grammatical structures; they are more implicit than that. This is the greatest reason why you must not confuse what an AST is for what an AST might look like; it’s the difference between using it like one would a first language, and a second language.

I’m assuming that you have already read the chapter on building an efficient Scanner. I’m also encouraging you to keep the Glossary open in another tab, as I may accidentally jump ahead in my definitions in an early edition of this piece. (Promise, I’ll review it with a giant whiteboard and lots of draft-mechanics before I declare the series finished; you are always free to scream at me about it in the comments.)

So, as an example, let’s consider basic English. Actually, let’s consider Bad English, because I’m not willing to jot down every single rule for a language with a lexicon three times larger than any other western language. Bad English will have a limited set of rules, and will generate a set of only a few basic sentences. For further simplicity, they will all be declarative, much like a programming language.

The first step, before you even begin to piece together an AST, is to write down the rules of your language in a formal specification; that is, a specification with a distinct set of rules to its grammar, which often must be known well by the reader, and allows for little chance of misinterpretation. An informal specification is usually presented in natural language, like this article; it is understood by a much wider audience, but is subject to considerable misinterpretation in contrast.

I recommend Backus-Naur Form (BNF). BNF has a number of translators written for it, often involving meta-programming which generates usable source code. I won’t be using any here, but they do exist. The chief notion for writing it is for you, particularly as you are creating a new language.

Primary BNF looks something like this.

Sentence ::= Subject Verb Object .
Subject  ::= I
Subject  ::= a Noun
Subject  ::= the Noun
Object   ::= me
Object   ::= you
Object   ::= a Noun
Object   ::= the Noun
Noun     ::= dog
Noun     ::= stick 
Noun     ::= hog
Verb     ::= ate
Verb     ::= caught
Verb     ::= bought

Before I go further, I should discuss the concept of a terminal, and nonterminal, symbol. All of the words in the above definition are some form of symbol; a terminal symbol can be parsed no farther. A nonterminal symbol can be broken down.  Any program syntax consists of a finite set of nonterminal symbols, and a finite set of terminal symbols. In the example above, “Sentence” is a nonterminal symbol, as it can be inspected to retrieve a Subject, a Verb, and an Object. “I” and “dog” are terminal symbols, as they cannot be broken down further.

Note that the BNF structure isn’t ideal yet. The “::=” is pronounced “may consist of”, and means as much. A sentence is a subject, followed by a verb, followed by an object, followed by a terminal period. (Subject-verb without object is a syntax error in Bad English.) It is thus represented as “Sentence := Subject Verb Object .” So what is a Subject? A subject may consist of “I”, may consist of “a Noun”, and may consist of “the Noun”. A Noun may consist of “dog”, “stick”, or “hog”.

A simple AST would be a Sentence tree, referencing a Subject tree, a Verb tree, and an Object tree, with an implicit period. You might already see how this could be parsed to create a sequence of generated functions, regular expressions, and switch-case statements, which would work as a parser and produce a Sentence tree. But remember how I said that the BNF was not complete? While grammatically correct and certainly easier to form, the biggest issue with making this deterministic is the repeated definitions.

You need to take something like the four definitions for “Object” and condense them into one. In BNF syntax, this is usually done with a vertical bar “|”, meaning “or”. Condensing the nonterminal definitions, we have the much more palatable:

Sentence ::= Subject Verb Noun .
Subject  ::= a Noun | the Noun | I
Object   ::= me | you | a Noun | the Noun
Noun     ::= dog | stick | hog
Verb     ::= ate | caught | bought

Much nicer, yes? But still not ideal. BNF has been around for a while; it actually goes back as far as ALGOL 58, in 1959. He was looking at Noam Chomsky’s work with context-free grammars, The first (arguable) regular expressions were actually developed in 1956 by Stephen Keene, but they were pretty primitive. Much of the format of BNF served as a major influence on them.

The above works fine for such a simple language as Bad English, as read by a human being with our own translating and interpreting internal hardware. For a program, the simpler the better, remembering that the most important factor going into BNF is preservation of syntax. Consider this instead:

Sentence ::= Subject Verb Object .
Subject  ::= (a | the) Noun | I
Object   ::= (a | the) Noun | me | you
Noun     ::= (d|h)og | stick
Verb     ::= ate | (ca|bo)ught

This is perfectly valid, and even more concise. Remove the whitespace, and any one of them can easily be condensed into a working regular expression. More importantly, redundancies become very easy to catch on implementation.

So How Do I Use It?

When you are constructing a language compiler or interpreter, especially from the ground up, always write these out first. Often, it will be exclusively for your own reference. Everything you’re about to do with the Parser module is encapsulated in that BNF-formatted data.

I have an abbreviated format, which I usually use, which basically has the type name, “::=”, and a regular expression for the sequence of elements that this type contains. There are ups and downs to using regular expressions directly; the biggest issue that I run into is that single-member may-consist-ofs (such as, “Identifier ::= Numeric | Nominal” can easily be implemented by simply having the classes for Numeric and Nominal extend a class for Identifier. Then it’s just a question of «expected class».isAssignableFrom(«provided class»). Ultimately, if you can comfortably implement it, then yes, you can do it. I’ll be showing you another way, which I feel is often a cleaner implementation.

Does This Have Something to Do With Your Caveat On the Scanner?

Why yes. Yes it does.

It’s much more convenient for a Parser to simply handle ASTs. Having it handle items which may be either ASTs, or Strings, is extremely difficult in Java. Give me one more chapter to explain the details of “Terminal” and “NonTerminal” ASTs, and then I’ll show you what needs to happen to your Scanner.

I promise it isn’t big.

[To Be Completed]


Tags: , , , , , , , , , , , , ,