So, at this point we have a clearly defined Backus-Naur form for our grammar, a working scanner for terminal tokens, and a working parser for non-terminal generation and the construction of the abstract syntax tree.
The biggest hurdles are over. However, they weren’t quite the last. One thing that must be done before any compiler or interpreter can be built is the construction of a utility for analysis. In this section, I’ll be describing the basic contract for a contextual analyzer. In the next section, I’ll be showing you some example code.
Some of the analysis was already done. The core concept to keep in mind, when building a build tool, is that the fewer the passes, the quicker your tool will run. Basic syntax errors have already been detected, as they prevent the construction of the AST for our language. However, you might notice that there are a few nonsensical things that you can still do.
Enter “3 + 4 = 2” into the algebra tool, and you’ll notice that it will gobble it down just fine, even though it is plainly incorrect. This is where the second phase of analysis comes in.
Can we sweep for these while we generate the abstract syntax tree? Does your processor have more than one core? Then we absolutely can. Even if you are using a single core, the penalty would be rather mild. However, it is important to recognize that code analysis is the work of a separate module.
Types of Analysis
There are two major forms of analysis to worry about: syntactic analysis, and contextual analysis.
Syntactic analysis is almost always the responsibility of the parser. Contextual analysis depends on it. Syntactic analysis is the process that generates the phrase structure of the program; without it, further phases are simply impossible. It’s more commonly known as parsing, and if you’re following this tutorial in sequence, you’ve already done it. If not, there are four preceding chapters dedicated to explaining it in detail.
Generally, on encountering a syntax error during syntactic analysis, I recommend simply skipping the node and checking for what might come next. This matters little for small programs, much less one-line programs; but for larger utilities and libraries it is vanishingly rare for the number of bugs to be limited to one. Often, knowing how one mistake affects another, later mistake is critical to the creation of a satisfactory solution.
As a side effect of continuing the scan, the error reporter may have a hundred additional syntax errors to report, even though they all stem from the same mistake. The count can grow very quickly. Accordingly, for a final edition of a builder, it is best to cap the number of reported errors, after which the program calls it quits. For javac, the default limit is one hundred errors and one hundred warnings, unless the -Xmaxerrs and -Xmaxwarns flags are set to other values.
On the completion of syntactic analysis, without error, we have one tree with a single root node, most commonly called Program. If syntactic analysis does not complete properly, it is still possible to proceed to contextual analysis, but no further, as erroneous code has an arbitrary interpretation. Computers require determinism.
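To make the idea of an error cap concrete, here is a minimal sketch of a reporter that stops collecting once a limit is reached. The class name LimitedErrorReporter and all of its methods are my own assumptions for illustration; this is not how javac implements its limit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical error reporter that stops collecting after a fixed number of errors.
class LimitedErrorReporter {
    private final int maxErrors;
    private final List<String> messages = new ArrayList<>();
    private int suppressed = 0;

    LimitedErrorReporter(int maxErrors) {
        this.maxErrors = maxErrors;
    }

    // Records the error if there is room for it.
    // Returns true while further errors can still be recorded,
    // so the caller knows when it is time to give up.
    boolean report(int line, String message) {
        if (messages.size() >= maxErrors) {
            suppressed++; // count it, but do not store it
            return false;
        }
        messages.add("line " + line + ": " + message);
        return messages.size() < maxErrors;
    }

    int errorCount() {
        return messages.size();
    }

    int suppressedCount() {
        return suppressed;
    }

    // Returns a copy so callers cannot modify the reporter's own list.
    List<String> messages() {
        return new ArrayList<>(messages);
    }
}
```

A parser or analyzer holding one of these could abandon its walk as soon as report returns false, rather than piling up follow-on errors from a single mistake.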
So, as of contextual analysis, we have a complete abstract syntax tree. The remaining question is, does the correctly formed code also conform to the rules of the language? Is a symbol used before declaration, when the language demands that it not be? Is a variable used outside of its proper scope? Are there duplicate declarations, without any rule for how to handle those declarations? The general rule is that if you cannot express the analysis rule in BNF, then it is contextual.
After the contextual analyzer has completed its task, given that there are no show-stopping errors, it returns an AST as well. In this case, it is what’s known as a decorated syntax tree. Every user-defined symbol will maintain a node reference to its declaration in the AST. Every expression, for a language concerned about type, is annotated with its result type.
You may remember, from the introduction to Backus-Naur Form, that it was designed for “context-free grammars”. The term “contextual analysis” more literally means analyzing extensions to the grammar that go beyond the domain of BNF.
The best way to think of a proper decorated syntax tree is as an abstract syntax tree in which any node can be taken at random and read from beginning to end as a complete, definite, and concrete statement.
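As a rough picture of what “decorated” means, the sketch below adds two decoration fields to otherwise ordinary node classes: a link to the declaration, and a result type. Every class and field name here (Type, Declaration, Expression, VariableReference, Decorator) is hypothetical and chosen only for illustration; the node classes in this tutorial may well look different.

```java
// Hypothetical node classes for a decorated tree; all names are illustrative.
enum Type { NUMBER, BOOLEAN }

class Declaration {
    final String name;
    final Type type;

    Declaration(String name, Type type) {
        this.name = name;
        this.type = type;
    }
}

abstract class Expression {
    // Decoration: the result type, null until the analyzer fills it in.
    Type type;
}

class VariableReference extends Expression {
    final String name;
    // Decoration: link back to the declaring node, null until analysis.
    Declaration declaration;

    VariableReference(String name) {
        this.name = name;
    }
}

class Decorator {
    // Links a reference to its declaration and copies the declared type across.
    static void decorate(VariableReference reference, Declaration declaration) {
        reference.declaration = declaration;
        reference.type = declaration.type;
    }
}
```

Before analysis both decoration fields are null; afterwards, a later phase can ask any reference for its declaration or type without searching the tree again.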
Procedure of Analysis
Like every class, we must begin with a concrete description of its contract. This includes its responsibilities, and the resources made available to it. Its responsibility, in broad summary, is to find every occurrence of a contextual unknown and link it to its definition. Resources include the code itself as an abstract syntax tree, and a concrete error reporter.
Every analysis tool, the parser included, must be initialized with an error reporter. It is not recommended to build the error reporting functionality into the class itself: it is often best to share the same error reporter used by the parser (your syntax analyzer), and functionally, error reporting is a very different contract. One class, for one responsibility.
We again apply the visitor pattern, much as we do for syntax analysis. Is it possible to use the same visitor pattern for both syntax analysis and context analysis? Technically, yes, but it is discouraged, as syntactical analysis and contextual analysis are two separate contracts. It is possible to feed the incomplete abstract syntax tree to a waiting context analyzer, but this is a tactic more sophisticated than we are ready for at this juncture. I’ll probably return to it in the final section.
To my knowledge, there is not yet a BNF-equivalent for non-context-free grammars that can easily be used for context analysis. This is not to say that there are none; if you insist on following the same pattern that you did for syntax analysis, you may consider Noam Chomsky’s formal grammars. They use a lot of unconventional symbols, so you may also consider getting accustomed to using a compose key.
Formal grammars go well outside the bounds of this tutorial unless you are working with a set of people who are fully informed on their usage, so I suggest weighing the complexity of your contextual grammar before resorting to them. What you will definitely need is a clear and inarguable description of what these rules are, even if it is in plain English.
The context analyzer will also, for most languages, be creating an identification table as it works. Perhaps your target language does not use variables, and has no need for one; I am assuming that it does. It is also possible that your target language does not mind late definitions, as long as there are eventually definitions. It would not be the first. For my algebra solver, I am currently assuming that it does mind; but later on, perhaps I’ll reformat it so that it doesn’t. Subsequent definitions, or even the loosely related concept called “late binding”, are not as hard to support as you might initially think.
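For a feel of what an identification table might look like, here is a minimal sketch built on a stack of scopes. The class name IdentificationTable and the methods openScope, closeScope, enter, and retrieve are assumptions of mine, and storing declarations as plain Object is a simplification; a real table would hold declaration nodes from the AST.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical identification table: a stack of scopes mapping names to declarations.
class IdentificationTable {
    private final Deque<Map<String, Object>> scopes = new ArrayDeque<>();

    IdentificationTable() {
        openScope(); // the outermost (global) scope
    }

    void openScope() {
        scopes.push(new HashMap<>());
    }

    // Discards the innermost scope; callers should not close the global scope.
    void closeScope() {
        scopes.pop();
    }

    // Adds a declaration (must be non-null) to the innermost scope.
    // Returns false if the name is already declared there (a duplicate).
    boolean enter(String name, Object declaration) {
        return scopes.peek().putIfAbsent(name, declaration) == null;
    }

    // Searches from the innermost scope outward; returns null if undeclared.
    Object retrieve(String name) {
        for (Map<String, Object> scope : scopes) {
            Object declaration = scope.get(name);
            if (declaration != null) {
                return declaration;
            }
        }
        return null;
    }
}
```

A false result from enter signals a duplicate declaration, and a null result from retrieve signals a symbol used without a visible declaration; both are exactly the sort of thing the analyzer would hand to the error reporter.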
We’ll need an abstraction of the core context analyzer. While I chose to call the syntax analyzer “Parser”, a more common term, there is no equivalent that I am aware of for the context analyzer. Thus, we’ll call it “ContextAnalyzer”. I propose a single method in ContextAnalyzer, called check(AST ast). This will initiate the visitor pattern.
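Ahead of the real code, here is a rough sketch of how that contract could fit together. Only ContextAnalyzer, check, and AST come from the text above; the Visitor interface, the three node classes, the ErrorReporter interface, and ConstantEquationChecker are hypothetical stand-ins, and the single rule it enforces (both sides of a constant equation must be equal) is just the “3 + 4 = 2” case from earlier.

```java
// Only ContextAnalyzer, check, and AST come from the text; every other name is hypothetical.
interface Visitor {
    void visitNumber(NumberNode node);
    void visitSum(SumNode node);
    void visitEquation(EquationNode node);
}

abstract class AST {
    abstract void accept(Visitor visitor);
}

class NumberNode extends AST {
    final int value;
    NumberNode(int value) { this.value = value; }
    @Override void accept(Visitor visitor) { visitor.visitNumber(this); }
}

class SumNode extends AST {
    final AST left;
    final AST right;
    SumNode(AST left, AST right) { this.left = left; this.right = right; }
    @Override void accept(Visitor visitor) { visitor.visitSum(this); }
}

class EquationNode extends AST {
    final AST left;
    final AST right;
    EquationNode(AST left, AST right) { this.left = left; this.right = right; }
    @Override void accept(Visitor visitor) { visitor.visitEquation(this); }
}

// The reporter is handed in from outside rather than built into the analyzer.
interface ErrorReporter {
    void report(String message);
}

interface ContextAnalyzer {
    void check(AST ast);
}

class ConstantEquationChecker implements ContextAnalyzer, Visitor {
    private final ErrorReporter reporter;
    private int lastValue; // value of the most recently visited expression

    ConstantEquationChecker(ErrorReporter reporter) { this.reporter = reporter; }

    // Entry point: starts the visitor walk at the given node.
    @Override public void check(AST ast) { ast.accept(this); }

    @Override public void visitNumber(NumberNode node) { lastValue = node.value; }

    @Override public void visitSum(SumNode node) {
        node.left.accept(this);
        int left = lastValue;
        node.right.accept(this);
        lastValue = left + lastValue;
    }

    @Override public void visitEquation(EquationNode node) {
        node.left.accept(this);
        int left = lastValue;
        node.right.accept(this);
        int right = lastValue;
        if (left != right) {
            reporter.report("equation does not hold: " + left + " != " + right);
        }
    }
}
```

With a reporter that simply collects messages into a list, checking the tree for 3 + 4 = 2 produces one message, while 3 + 4 = 7 produces none.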
Once I complete the code, I’ll show it to you in the next lesson.