RSS

Software Language Engineering: Terminals, Nonterminals, and Bears (Early Edition)

18 Nov

The Notion of a Terminal AST

So, in the last chapter, I explained what an AST was structurally. There are formally two kinds of extensions to it. I usually implement them in their own classes, extending the base class of AST.

The first is the terminal. If you’ve programmed in Java for even a month, you know that having a method which accepts two different kinds of unrelated classes in the stead of one another is a bad idea for all kinds of reasons.

It is, actually, possible.

That just doesn’t mean that you should do it.

ASTs are formally called trees, but what they are is nodes on a tree. A program is a single AST, with an extension typically called “Program” as the overarching root node. The branch nodes, or nodes with child nodes, are called non-terminal nodes; the end points are called terminal nodes.

Each of those tokens that your Scanner kicked out? Yeah, that’s a terminal node, disguised as a String.

Let me offer you some of my code for ASTs, Terminals, and NonTerminals. (As before, there are major issues that I’m leaving out until later on. See if you can catch them.)

package oberlin.builder.parser.ast;

import oberlin.builder.codegenerator.RuntimeEntity;
import oberlin.builder.visitor.*;

/**
 * Abstract Syntax Tree, capable of representing any sequence of 
 * statements or the entire program.
 * 
 * @author © Michael Eric Oberlin Nov 3, 2014
 *
 */
public interface AST {
    /**
     * @return number of sub-elements contained in this tree node.
     */
    public int getElementCount();
}
package oberlin.builder;

import oberlin.builder.parser.ast.AST;

/**
 * Basis of all complete abstract syntax trees. Terminals are basically isolated-tokens known only by their spellings.
 * 
 * @author © Michael Eric Oberlin Nov 5, 2014
 *
 */
public class Terminal implements AST {
    private final String spelling;
    
    public Terminal(String spelling) {
        this.spelling = spelling;
    }
    
    public final String getSpelling() {
        return this.spelling;
    }

    @Override
    public String toString() {
        return getSpelling();
    }

    @Override
    public final int getElementCount() {
        return 1;
    }
}

Let those soak in a little. Note that each Terminal has a length of one, meaning that it is the only member of its immediate tree. That will be important when we develop our Parser.

A terminal is an instance of an AST, and can be created by simply passing its token to it. The token is stored in the field “spelling”. Terminal is also fully extensible, as even though their token is consistently their only members, there is a significant difference between an equals operator, a binary operator, and numerical data; and nonterminal nodes take that difference quite seriously.

The Notion of a Nonterminal AST

A nonterminal AST is an AST built not from characters in a String, but from a sequence of other ASTs. The constituent ASTs can be terminals, or nonterminals. Remember BNF? Ever listed item before may-consist-of (::=) was a nonterminal. In instances of single-member representation, such as:

Noun ::= ProperNoun

“Noun” is still a nonterminal, as the implication is that (at least in theory) multiple types of items can be its constituents.

The “Program” node is of course always a nonterminal. I’ve written a nice slice of code for them, too.

package oberlin.builder;

import java.util.*;

import oberlin.builder.parser.ast.AST;

public abstract class NonTerminal implements AST {
    private final List<AST> astList;
    
    public NonTerminal(AST... astList) throws MismatchException {
        if(!checkTypes(astList)) throw new MismatchException("Nonterminal class " + this.getClass() + " does not match " +
                "expression.");
        
        List<AST> list = new ArrayList();
        for(AST ast : astList) {
            list.add(ast);
        }
        this.astList = list; 
    }
    
    public NonTerminal(List<AST> astList) throws MismatchException {
        try {
            this.astList = resolveTypes(astList);
        } catch(BuilderException ex) {
            throw new MismatchException(ex);
        }
    }
    
    public abstract List<Class<? extends AST>> getExpectedASTTypes();
    
    /**
     * Check to see that all provided ASTs are some extension of the expected class of AST,
     * and create the internal list of ASTs from it if possible.
     * 
     * @param astList current list of program ASTs 
     * @return true if the first ASTs match the expected ones, false otherwise
     */
    private List<AST> resolveTypes(List<AST> astList) throws BuilderException {
        List<AST> ownASTs = new ArrayList<>();
        List<Class<? extends AST>> astTypes = getExpectedASTTypes();
        
        for(int i = 0; i < astTypes.size(); i++) {
            Class<? extends AST> provided, expected;
            provided = astList.get(i).getClass();
            expected = astTypes.get(i);
            if(!expected.isAssignableFrom(provided)) {
                throw new BuilderException("Cannot get " + expected + " from " + provided.getClass());
            }
            ownASTs.add(astList.get(i));
        }
        return ownASTs;
    }
    
    /**
     * Check to see that all provided ASTs are some extension of the expected class of AST.
     * 
     * @param astList current array of program ASTs 
     * @return true if the first ASTs match the expected ones, false otherwise
     */
    private boolean checkTypes(AST... astList) {
        List<Class<? extends AST>> astTypes = getExpectedASTTypes();
        
        for(int i = 0; i < astList.length; i++) {
            Class<? extends AST> provided, expected;
            provided = astList[i].getClass();
            expected = astTypes.get(i);
            if(!provided.equals(expected)) {
                return false;
            }
        }
        return true;
    }
    
    @Override
    public int getElementCount() {
        return astList.size();
    }
}

This is where the critical detail is left out of my code. If you already see it, then congratulations! In any case, I’ll let you all in on it at the end of the chapter. It goes back to the advantage of regular expressions over manually checking lists.

In “Programming Language Processors in Java”, David A. Watt and Deryck F. Brown use a visitor pattern to scan for these. That’s a perfectly valid approach. I get the same advantages through my constructor-or-exception pattern, which you have seen before. In fact, if you’re careful, you may notice that it’s another format of the same general pattern; without the traditional Visitor and Element classes. In my github repo, you might notice that I have the outline of a Visitor pattern implemented on these. Pay it no mind, it is proving unnecessary.

Still a good idea. Just unnecessary.

What Do They Mean to a Parser?

A Parser’s job is thus. Iterate through a list of Terminal syntax tree nodes. Compare the beginning of the list to established members of a list of nonterminal nodes. Find a match.

Truncate the matched beginning of the list of ASTs; this is, remove every element that is a member of the new NonTerminal. Now that they are removed, append that NonTerminal to the beginning of the list, in their place. Repeat this process until the list of ASTs is of size one, and return that singular AST.

And of course, if the program cannot be parsed down to a singular AST, throw a Parser Error.

Caveats

These are meant to be identified through members of an enumeration; just as tokens were. However, you might remember BNF items such as:

Command ::= Command+

That is, a variable number of Command nodes can be condensed into the immediate tree of a single Command node. This is an issue for a simple list of ASTs, as in its provided implementation, NonTerminal simply attempts to match a fixed sequence of AST types to a provided list of ASTs, one by one. This is almost always inadequate for a programming language.

I’m going to step around writing a whole new series on how regular expressions work. (For now.) The important detail is that a minimum and maximum number of items, extending a specific AST class, need to be matched. I was tempted to call these “Clauses”, but that just barely relates to the actual definition of “clause”; so instead, we’ll borrow from regular expression terminology and call them Groups.

The down side? I’m still implementing Groups. They will have their own chapter though.

Now, we learn how to build a Scanner. (Again. Sort of.)

Advertisements
 

Tags: , , , , , , , , , , , , ,

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: