RSS

Software Language Engineering: Establishing a Parser, Part One (Early Edition, Refurbished)

19 Nov

The Introductory Ramble

I suppose I’ve been building up to this thing for quite a while now.

Note that this is part one of the Parser. Were this a singular project, I might be long-since finished with it; tragically I’m not that much of a jerk and I’m trying to write a useful tutorial too. This means I need to get it right, or as right as possible, the first time; so please pardon delays in bringing it to you. I’m also avoiding items like JLex so that I can convey exactly how a build tool is meant to be structured, for the sake of education.

You see, back in the days when it was a responsible thing to call oneself, I used to be a journalist. Specifically, I was a field journalist and interviewer for something called a “newspaper”, which was an antiquated piece of low-grade paper with news printed on it. It was a strange and inconvenient system, but it was what we had before the internet. Anyway, both language and ethics are important to me out of habit, which is also perhaps why I’m not a news journalist anymore.

As a small foreword, there are a lot of things which we must keep in mind while developing a general-purpose parser, as the operations of a Parser on a List<AST> are a bit of a juggle. In spite of the “One Class, One Responsibility” principle, the environment that a Parser is typically injected into requires a great deal of attention. As such, I will be splitting this section into two parts, with a blurb in the middle on the visitor pattern (which you are welcome to skip if you are already familiar with it), and a blurb on using JavaFX to render graphical ASTs. The second piece will be a completion of the parser design. (Point being, once you start reading, expect to be reading for quite a bit.)

The Parser

To begin, we have a List<AST> full of all Terminals in the program, in order. NonTerminals are where a lot of the magic happens, and each of them contains several nodes which may be either Terminals, or further NonTerminals. The biggest concern with any translator is efficiency in building this tree.

I have attempted several methods in the construction of this tree. One of them involved a repurposing of java.util.regex, but I recommend against it now that it has been practiced. There are too many temptations to use forward references and a number of other tools, and they can have serious caveats. As much as I usually praise regexes for their speed, machine generated ones are only usually fast. Depending on the structure of the code to manage the structure of the regular expression is an invitation to disaster, by which I mean, consumer remorse.

In the end, the visitor pattern won out; but if you would like to attempt it in another way, I encourage it. I’ll be documenting the visitor pattern here, as it assures a single pass through the target code.

Before I go further, let’s begin with the source code to parser.

import java.util.List;

import oberlin.builder.parser.ast.AST;
import oberlin.builder.parser.ast.EOT;
import oberlin.builder.visitor.Visitor;

/**
 * @author © Michael Eric Oberlin Dec 15, 2014
 *
 * @param <V> the visitor type that the parser
 * uses for creating nonterminal nodes
 * @param <P> the target class for the parsing, intended
 * to be the root of the produced syntax tree
 */
public abstract class Parser<V extends Visitor> {
    
    private final List<AST> astList;
    private ErrorReporter reporter = new ErrorReporter();
    private AST currentToken;
    private SourcePosition currentTokenPosition = new SourcePosition();
    private SourcePosition previousTokenPosition = new SourcePosition();
    protected V visitor;
    
    public Parser(V visitor, List<AST> astList, ErrorReporter reporter) {
        this.visitor = visitor;
        
        //Do a little defensive programming
        if(astList.isEmpty())
            throw new RuntimeException("AST list cannot begin at zero size");
        
        //Scan for a specific reporter, or use the default error reporter. Of
        //course error reporter can't really be null.
        if(reporter != null)
            this.reporter = reporter;
        this.astList = astList;
        
        this.currentToken = astList.get(0);
    }
    
    /**
     * Checks whether the current node is of the expected type; if so,
     * increments the token; otherwise, throws a syntactic error.
     * 
     * @param astExpected the currently anticipated node type in the list
     */
    public void accept(Class<? extends AST> astExpected) {
        if(astExpected.isAssignableFrom(currentToken.getClass())) {
            forceAccept();
        } else {
            reporter.error(new SyntaxException("Expected " +
                astExpected + ", got " + currentToken + " instead; "));
        }
    }
    
    public void forceAccept() {
        previousTokenPosition = currentTokenPosition;
        currentTokenPosition = currentTokenPosition.increment();
        try {
            currentToken = astList.get(currentTokenPosition.getStart());
        } catch(IndexOutOfBoundsException ex) {
            currentToken = new EOT();    //end of tree
        }
    }
    
    /**
     * Records the position of the beginning of a phrase.
     * This is the position of first constituent AST.
     * @param position element to record the begin index into.
     */
    public void start(SourcePosition position) {
        position.setStart(currentTokenPosition.getStart());
    }
    
    /**
     * Finish records the position of the end of a phrase.
     * This is the position of the last constituent AST.
     * @param position element to record the end index into.
     */
    public void finish(SourcePosition position) {
        position.setFinish(currentTokenPosition.getFinish());
    }
    
    /** utility method for reporting syntax errors */
    public void syntacticError(String messageTemplate,
        String tokenQuoted) {
        SourcePosition pos = currentTokenPosition;
        reporter.error(new SyntaxException(
                tokenQuoted + " " + messageTemplate + ": " +
                pos.getStart() + ".." + pos.getFinish()));
    }
    
    /**
     * Begin parsing, aiming to create the provided class
     * as a root class for the abstract syntax tree.
     * 
     * @param rootClass Class of object which should, provided
     * no exceptions, be a tree root.
     * @return complete tree, stemming from class rootClass,
     * expressing program.
     */
    public AST parse(Class<? extends AST> rootClass) {
        return visitor.visit(rootClass, this, currentTokenPosition);
    }

    public SourcePosition getPreviousTokenPosition() {
        return this.previousTokenPosition;
    }
    
    public SourcePosition getCurrentTokenPosition() {
        return this.currentTokenPosition;
    }

    public AST getCurrentToken() {
        return currentToken;
    }
    
    public V getVisitor() {
        return visitor;
    }
    
    public ErrorReporter getErrorReporter() {
        return reporter;
    }

}

It all begins with a call to parse(…), passing in the class that represents a completed program. Parse enters the visitor with a reference to the type of analyzer needed; that to deliver a properly formatted “rootClass”. I’m going to abstain, for now, on explaining how the visitor works; that’s covered in another section. For the moment, consider the variant utility methods in Parser.

It contains the original list of ASTs, presumably all Terminals; and a number of pointer fields. The List<AST> is technically outside data, passed in to form a tree from. In the class’s current state, it is very important to remember that it cannot be used on more than one chunk of code at once.

An Algebraic Parser

The intention, just as with a scanner, is to make life as simple for the end programmer as possible. So, AlgebraicParser isn’t that big a deal to implement.

package oberlin.algebra.builder.parser;

import oberlin.builder.parser.Parser;

public class AlgebraicParser extends
        Parser<AlgebraicPhraseStructure> {

    @Override
    public Class<AlgebraicPhraseStructure> getPhraseStructure() {
        return AlgebraicPhraseStructure.class;
    }

}

Most of the work is done by AlgebraicPhraseStructure, an extension of an interface I’ve named PhraseStructure, which extends Visitor. Accordingly, I’ll be addressing it in full in part two. The take-away is that PhraseStructure encapsulates all of the constraints on how a command is properly formed.

Well, it forms no critique of ethics, but it at least worries about it’s readability.

The Take-Away

One, but hardly the only, major confusion in Parser construction is the encapsulation of the act of parsing, versus the encapsulation of the rules of parsing. They are, ultimately, two different things requiring two different programs (or in Java’s language, classes).

Parser returns a singular AST, which is formed from a number of others. Each of the others is formed for a list of still more unique nodes. We have a specific advantage provided to us here, that being the enforced uniqueness of a Java object. Once the tree is completed, remaining operations on it, such as translating it to another kind of tree (varying in language), are fascinating but relatively trivial.

Next, read about the visitor pattern, and consider reading about the GUI interface built for this; then we’ll discuss the workhorse of the scanning and parsing utility—the phrase structure.

Advertisements
 
 

Tags: , , , , , ,

One response to “Software Language Engineering: Establishing a Parser, Part One (Early Edition, Refurbished)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: