RSS

Software Language Engineering: Establishing a Parser, Part Two (Early Edition)

27 Dec

Introductory

So, you’ve ready part one, and you’re at least familiar with a visitor pattern, right? If not, I strongly encourage reading the two injected sections first.

A parser delegates the vast majority of its work to a Visitor. More appropriately stated, it depends upon the Visitor in order to do its work, as the Visitor is responsible for creating the requested nodes.

PhraseStructure classes

I have a simple extension of Visitor which I have created purely for the sake of future modifications. It’s called PhraseStructure. At the moment, it looks like this:

package oberlin.builder.parser;

import oberlin.builder.*;
import oberlin.builder.parser.ast.AST;
import oberlin.builder.visitor.Visitor;

import java.util.*;

public interface PhraseStructure extends Visitor {
}

….which makes it a marker interface. However, should you or I choose to add specific behavior to the Visitor which strictly relates to this program, it’s an excellent low-footprint stand-in.

The point where it, and by that I also mean Visitor, shows its worth is in AlgebraicPhraseStructure.

package oberlin.algebra.builder.parser;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

import oberlin.builder.parser.Parser;
import oberlin.builder.parser.PhraseStructure;
import oberlin.builder.parser.SourcePosition;
import oberlin.builder.parser.ast.AST;
import oberlin.builder.parser.ast.EOT;
import oberlin.algebra.builder.nodes.*;

public class AlgebraicPhraseStructure implements PhraseStructure {
    
    private Map<Class<? extends AST>, BiFunction<Parser<?>, 
        SourcePosition, ? extends AST>> map = new HashMap<>();
    {
        map.put(Program.class, new BiFunction<Parser<?>,
                SourcePosition, AST>() {
            @Override
            public Program apply(Parser<?> parser,
                    SourcePosition position) {
                Program program = null;
                SourcePosition previous =
                    parser.getPreviousTokenPosition();
                AST currentToken = parser.getCurrentToken();
                
                Equality equality = (Equality) parser.getVisitor()
                        .visit(Equality.class, parser, previous);
                program = new Program(previous, equality);
                
                if(!(currentToken instanceof EOT)) {
                    parser.syntacticError("Expected end of program",
                            currentToken.getClass().toString());
                }
                
                return program;
            }
        });
        map.put(Equality.class, new BiFunction<Parser<?>,
            SourcePosition, AST>() {

            @Override
            public AST apply(Parser<?> parser,
                    SourcePosition position) {
                Equality equality = null;
                List<AST> nodes = new ArrayList<>();
                SourcePosition operationPosition =
                    new SourcePosition();
                
                parser.start(operationPosition);
                //parse operation
                AST operation = parser.getVisitor().visit(
                        Operation.class, parser, operationPosition);
                nodes.add(operation);
                if(parser.getCurrentToken() instanceof Equator) {
                    nodes.add(parser.getCurrentToken());
                    parser.forceAccept();
                    nodes.add(parser.getVisitor().visit(
                            Operation.class, parser,
                            operationPosition));
                } else {
                    parser.syntacticError("Expected: equator",
                            Integer.toString(
                                parser.getCurrentToken().getPosition()
                                .getStart()));
                }
                parser.finish(operationPosition);
                
                equality = new Equality(operationPosition, nodes);
                return equality;
            }
            
        });
        map.put(Operation.class, new BiFunction<Parser<?>,
            SourcePosition, AST>() {

            @Override
            public AST apply(Parser<?> parser,
                SourcePosition position) {
                
                Operation operation = null;
                List<AST> nodes = new ArrayList<>();
                SourcePosition operationPosition =
                    new SourcePosition();
                
                parser.start(operationPosition);
                //parse identifier
                AST identifier = parser.getVisitor().visit(
                        Identifier.class,
                        parser, operationPosition);
                nodes.add(identifier);
                //look for operator
                if(parser.getCurrentToken() instanceof Operator) {
                    nodes.add(parser.getCurrentToken());
                    parser.forceAccept();
                    nodes.add(parser.getVisitor().visit(
                            Operation.class,
                            parser, operationPosition));
                }
                parser.finish(operationPosition);
                
                operation = new Operation(operationPosition, nodes);
                return operation;
            }
            
        });
        map.put(Identifier.class, new BiFunction<Parser<?>,
            SourcePosition, AST>() {

            @Override
            public AST apply(Parser<?> parser,
                SourcePosition position) {
                
                Identifier identifier = null;
                List<AST> nodes = new ArrayList<>();
                SourcePosition identifierPosition = new SourcePosition();
                
                parser.start(identifierPosition);
                if(parser.getCurrentToken() instanceof LParen) {
                    nodes.add(parser.getCurrentToken());
                    parser.forceAccept();
                    
                    nodes.add(getHandlerMap().get(Operation.class)
                            .apply(parser, identifierPosition));
                    parser.accept(Operation.class);
                    
                    nodes.add(parser.getCurrentToken());
                    parser.accept(RParen.class);
                } else if(parser.getCurrentToken()
                    instanceof Nominal) {
                    nodes.add(parser.getCurrentToken());
                    parser.forceAccept();
                } else if(parser.getCurrentToken()
                    instanceof Numeric) {
                    nodes.add(parser.getCurrentToken());
                    parser.forceAccept();
                } else {
                    parser.syntacticError(
                            "Nominal or numeric token expected",
                            parser.getCurrentToken().getClass()
                                 .toString());
                }
                parser.finish(identifierPosition);
                identifier =
                    new Identifier(identifierPosition, nodes);
                
                return identifier;
            }
            
        });
    }
    
    @Override
    public Map<Class<? extends AST>, BiFunction<Parser<?>,
        SourcePosition, ? extends AST>> getHandlerMap() {
        // TODO Auto-generated method stub
        return map;
    }
    
}

For all of the code, you’ll note that there’s only one method. getHandlerMap() returns a map, intrinsic to the PhraseStructure, which maps classes (of any extension of AST) to functions which return them. These functions, specifically BiFunctions, accept only a Parser, with all of its delicious utility methods, and a SourcePosition so that they have an idea where they’re looking. All necessary data is in those two items alone.

A Note on Source Position

If you’ve been paying very close attention, you may have noticed that SourcePosition isn’t strictly necessary to translate. You’re right, mostly; but when something goes wrong, it is SourcePosition which tells you where the problem showed up, and what you need to tinker with in order to properly format the program.

It wasn’t always like this. Early compilers would simply indicate that the program was misformatted. More likely, just print “ERROR”, as the notion of software development (which didn’t involve punching holes in punch cards) was relatively young, and ethics weren’t really a thing yet.

This wasn’t a big deal, while programs were generally only a few lines and had exceedingly small lexicons of keywords. When Grace Murray Hopper put together A-0, the idea of adding sophisticated error reporting would have seemed like over-programming; mostly because it would have been over-programming.

As time went on, and machines got more sophisticated, having an error in your code could take days to find. If you had more than one error, then you were really in trouble. So, eventually, a team came up with the idea of reporting the exact point where the format failed, and history was made. (I’m not sure who that was, so if anyone knows, please inform me through the comments.)

Today, every well-designed AST is aware of exactly where it, or its constituents, begin and end. If you want to be especially sophisticated, you can have it remember line number, and even character number, too.

Our current edition of our algebra-like language is generally one-line-only and relatively domain specific, but memorization of where the ASTs go wrong provides room for growth.

Visit Handlers

If you don’t remember specifically, Visitor’s visit is not a complicated method.

public default AST visit(Class<? extends AST> element,
        Parser<?> parser, SourcePosition position) {
    AST ast = getHandlerMap().get(element).apply(parser, position);
    return ast;
}

It simply retrieves the map, grabs the BiFunction associated with the provided element, and applies it to the parser and an initial source position. From there, all work goes on in the map.

The visit handlers themselves can get pretty messy, if you aren’t careful. They begin by initializing their specific brand of AST to null. A NullaryAST or an Optional might be better here, as I have a serious aversion to methods that can return null, but I haven’t made that change yet. This AST is the item which will be initialized through the context of local nodes.

Next, a SourcePosition is initialized. This will be the element passed to the constructor for our AST. When Parser.start(SourcePosition) is called, it updates the starting point of SourcePosition. When Parser.finish(SourcePosition) is called, it updates the end point. These are set to Parser’s currently known coordinate in the code. Thus, before anything else is done, Parser.start(…) is called.

After the SourcePosition has been started, the class of each token is checked against allowed conditions. As such, the bulk of these methods are conditionals. It’s here that I must explain the usage of Parser.accept(…) and Parser.forceAccept().

Parser.accept(…) checks the class of the current token against the provided one, and if they match, increments the internal pointers. If not, it reports a syntactic error, and leaves the pointers alone. Since the pointer is left alone, additional nodes can still be parsed, and multiple errors can be caught, even in the sake of a token simply being missing or skipped. Parser.forceAccept() always accepts the current node, regardless of its type, and increments the pointers. (In fact, it is called from within accept(…) after the conditional checks are completed.

Once all possibilities have been checked for, the AST is initialized and returned. If at any point no possibilities remain for this token, a syntax error is thrown, and the program continues to parse (even though it cannot complete the tree).

Is There Another Way to Do This?

There’s always another way, but that doesn’t mean that it’s necessarily better. One method might be catching customized exceptions on a distinct schedule, which also works pretty well; the down side is that it only allows for the detection of a single error at a time. Another would be the construction of a string representing the AST types, and usage of a regular expression on it; but as I’ve said before, the construction of improper code, even if it compiles, can create devastatingly slow regular expressions at seemingly arbitrary times.

I’ve experimented with both on the way to this code, which is precisely why writing this section took so much longer than the others. There are probably dozens of other readily available methods which I haven’t even thought of yet. One of them, somewhere, might even be faster or sufficiently more effective than the visitor pattern.

This is not me saying that the visitor pattern is perfect, either. This implementation of visitor has a lot of marks against it. It is extremely tightly coupled, for starters, as loose as the interface alone may be. It uses “instanceof” all over the place, which begs for the implementation of further methods to keep to an OOP standard. It has many anonymous classes around, which substantially increase the memory footprint. The slightest of mistakes in the layout of the visitor functions will result in an unbounded recursion, which will quickly and almost silently crash your program, so it is not recommended for avant garde programming—always start with a properly reduced Backus Naur Form of your language. I could go on, such as with the many potential issues with secondary delegation, which the visitor pattern survives on, but this more than covers it.

My advice? Use ethical method names, comment until your fingers bleed, trigger exceptions everywhere the program pointer shouldn’t be, and benchmark benchmark benchmark. In select cases, the Visitor is still your friend, provided that you treat it like a sacred relic wired to a bomb.

Final Notes

You may notice that this is awfully similar to the enumeration used for scanning. You can, in fact, create a Scanner from a Parser, by treating every character as an individual token. However, this has not been done in a long time, as regular expressions are quite reliable for cases like this. I may yet develop a Scanner from a Parser, but only as an example, this does not mean that I recommend it.

You can think of the individual differences between one language and another as the impulse behind these enumerations and mappings. Parser will always be Parser, PhraseStructure will always be PhraseStructure. However, when you need to compile a specific language into an AST tree, the features that make that language what it is can all be stored in the enumerations and maps. Because of this, this API allows for rapid construction of builders.

Next, we talk about identification tables.

Advertisements
 

Tags: , , , , , , , , , , , , , ,

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: