
Software Language Engineering: Establishing a Parser, Part Two (Early Edition)

Introductory

So, you’ve read part one, and you’re at least familiar with the visitor pattern, right? If not, I strongly encourage reading those two interjected sections first.

A parser delegates the vast majority of its work to a Visitor. More appropriately stated, it depends upon the Visitor in order to do its work, as the Visitor is responsible for creating the requested nodes.

PhraseStructure classes

I have a simple extension of Visitor which I have created purely for the sake of future modifications. It’s called PhraseStructure. At the moment, it looks like this:

package oberlin.builder.parser;

import oberlin.builder.*;
import oberlin.builder.parser.ast.AST;
import oberlin.builder.visitor.Visitor;

import java.util.*;

public interface PhraseStructure extends Visitor {
}

…which makes it a marker interface. However, should you or I choose to add behavior to the Visitor that relates strictly to this program, it’s an excellent low-footprint stand-in.

The point where it (and by that I also mean Visitor) shows its worth is in AlgebraicPhraseStructure.

package oberlin.algebra.builder.parser;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

import oberlin.builder.parser.Parser;
import oberlin.builder.parser.PhraseStructure;
import oberlin.builder.parser.SourcePosition;
import oberlin.builder.parser.ast.AST;
import oberlin.builder.parser.ast.EOT;
import oberlin.algebra.builder.nodes.*;

public class AlgebraicPhraseStructure implements PhraseStructure {
    
    private Map<Class<? extends AST>, BiFunction<Parser<?>, 
        SourcePosition, ? extends AST>> map = new HashMap<>();
    {
        map.put(Program.class, new BiFunction<Parser<?>,
                SourcePosition, AST>() {
            @Override
            public Program apply(Parser<?> parser,
                    SourcePosition position) {
                Program program = null;
                SourcePosition previous =
                    parser.getPreviousTokenPosition();
                AST currentToken = parser.getCurrentToken();
                
                Equality equality = (Equality) parser.getVisitor()
                        .visit(Equality.class, parser, previous);
                program = new Program(previous, equality);
                
                if(!(currentToken instanceof EOT)) {
                    parser.syntacticError("Expected end of program",
                            currentToken.getClass().toString());
                }
                
                return program;
            }
        });
        map.put(Equality.class, new BiFunction<Parser<?>,
            SourcePosition, AST>() {

            @Override
            public AST apply(Parser<?> parser,
                    SourcePosition position) {
                Equality equality = null;
                List<AST> nodes = new ArrayList<>();
                SourcePosition operationPosition =
                    new SourcePosition();
                
                parser.start(operationPosition);
                //parse operation
                AST operation = parser.getVisitor().visit(
                        Operation.class, parser, operationPosition);
                nodes.add(operation);
                if(parser.getCurrentToken() instanceof Equator) {
                    nodes.add(parser.getCurrentToken());
                    parser.forceAccept();
                    nodes.add(parser.getVisitor().visit(
                            Operation.class, parser,
                            operationPosition));
                } else {
                    parser.syntacticError("Expected: equator",
                            Integer.toString(
                                parser.getCurrentToken().getPosition()
                                .getStart()));
                }
                parser.finish(operationPosition);
                
                equality = new Equality(operationPosition, nodes);
                return equality;
            }
            
        });
        map.put(Operation.class, new BiFunction<Parser<?>,
            SourcePosition, AST>() {

            @Override
            public AST apply(Parser<?> parser,
                SourcePosition position) {
                
                Operation operation = null;
                List<AST> nodes = new ArrayList<>();
                SourcePosition operationPosition =
                    new SourcePosition();
                
                parser.start(operationPosition);
                //parse identifier
                AST identifier = parser.getVisitor().visit(
                        Identifier.class,
                        parser, operationPosition);
                nodes.add(identifier);
                //look for operator
                if(parser.getCurrentToken() instanceof Operator) {
                    nodes.add(parser.getCurrentToken());
                    parser.forceAccept();
                    nodes.add(parser.getVisitor().visit(
                            Operation.class,
                            parser, operationPosition));
                }
                parser.finish(operationPosition);
                
                operation = new Operation(operationPosition, nodes);
                return operation;
            }
            
        });
        map.put(Identifier.class, new BiFunction<Parser<?>,
            SourcePosition, AST>() {

            @Override
            public AST apply(Parser<?> parser,
                SourcePosition position) {
                
                Identifier identifier = null;
                List<AST> nodes = new ArrayList<>();
                SourcePosition identifierPosition = new SourcePosition();
                
                parser.start(identifierPosition);
                if(parser.getCurrentToken() instanceof LParen) {
                    nodes.add(parser.getCurrentToken());
                    parser.forceAccept();
                    
                    nodes.add(getHandlerMap().get(Operation.class)
                            .apply(parser, identifierPosition));
                    parser.accept(Operation.class);
                    
                    nodes.add(parser.getCurrentToken());
                    parser.accept(RParen.class);
                } else if(parser.getCurrentToken()
                    instanceof Nominal) {
                    nodes.add(parser.getCurrentToken());
                    parser.forceAccept();
                } else if(parser.getCurrentToken()
                    instanceof Numeric) {
                    nodes.add(parser.getCurrentToken());
                    parser.forceAccept();
                } else {
                    parser.syntacticError(
                            "Nominal or numeric token expected",
                            parser.getCurrentToken().getClass()
                                 .toString());
                }
                parser.finish(identifierPosition);
                identifier =
                    new Identifier(identifierPosition, nodes);
                
                return identifier;
            }
            
        });
    }
    
    @Override
    public Map<Class<? extends AST>, BiFunction<Parser<?>,
        SourcePosition, ? extends AST>> getHandlerMap() {
        return map;
    }
    
}

For all of the code, you’ll note that there’s only one method. getHandlerMap() returns a map, intrinsic to the PhraseStructure, which maps classes (of any extension of AST) to functions which return them. These functions, specifically BiFunctions, accept only a Parser, with all of its delicious utility methods, and a SourcePosition so that they have an idea where they’re looking. All necessary data is in those two items alone.

A Note on Source Position

If you’ve been paying very close attention, you may have noticed that SourcePosition isn’t strictly necessary to translate. You’re right, mostly; but when something goes wrong, it is SourcePosition which tells you where the problem showed up, and what you need to tinker with in order to properly format the program.

It wasn’t always like this. Early compilers would simply indicate that the program was misformatted; more likely, they would just print “ERROR”, as the notion of software development (which didn’t involve punching holes in punch cards) was relatively young, and ethics weren’t really a thing yet.

This wasn’t a big deal, while programs were generally only a few lines and had exceedingly small lexicons of keywords. When Grace Murray Hopper put together A-0, the idea of adding sophisticated error reporting would have seemed like over-programming; mostly because it would have been over-programming.

As time went on, and machines got more sophisticated, having an error in your code could take days to find. If you had more than one error, then you were really in trouble. So, eventually, a team came up with the idea of reporting the exact point where the format failed, and history was made. (I’m not sure who that was, so if anyone knows, please inform me through the comments.)

Today, every well-designed AST is aware of exactly where it, or its constituents, begin and end. If you want to be especially sophisticated, you can have it remember line number, and even character number, too.

Our current edition of our algebra-like language is generally one line long and relatively domain-specific, but remembering where the ASTs go wrong provides room for growth.
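The article’s SourcePosition tracks only start and finish token indices. If you do want line and character numbers, the bookkeeping is light; this TrackedPosition class is my own hypothetical sketch, not part of the project, and it only needs to watch for newlines as raw characters stream past:

```java
/**
 * Sketch of a line- and column-aware position, a hypothetical extension
 * of the project's SourcePosition (which stores only token indices).
 */
public class TrackedPosition {
    private int start, finish;        // token indices, as in SourcePosition
    private int line = 1, column = 1; // human-readable coordinates

    public void setStart(int start) { this.start = start; }
    public void setFinish(int finish) { this.finish = finish; }

    /** Advance the line/column counters past one character of raw source. */
    public void advance(char c) {
        if (c == '\n') {
            line++;
            column = 1;
        } else {
            column++;
        }
    }

    public int getLine() { return line; }
    public int getColumn() { return column; }

    /** Human-friendly description for error reports. */
    public String describe() {
        return "line " + line + ", column " + column
                + " (tokens " + start + ".." + finish + ")";
    }
}
```

An error report can then print describe() instead of bare token indices.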

Visit Handlers

If you don’t remember specifically, Visitor’s visit is not a complicated method.

public default AST visit(Class<? extends AST> element,
        Parser<?> parser, SourcePosition position) {
    AST ast = getHandlerMap().get(element).apply(parser, position);
    return ast;
}

It simply retrieves the map, grabs the BiFunction associated with the provided element, and applies it to the parser and an initial source position. From there, all work goes on in the map.
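Stripped of the project’s types, the mechanism is nothing more than a map from a class token to a factory function. A toy version, with a hypothetical Node hierarchy standing in for AST:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Toy version of the handler-map dispatch behind visit(): a class
 *  token selects the function that builds that kind of node.
 *  (Node and Num are hypothetical stand-ins for the AST classes.) */
public class Dispatch {
    public interface Node {}

    public static class Num implements Node {
        public final int value;
        public Num(int value) { this.value = value; }
    }

    private final Map<Class<? extends Node>,
            Function<String, ? extends Node>> handlers = new HashMap<>();

    public Dispatch() {
        // One handler per node class, just as the PhraseStructure
        // registers one BiFunction per AST class.
        handlers.put(Num.class, s -> new Num(Integer.parseInt(s)));
    }

    /** The equivalent of visit(): look up the handler and apply it. */
    public Node visit(Class<? extends Node> target, String input) {
        return handlers.get(target).apply(input);
    }
}
```

The real version takes a Parser and a SourcePosition rather than a String, but the lookup-then-apply shape is identical.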

The visit handlers themselves can get pretty messy, if you aren’t careful. They begin by initializing their specific brand of AST to null. A NullaryAST or an Optional might be better here, as I have a serious aversion to methods that can return null, but I haven’t made that change yet. This AST is the item which will be initialized through the context of local nodes.

Next, a SourcePosition is initialized. This will be the element passed to the constructor for our AST. When Parser.start(SourcePosition) is called, it updates the starting point of SourcePosition. When Parser.finish(SourcePosition) is called, it updates the end point. These are set to Parser’s currently known coordinate in the code. Thus, before anything else is done, Parser.start(…) is called.

After the SourcePosition has been started, the class of each token is checked against allowed conditions. As such, the bulk of these methods are conditionals. It’s here that I must explain the usage of Parser.accept(…) and Parser.forceAccept().

Parser.accept(…) checks the class of the current token against the provided one, and if they match, increments the internal pointers. If not, it reports a syntactic error and leaves the pointers alone. Since the pointer is left alone, additional nodes can still be parsed, and multiple errors can be caught, even in the case of a token simply being missing or skipped. Parser.forceAccept() always accepts the current node, regardless of its type, and increments the pointers. (In fact, it is called from within accept(…) after the conditional checks are completed.)
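That recovery behavior is easy to show in miniature. This TokenCursor is a hypothetical stand-in operating on plain token-kind strings, not the project’s Parser; the point is that a mismatch is recorded while the cursor stays put, so later expectations can still be tested:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of accept()/forceAccept() over token-kind
 *  strings, to show the multiple-error recovery described above. */
public class TokenCursor {
    private final List<String> tokens;
    private final List<String> errors = new ArrayList<>();
    private int index = 0;

    public TokenCursor(List<String> tokens) { this.tokens = tokens; }

    /** Advance only on a match; otherwise record an error and stay put,
     *  so the same token can still satisfy a later expectation. */
    public void accept(String expected) {
        if (index < tokens.size() && tokens.get(index).equals(expected)) {
            forceAccept();
        } else {
            errors.add("expected " + expected + " at token " + index);
        }
    }

    /** Unconditionally advance past the current token. */
    public void forceAccept() { index++; }

    public List<String> getErrors() { return errors; }
    public int getIndex() { return index; }
}
```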

Once all possibilities have been checked for, the AST is initialized and returned. If at any point no possibilities remain for a token, a syntax error is reported, and the program continues to parse (even though it can no longer complete the tree).

Is There Another Way to Do This?

There’s always another way, but that doesn’t mean that it’s necessarily better. One method might be catching customized exceptions on a distinct schedule, which also works pretty well; the downside is that it only allows for the detection of a single error at a time. Another would be the construction of a string representing the AST types, and the use of a regular expression on it; but as I’ve said before, improper code, even if it compiles, can produce devastatingly slow regular expressions at seemingly arbitrary times.
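For concreteness, the second alternative might be sketched like this (the token-kind names are hypothetical, and note that it inherits the performance caveat just mentioned):

```java
import java.util.List;
import java.util.regex.Pattern;

/** Sketch of the rejected alternative: flatten the token stream into a
 *  string of type names and validate it with a regular expression. */
public class TypeStringCheck {
    // An operation: identifiers separated by operators (hypothetical kinds).
    private static final Pattern OPERATION =
            Pattern.compile("Identifier( Operator Identifier)*");

    public static boolean isOperation(List<String> kinds) {
        return OPERATION.matcher(String.join(" ", kinds)).matches();
    }
}
```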

I’ve experimented with both on the way to this code, which is precisely why writing this section took so much longer than the others. There are probably dozens of other readily available methods which I haven’t even thought of yet. One of them, somewhere, might even be faster or sufficiently more effective than the visitor pattern.

This is not me saying that the visitor pattern is perfect, either. This implementation of visitor has a lot of marks against it. It is extremely tightly coupled, for starters, as loose as the interface alone may be. It uses “instanceof” all over the place, which begs for the implementation of further methods to keep to an OOP standard. It has many anonymous classes around, which substantially increase the memory footprint. The slightest of mistakes in the layout of the visitor functions will result in an unbounded recursion, which will quickly and almost silently crash your program, so it is not recommended for avant garde programming—always start with a properly reduced Backus Naur Form of your language. I could go on, such as with the many potential issues with secondary delegation, which the visitor pattern survives on, but this more than covers it.

My advice? Use ethical method names, comment until your fingers bleed, trigger exceptions everywhere the program pointer shouldn’t be, and benchmark, benchmark, benchmark. In select cases, the Visitor is still your friend, provided that you treat it like a sacred relic wired to a bomb.

Final Notes

You may notice that this is awfully similar to the enumeration used for scanning. You can, in fact, create a Scanner from a Parser, by treating every character as an individual token. However, this has not been done in a long time, as regular expressions are quite reliable for cases like this. I may yet build a Scanner out of a Parser, but only as an example; it is not something I recommend.

You can think of the individual differences between one language and another as the impulse behind these enumerations and mappings. Parser will always be Parser; PhraseStructure will always be PhraseStructure. However, when you need to compile a specific language into an AST, the features that make that language what it is can all be stored in the enumerations and maps. Because of this, the API allows for rapid construction of builders.

Next, we talk about identification tables.


So What in Hell am I Doing???

Okay, it’s been like a week since the last chunk in Software Language Engineering. It’s going to be at least a week before the next update, too, which will—spoiler alert—also involve altering some of the last edition for the sake of clarity and code ethics.

So why have I been taking so freakin’ long with it?

A bunch of things, which I will have the dignity to specifically name, came up. One, writing. I’m still a little slow at it at the moment, but I had a bunch of great ideas that had to go in the books. While I’m typically hesitant to stop with a Java project for any length of time, lest I forget what I was doing, the same rule tragically applies to my other love, literature.

On top of that, I have begun a brief project in game development which has not been chronicled for the blog. It is, on the bright side, finished already. (It is a lot easier than writing a compiler, let alone a compiler/interpreter framework.) It might eventually make it into another tutorial, after I’m finished with my new-and-improved compiler tutorial, as this could easily extend into applying GLSL through Java.

Anybody who has ever been a solopreneur or even a hobbyist programmer knows that this is already enough to set me off balance, and I wish that was it, but it unfortunately is not. Not a million distractions, just a few. I’ve also done a lot of cooking lately for my girlfriend’s birthday, inclusive of a pistachio-orange chiffon birthday cake, tiramisu (as of tonight), and other fancied things such as a Cajun-Indian fusion-food split pea soup. One of these days, I might put those up on the blog, too; but they aren’t sufficiently chronicled and scripted into recipes. I’ve made a few interesting friends on Google+, which have led me to a sequence of (cool headed) debates, discussions, and for some, half-hour-long lectures on such things as the Michelson-Morley experiment and protolanguages & linguistic anthropology. (Mostly history related things.)

The worst part is, I really have been enjoying myself. I do have to apologize, as I cannot, with a straight face, say that the delay has been due to an inconvenience; only a set of distractions. At the same time, I can promise you this: the next lecture will introduce a method of circumventing common applications of reflection using a combination of maps and enumerations; which are very handy when you are dealing with a static set of classes which cannot be expected to change. I will have a fully functioning AST parser by my next chapter, concluding what I consider to be “part one” of the tutorial. Part two will be about taking that syntax tree and transforming it into another language, be it another protocol or a machine code. (I will not be discussing the bulk of machine commands for an amd64 processor, probably ever; but there’s a chance I’ll talk about the structure of portable executables. That’s .exe files—pronounced “eksies”. I might touch on a few other binaries, too.) I can also promise you that I’m going to enjoy bringing it to you.

It might be another day or two before I can dive into the code again, full-force; but I’ll have it together soon. There is a point in one’s knowledge of a subject, at which it can only be furthered through sharing that knowledge with others. I’ve been there for a while in this area.

Thanks for all of your patience!

—Mick


Posted by on December 2, 2014 in State of the Moment


Software Language Engineering: Establishing a Parser, Part One (Early Edition, Refurbished)

The Introductory Ramble

I suppose I’ve been building up to this thing for quite a while now.

Note that this is part one of the Parser. Were this a singular project, I might be long-since finished with it; tragically I’m not that much of a jerk and I’m trying to write a useful tutorial too. This means I need to get it right, or as right as possible, the first time; so please pardon delays in bringing it to you. I’m also avoiding items like JLex so that I can convey exactly how a build tool is meant to be structured, for the sake of education.

You see, back in the days when it was a responsible thing to call oneself, I used to be a journalist. Specifically, I was a field journalist and interviewer for something called a “newspaper”, which was an antiquated piece of low-grade paper with news printed on it. It was a strange and inconvenient system, but it was what we had before the internet. Anyway, both language and ethics are important to me out of habit, which is also perhaps why I’m not a news journalist anymore.

As a small foreword, there are a lot of things which we must keep in mind while developing a general-purpose parser, as the operations of a Parser on a List<AST> are a bit of a juggle. In spite of the “One Class, One Responsibility” principle, the environment that a Parser is typically injected into requires a great deal of attention. As such, I will be splitting this section into two parts, with a blurb in the middle on the visitor pattern (which you are welcome to skip if you are already familiar with it), and a blurb on using JavaFX to render graphical ASTs. The second piece will be a completion of the parser design. (Point being, once you start reading, expect to be reading for quite a bit.)

The Parser

To begin, we have a List<AST> full of all Terminals in the program, in order. NonTerminals are where a lot of the magic happens, and each of them contains several nodes which may be either Terminals, or further NonTerminals. The biggest concern with any translator is efficiency in building this tree.

I have attempted several methods in the construction of this tree. One of them involved repurposing java.util.regex, but having tried it, I now recommend against it. There are too many temptations to use forward references and a number of other tools, and they can have serious caveats. As much as I usually praise regexes for their speed, machine-generated ones are only usually fast. Making the structure of the regular expression depend on the structure of the code is an invitation to disaster; by which I mean, consumer remorse.

In the end, the visitor pattern won out; but if you would like to attempt it in another way, I encourage it. I’ll be documenting the visitor pattern here, as it assures a single pass through the target code.

Before I go further, let’s begin with the source code to Parser.

import java.util.List;

import oberlin.builder.parser.ast.AST;
import oberlin.builder.parser.ast.EOT;
import oberlin.builder.visitor.Visitor;

/**
 * @author © Michael Eric Oberlin Dec 15, 2014
 *
 * @param <V> the visitor type that the parser
 * uses for creating nonterminal nodes
 */
public abstract class Parser<V extends Visitor> {
    
    private final List<AST> astList;
    private ErrorReporter reporter = new ErrorReporter();
    private AST currentToken;
    private SourcePosition currentTokenPosition = new SourcePosition();
    private SourcePosition previousTokenPosition = new SourcePosition();
    protected V visitor;
    
    public Parser(V visitor, List<AST> astList, ErrorReporter reporter) {
        this.visitor = visitor;
        
        //Do a little defensive programming
        if(astList.isEmpty())
            throw new RuntimeException("AST list cannot begin at zero size");
        
        //Scan for a specific reporter, or use the default error reporter. Of
        //course error reporter can't really be null.
        if(reporter != null)
            this.reporter = reporter;
        this.astList = astList;
        
        this.currentToken = astList.get(0);
    }
    
    /**
     * Checks whether the current node is of the expected type; if so,
     * increments the token; otherwise, throws a syntactic error.
     * 
     * @param astExpected the currently anticipated node type in the list
     */
    public void accept(Class<? extends AST> astExpected) {
        if(astExpected.isAssignableFrom(currentToken.getClass())) {
            forceAccept();
        } else {
            reporter.error(new SyntaxException("Expected " +
                astExpected + ", got " + currentToken + " instead; "));
        }
    }
    
    public void forceAccept() {
        previousTokenPosition = currentTokenPosition;
        currentTokenPosition = currentTokenPosition.increment();
        try {
            currentToken = astList.get(currentTokenPosition.getStart());
        } catch(IndexOutOfBoundsException ex) {
            currentToken = new EOT();    //end of tree
        }
    }
    
    /**
     * Records the position of the beginning of a phrase.
     * This is the position of first constituent AST.
     * @param position element to record the begin index into.
     */
    public void start(SourcePosition position) {
        position.setStart(currentTokenPosition.getStart());
    }
    
    /**
     * Finish records the position of the end of a phrase.
     * This is the position of the last constituent AST.
     * @param position element to record the end index into.
     */
    public void finish(SourcePosition position) {
        position.setFinish(currentTokenPosition.getFinish());
    }
    
    /** utility method for reporting syntax errors */
    public void syntacticError(String messageTemplate,
        String tokenQuoted) {
        SourcePosition pos = currentTokenPosition;
        reporter.error(new SyntaxException(
                tokenQuoted + " " + messageTemplate + ": " +
                pos.getStart() + ".." + pos.getFinish()));
    }
    
    /**
     * Begin parsing, aiming to create the provided class
     * as a root class for the abstract syntax tree.
     * 
     * @param rootClass Class of object which should, provided
     * no exceptions, be a tree root.
     * @return complete tree, stemming from class rootClass,
     * expressing program.
     */
    public AST parse(Class<? extends AST> rootClass) {
        return visitor.visit(rootClass, this, currentTokenPosition);
    }
    
    /**
     * @return the language-specific PhraseStructure (visitor) class
     * backing this parser; implemented by each concrete subclass.
     */
    public abstract Class<V> getPhraseStructure();

    public SourcePosition getPreviousTokenPosition() {
        return this.previousTokenPosition;
    }
    
    public SourcePosition getCurrentTokenPosition() {
        return this.currentTokenPosition;
    }

    public AST getCurrentToken() {
        return currentToken;
    }
    
    public V getVisitor() {
        return visitor;
    }
    
    public ErrorReporter getErrorReporter() {
        return reporter;
    }

}

It all begins with a call to parse(…), passing in the class that represents a completed program. parse(…) enters the visitor with a reference to the type of node needed, in order to deliver a properly formatted “rootClass”. I’m going to abstain, for now, from explaining how the visitor works; that’s covered in another section. For the moment, consider the various utility methods in Parser.

It contains the original list of ASTs, presumably all Terminals, and a number of pointer fields. The List<AST> is technically outside data, passed in so that a tree can be formed from it. In the class’s current state, it is very important to remember that a single Parser instance cannot be used on more than one chunk of code at once.

An Algebraic Parser

The intention, just as with a scanner, is to make life as simple for the end programmer as possible. So, AlgebraicParser isn’t that big a deal to implement.

package oberlin.algebra.builder.parser;

import java.util.List;

import oberlin.builder.parser.ErrorReporter;
import oberlin.builder.parser.Parser;
import oberlin.builder.parser.ast.AST;

public class AlgebraicParser extends
        Parser<AlgebraicPhraseStructure> {

    public AlgebraicParser(AlgebraicPhraseStructure visitor,
            List<AST> astList, ErrorReporter reporter) {
        super(visitor, astList, reporter);
    }

    @Override
    public Class<AlgebraicPhraseStructure> getPhraseStructure() {
        return AlgebraicPhraseStructure.class;
    }

}

Most of the work is done by AlgebraicPhraseStructure, an extension of an interface I’ve named PhraseStructure, which extends Visitor. Accordingly, I’ll be addressing it in full in part two. The take-away is that PhraseStructure encapsulates all of the constraints on how a command is properly formed.

Well, it forms no critique of ethics, but it at least worries about its readability.

The Take-Away

One, but hardly the only, major confusion in Parser construction is the encapsulation of the act of parsing, versus the encapsulation of the rules of parsing. They are, ultimately, two different things requiring two different programs (or in Java’s language, classes).

Parser returns a singular AST, which is formed from a number of others. Each of those is formed from a list of still more unique nodes. We have a specific advantage here: the enforced uniqueness of a Java object. Once the tree is completed, the remaining operations on it, such as translating it to another kind of tree (varying in language), are fascinating but relatively trivial.

Next, read about the visitor pattern, and consider reading about the GUI interface built for this; then we’ll discuss the workhorse of the scanning and parsing utility—the phrase structure.


Software Language Engineering: And Again We’re Building a Scanner Again (Early Edition)

Remember how in part one, I said that there was one final change to make to the Scanner? Here it is. Now that you know about BNF, ASTs, Nonterminals, and Terminals, I can comfortably show you my completed Scanner.

The Big Difference

The difference here is that this Scanner will not return a List<String>, so much as a List<AST> full of Terminals. Each Terminal will contain one token, that is, the String that used to be in its place. It really is a minor structural change, but it is an important one.

The only other option is to scan through the list of Strings again, match each of them, and produce a List<AST>; and that’s just inefficient. We should not need to iterate through the list twice when we can get the job done with a single pass. I have rules about efficient code, and so should you.
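The splice that makes the single pass possible, replacing one scanned entry with the tokens it split into, is simple enough to show in isolation; this is a standalone sketch of the same helper the Scanner carries as replaceItemWithCollection:

```java
import java.util.List;

/** Standalone sketch of the list splice used by the single-pass scan:
 *  one entry is replaced, in place, by the tokens it split into. */
public class Splice {
    public static <E> void replaceItemWithCollection(
            List<E> list, int entry, List<E> replacement) {
        list.remove(entry);
        list.addAll(entry, replacement);
    }
}
```

Because the replacement lands at the same index, the scan can keep walking forward over freshly produced tokens without a second pass.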

The Scanner (for real this time)

package oberlin.builder.scanner;

import java.util.*;
import java.util.function.*;
import java.util.logging.*;

import oberlin.builder.*;
import oberlin.builder.parser.ast.*;

/**
 * New design for scanners. Requires an enumeration of type
 * TerminalSpelling to iterate over, and a code string.
 * Returns a list of tokens.
 * 
 * @author © Michael Eric Oberlin Nov 2, 2014
 *
 */
public interface Scanner<E extends Enum<E> & TerminalSpelling> extends Function<AST, List<AST>> {
    @Override
    public default List<AST> apply(AST code) {
        Logger logger = Logger.getLogger("Scanner");
        
        //Start with a list of tokens, with the singular pre-token of the bulk of code
        List<AST> tokens = new LinkedList<>();
        tokens.add(code);
        
        //For each item found in the code, parse it, and replace the contents of tokens
        //with the parsed contents
        for(int index = 0; index < tokens.size(); index++) {
            //maintain for check
            AST origToken = tokens.get(index);
            
            List<AST> newTokens = new ArrayList<>();
            for(TerminalSpelling terminalSpelling : getSpelling().getEnumConstants()) {
                
                try {
                    newTokens.addAll(terminalSpelling.matchToken(tokens.get(index)));
                    
                    replaceItemWithCollection(tokens, index, newTokens);
                    break;
                } catch (MismatchException e) {
                    //didn't match, so continue to the next item
                    continue;
                }
            }
            
            //Final defensive check: if one token was received, and one token provided, and
            //the old is not the same as the new, then the only thing that happened was the
            //removal of irrelevant data. Thus, index should be decremented so that this new
            //item may be scanned again.
            if(newTokens.size() == 1 && !newTokens.get(0).equals(origToken)) index--;
        }
        
        //This algorithm will always terminate the token list with an empty token, the one item
        //which cannot have semantic value. So, remove it here.
        tokens.remove(tokens.size() - 1);
        
        return tokens;
    }
    
    /**
     * Internally used method which replaces an entry in a provided list with a collection.
     * 
     * @param list list to be altered
     * @param entry numeric index of replaced entry
     * @param replacement item to insert in place of current data
     */
    static <E> void replaceItemWithCollection(List<E> list, int entry, Collection<E> replacement) {
        list.remove(entry);
        list.addAll(entry, replacement);
    }
    
    /**
     * 
     * @return TerminalSpelling allocated to token recognition.
     */
    public Class<E> getSpelling();
}

So that’s your Scanner, almost exactly the same, but now it uses an enumeration of a type called TerminalSpelling. Its structure might also look a little familiar.

package oberlin.builder;

import java.util.LinkedList;
import java.util.List;
import java.util.regex.*;

import oberlin.builder.parser.ast.*;

public interface TerminalSpelling {
    
    public default List<AST> matchToken(AST ast) throws MismatchException {
        List<AST> returnable = new LinkedList<>();
        
        Pattern pattern = getPattern();
        Matcher matcher = pattern.matcher(ast.toString());
        if(matcher.find()) {
            returnable.addAll(manageToken(matcher));
            returnable.add(new Terminal(ast.toString().substring(matcher.end())));
            return returnable;
        } else throw new MismatchException("String \"" + ast +
            "\" does not match grammar pattern \"" + pattern + "\"");
    }
    
    public Pattern getPattern();
    
    /**
     * Determines, often through enumeration self-inspection, what should be done
     * with a passed token.
     * 
     * @param token the original token removed from the complete String by matchToken
     * @return appropriate value given the circumstances and implementation of TerminalSpelling.
     */
    public List<AST> manageToken(Matcher matcher);
}
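
To see the contract of matchToken in isolation: the anchored pattern bites off the front of the spelling, manageToken decides what to keep, and the unmatched remainder is re-wrapped for the next pass. Here is a self-contained sketch of that front-biting behavior, using plain Strings rather than the AST classes (the bite helper is hypothetical, not part of the builder):

```java
import java.util.regex.*;

public class MatchSketch {
    // Split a spelling into the matched head and the unmatched tail,
    // the way matchToken does with its "^"-anchored pattern.
    static String[] bite(Pattern pattern, String spelling) {
        Matcher matcher = pattern.matcher(spelling);
        if (!matcher.find()) throw new IllegalArgumentException("no match");
        return new String[] { matcher.group(), spelling.substring(matcher.end()) };
    }

    public static void main(String[] args) {
        Pattern numeric = Pattern.compile("^\\d+");  // same shape as NUMERIC below
        String[] parts = bite(numeric, "42+x");
        System.out.println(parts[0] + " | " + parts[1]);  // 42 | +x
    }
}
```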

And how does this new structure relate to our pilot programming language, the Algebraic Simplifier?

package oberlin.algebra.builder.scanner;

import java.util.List;

import oberlin.algebra.builder.AlgebraicSpelling;
import oberlin.builder.scanner.Scanner;

public class AlgebraicScanner implements Scanner<AlgebraicSpelling> {
    
    @Override
    public Class<AlgebraicSpelling> getSpelling() {
        return AlgebraicSpelling.class;
    }

}

Cake.

As for the TerminalSpelling enumeration?

package oberlin.algebra.builder;

import java.util.LinkedList;
import java.util.List;
import java.util.logging.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import oberlin.builder.TerminalSpelling;
import oberlin.builder.TerminalSpellingHandler;
import oberlin.builder.parser.ast.AST;
import oberlin.algebra.builder.nodes.*;

/*
 * Major changes:
 * For starters, if a chunk of the string matches a pattern, then that group
 * needs to be returned as the spelling of a token.
 * <Specific Atomic Type> ← Token ← Terminal ← AST
 * 
 * All tokens should contain a reference to their regular expression/grammar.
 * 
 */
public enum AlgebraicSpelling implements TerminalSpelling {
        //COMMENTS
        WHITESPACE(Pattern.compile("^\\s+"), GrammarType.COMMENT, new TerminalSpellingHandler<Whitespace>(){

            @Override
            public Whitespace getTerminal(String spelling) {
                return new Whitespace(spelling);
            }
            
        }),
        BLOCK_COMMENT(Pattern.compile("^/\\*.*?\\*/"), GrammarType.COMMENT, new TerminalSpellingHandler<BlockComment>(){

            @Override
            public BlockComment getTerminal(String spelling) {
                return new BlockComment(spelling);
            }
            
        }),
        LINE_COMMENT(Pattern.compile("^//.*+$"), GrammarType.COMMENT, new TerminalSpellingHandler<LineComment>(){

            @Override
            public LineComment getTerminal(String spelling) {
                return new LineComment(spelling);
            }
            
        }),
        
        //VALIDS
        NOMINAL(Pattern.compile("^[\\D&&\\w]\\w*"), GrammarType.KEEP, new TerminalSpellingHandler<Nominal>(){

            @Override
            public Nominal getTerminal(String spelling) {
                return new Nominal(spelling);
            }
            
        }),
        NUMERIC(Pattern.compile("^\\d+"), GrammarType.KEEP, new TerminalSpellingHandler<Numeric>(){

            @Override
            public Numeric getTerminal(String spelling) {
                return new Numeric(spelling);
            }
            
        }),
        OPERATOR(Pattern.compile("^[+\\-/\\\\÷\\*×\\^]"), GrammarType.KEEP, new TerminalSpellingHandler<Operator>(){

            @Override
            public Operator getTerminal(String spelling) {
                return new Operator(spelling);
            }
            
        }),
        EQUATOR(Pattern.compile("^!?=?[=><]"), GrammarType.KEEP, new TerminalSpellingHandler<Equator>(){

            @Override
            public Equator getTerminal(String spelling) {
                return new Equator(spelling);
            }
            
        }),
        
        //DELIMITERS
        LPAREN(Pattern.compile("^\\("), GrammarType.KEEP, new TerminalSpellingHandler<LParen>(){

            @Override
            public LParen getTerminal(String spelling) {
                return new LParen(spelling);
            }
            
        }),
        RPAREN(Pattern.compile("^\\)"), GrammarType.KEEP, new TerminalSpellingHandler<RParen>(){

            @Override
            public RParen getTerminal(String spelling) {
                return new RParen(spelling);
            }
            
        })
    ;
    
    //PRIVATE FIELDS
    private Logger logger = Logger.getLogger("AlgebraicSpelling");
    
    //ENUMERATIONS
    private enum GrammarType {
        KEEP,
        COMMENT;
    }
    
    //FIELDS
    private final Pattern pattern;
    private final GrammarType type;
    private final TerminalSpellingHandler<?> handler;
    
    //CONSTRUCTORS
    /**
     * The two underlying details for any syntax pattern are its regular expression, and its semantic meaning.
     * The pattern is simply a compiled regular expression for matching the item. The type is a clue to its
     * semantic meaning, which is checked in a switch-case statement when called.
     * 
     * @param pattern compiled regular expression identifying the token
     * @param type clue as to what should be done with the token once found
     */
    private AlgebraicSpelling(Pattern pattern, GrammarType type, TerminalSpellingHandler<?> handler) {
        this.pattern = pattern;
        this.type = type;
        this.handler = handler;
    }
    
    //GETTERS/SETTERS
    @Override
    public Pattern getPattern() {
        return pattern;
    }

    @Override
    public List<AST> manageToken(Matcher matcher) {
        List<AST> ret = new LinkedList<>();
        switch(this.type) {
        case KEEP:
            ret.add(handler.getTerminal(matcher.group()));
            break;
        case COMMENT:
            //Just ignore it
            break;
        }
        return ret;
    }

}

A bit longer, but as you can see, still very simple. Before I address the minor details reserved for a language implementation, allow me to give you the trivial definition of TerminalSpellingHandler:

package oberlin.builder;

public interface TerminalSpellingHandler<E extends Terminal> {
    public E getTerminal(String spelling);
}

Barely even worth a class, really. It could easily just be a java.util.function.Function<String, E extends Terminal>. It remains a candidate for expansion, should you need it; and unlike Function, can easily be altered to throw a non-runtime exception. (Should you find yourself needing to throw such a thing.)
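
To make that concrete, here's what the Function-based alternative would look like; the Terminal stand-ins are minimal sketches of the post's classes, and a constructor reference does the whole job of the handler:

```java
import java.util.function.Function;

public class HandlerAsFunction {
    // Minimal stand-ins for the post's Terminal hierarchy, just for illustration.
    static class Terminal {
        final String spelling;
        Terminal(String spelling) { this.spelling = spelling; }
    }
    static class Numeric extends Terminal {
        Numeric(String spelling) { super(spelling); }
    }

    public static void main(String[] args) {
        // TerminalSpellingHandler<Numeric> collapses to a Function<String, Numeric>.
        Function<String, Numeric> handler = Numeric::new;
        Terminal t = handler.apply("42");
        System.out.println(t.spelling);  // 42
    }
}
```

The trade-off is exactly as stated above: Function's apply cannot declare a checked exception, whereas a hand-rolled interface can.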

So, by my recommended structure, each enumeration constant contains three simple things: a java.util.regex.Pattern describing how its token appears (always beginning with the metacharacter “^” to denote the beginning of the String), an enumeration describing what to do with the token once it’s found, and a functional interface that acquires a Terminal from the token. I will concede that some of these items are candidates for defaults in the interface; but as you know, not all languages are structured the same, and for the moment, I leave this to you.

Our End Result

Ultimately, we have a class that returns a List&lt;AST&gt; full of Terminals, which is exactly what our Parser works with. It avoids redundant calls and strips away extra levels of processing, making for a much cleaner Θ-notation.

Additionally, the class captures almost every recurrent portion of Scanner that I can think of, and is easily and quickly extended for whatever your target language might be. This is the greater service of a properly designed class—it takes work and stress off of the programmer. It is likewise going to make the next step much easier.

Now we build us a Parser.

 
 


Software Language Engineering: Terminals, Nonterminals, and Bears (Early Edition)

The Notion of a Terminal AST

So, in the last chapter, I explained what an AST was structurally. There are formally two kinds of extensions to it. I usually implement them in their own classes, extending the base class of AST.

The first is the terminal. If you’ve programmed in Java for even a month, you know that having a method which accepts two different kinds of unrelated classes in the stead of one another is a bad idea for all kinds of reasons.

It is, actually, possible.

That just doesn’t mean that you should do it.

ASTs are formally called trees, but what they are is nodes on a tree. A program is a single AST, with an extension typically called “Program” as the overarching root node. The branch nodes, or nodes with child nodes, are called non-terminal nodes; the end points are called terminal nodes.

Each of those tokens that your Scanner kicked out? Yeah, that’s a terminal node, disguised as a String.

Let me offer you some of my code for ASTs, Terminals, and NonTerminals. (As before, there are major issues that I’m leaving out until later on. See if you can catch them.)

package oberlin.builder.parser.ast;

import oberlin.builder.codegenerator.RuntimeEntity;
import oberlin.builder.visitor.*;

/**
 * Abstract Syntax Tree, capable of representing any sequence of 
 * statements or the entire program.
 * 
 * @author © Michael Eric Oberlin Nov 3, 2014
 *
 */
public interface AST {
    /**
     * @return number of sub-elements contained in this tree node.
     */
    public int getElementCount();
}
package oberlin.builder;

import oberlin.builder.parser.ast.AST;

/**
 * Basis of all complete abstract syntax trees. Terminals are basically isolated tokens known only by their spellings.
 * 
 * @author © Michael Eric Oberlin Nov 5, 2014
 *
 */
public class Terminal implements AST {
    private final String spelling;
    
    public Terminal(String spelling) {
        this.spelling = spelling;
    }
    
    public final String getSpelling() {
        return this.spelling;
    }

    @Override
    public String toString() {
        return getSpelling();
    }

    @Override
    public final int getElementCount() {
        return 1;
    }
}

Let those soak in a little. Note that each Terminal has a length of one, meaning that it is the only member of its immediate tree. That will be important when we develop our Parser.

A terminal is an instance of an AST, and can be created by simply passing its token to it. The token is stored in the field “spelling”. Terminal is also fully extensible: even though a token is consistently its only member, there is a significant difference between an equals operator, a binary operator, and numerical data; and nonterminal nodes take that difference quite seriously.

The Notion of a Nonterminal AST

A nonterminal AST is an AST built not from characters in a String, but from a sequence of other ASTs. The constituent ASTs can be terminals, or nonterminals. Remember BNF? Every item listed before a may-consist-of (::=) was a nonterminal. In instances of single-member representation, such as:

Noun ::= ProperNoun

“Noun” is still a nonterminal, as the implication is that (at least in theory) multiple types of items can be its constituents.

The “Program” node is of course always a nonterminal. I’ve written a nice slice of code for them, too.

package oberlin.builder;

import java.util.*;

import oberlin.builder.parser.ast.AST;

public abstract class NonTerminal implements AST {
    private final List<AST> astList;
    
    public NonTerminal(AST... astList) throws MismatchException {
        if(!checkTypes(astList)) throw new MismatchException("Nonterminal class " + this.getClass() + " does not match " +
                "expression.");
        
        List<AST> list = new ArrayList<>();
        for(AST ast : astList) {
            list.add(ast);
        }
        this.astList = list; 
    }
    
    public NonTerminal(List<AST> astList) throws MismatchException {
        try {
            this.astList = resolveTypes(astList);
        } catch(BuilderException ex) {
            throw new MismatchException(ex);
        }
    }
    
    public abstract List<Class<? extends AST>> getExpectedASTTypes();
    
    /**
     * Check to see that all provided ASTs are some extension of the expected class of AST,
     * and create the internal list of ASTs from it if possible.
     * 
     * @param astList current list of program ASTs 
     * @return the provided ASTs, in order, once each is confirmed assignable to its expected type
     * @throws BuilderException if a provided AST does not match its expected type
     */
    private List<AST> resolveTypes(List<AST> astList) throws BuilderException {
        List<AST> ownASTs = new ArrayList<>();
        List<Class<? extends AST>> astTypes = getExpectedASTTypes();
        
        for(int i = 0; i < astTypes.size(); i++) {
            Class<? extends AST> provided, expected;
            provided = astList.get(i).getClass();
            expected = astTypes.get(i);
            if(!expected.isAssignableFrom(provided)) {
                throw new BuilderException("Cannot get " + expected + " from " + provided);
            }
            ownASTs.add(astList.get(i));
        }
        return ownASTs;
    }
    
    /**
     * Check to see that all provided ASTs are some extension of the expected class of AST.
     * 
     * @param astList current array of program ASTs 
     * @return true if the first ASTs match the expected ones, false otherwise
     */
    private boolean checkTypes(AST... astList) {
        List<Class<? extends AST>> astTypes = getExpectedASTTypes();
        
        for(int i = 0; i < astList.length; i++) {
            Class<? extends AST> provided, expected;
            provided = astList[i].getClass();
            expected = astTypes.get(i);
            if(!provided.equals(expected)) {
                return false;
            }
        }
        return true;
    }
    
    @Override
    public int getElementCount() {
        return astList.size();
    }
}

This is where the critical detail is left out of my code. If you already see it, then congratulations! In any case, I’ll let you all in on it at the end of the chapter. It goes back to the advantage of regular expressions over manually checking lists.

In “Programming Language Processors in Java”, David A. Watt and Deryck F. Brown use a visitor pattern to scan for these. That’s a perfectly valid approach. I get the same advantages through my constructor-or-exception pattern, which you have seen before. In fact, if you’re careful, you may notice that it’s another format of the same general pattern; without the traditional Visitor and Element classes. In my github repo, you might notice that I have the outline of a Visitor pattern implemented on these. Pay it no mind, it is proving unnecessary.

Still a good idea. Just unnecessary.

What Do They Mean to a Parser?

A Parser’s job is thus. Iterate through a list of Terminal syntax tree nodes. Compare the beginning of the list to established members of a list of nonterminal nodes. Find a match.

Truncate the matched beginning of the list of ASTs; that is, remove every element that is a member of the new NonTerminal. Now that they are removed, append that NonTerminal to the beginning of the list, in their place. Repeat this process until the list of ASTs is of size one, and return that singular AST.

And of course, if the program cannot be parsed down to a singular AST, throw a Parser Error.
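
That description can be written down as a simple reduce loop. The sketch below is purely illustrative, not the builder's Parser: Rule and parse are my names, and plain Strings stand in for the AST classes. Each rule recognizes a leading run of ASTs and collapses it into one replacement.

```java
import java.util.*;

public class ReduceSketch {
    // A "rule" recognizes a fixed leading sequence and names its replacement.
    record Rule(List<String> pattern, String result) {
        boolean matchesHead(List<String> asts) {
            if (asts.size() < pattern.size()) return false;
            return asts.subList(0, pattern.size()).equals(pattern);
        }
    }

    static String parse(List<String> asts, List<Rule> rules) {
        outer:
        while (asts.size() > 1) {
            for (Rule rule : rules) {
                if (rule.matchesHead(asts)) {
                    // Truncate the matched head, then prepend the nonterminal.
                    asts.subList(0, rule.pattern().size()).clear();
                    asts.add(0, rule.result());
                    continue outer;
                }
            }
            throw new IllegalStateException("Parser error: no rule matches " + asts);
        }
        return asts.get(0);
    }

    public static void main(String[] args) {
        List<Rule> rules = List.of(
            new Rule(List.of("the", "Noun"), "Subject"),
            new Rule(List.of("Subject", "Verb", "Noun", "."), "Sentence"));
        List<String> asts = new ArrayList<>(List.of("the", "Noun", "Verb", "Noun", "."));
        System.out.println(parse(asts, rules));  // Sentence
    }
}
```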

Caveats

These are meant to be identified through members of an enumeration; just as tokens were. However, you might remember BNF items such as:

Command ::= Command+

That is, a variable number of Command nodes can be condensed into the immediate tree of a single Command node. This is an issue for a simple list of ASTs, as in its provided implementation, NonTerminal simply attempts to match a fixed sequence of AST types to a provided list of ASTs, one by one. This is almost always inadequate for a programming language.

I’m going to step around writing a whole new series on how regular expressions work. (For now.) The important detail is that a minimum and maximum number of items, extending a specific AST class, need to be matched. I was tempted to call these “Clauses”, but that just barely relates to the actual definition of “clause”; so instead, we’ll borrow from regular expression terminology and call them Groups.

The down side? I’m still implementing Groups. They will have their own chapter though.

Now, we learn how to build a Scanner. (Again. Sort of.)

 


Software Language Engineering: The Abstract Syntax Tree (Early Edition)

What is an Abstract Syntax Tree?

(Let me preface this by saying that syntax trees are hardly the only way to write a translator; but they’re broadly used because of their simplicity, versatility, and flexibility.)

“Abstract” as in abstraction. An abstract syntax tree (henceforth AST) is the manner in which we will be encoding every command of our source and object programs. Think of it as an internal representation of the semantics, derived from the original syntax, and encodable in the final syntax.

The biggest trick with an AST is to remember not to confuse it for its representation. Syntax and semantics are elemental, in the old-school black-magic Aleister Crowley sense. You can’t even talk about one without comparing it to the other. This is, in my personal analysis, where a lot of the works-like-magic notion about compilers comes from; but there’s nothing magical about it.

To understand this, think about what happens when you’re speaking with another human being. The art of communication is the act of transference of a thought from one mind to another, through some mutually external medium; be it speech, or writing, or hand gestures. When someone begins to speak, and you begin to listen, your brain (operationally) initially considers every single possible thing that the speaker is about to say. As they pronounce the phonemes and syllables and words of their sentence, the possibilities narrow, and different semantic clusters in your mental network go dark. Once they’ve finished their statement, given that it was well-formed, you have a reasonably precise notion of what it was that was in their mind, which they are trying to convey to you. You can duplicate it, within reason. If it was ill-formed, you might only have a vague, or even a holistically incorrect, notion.

This is the concept of an AST; the AST is the thought itself. It is a (reasonable) binary representation of the program data. I used to think of ASTs as being something like a sentence diagram; but this isn’t really the case, and it in fact held me back for some time. My script grammars were limited to only mild variations of the structure of the language they were written in. ASTs are not grammatical structures; they are more implicit than that. This is the greatest reason why you must not confuse what an AST is for what an AST might look like; it’s the difference between using it like one would a first language, and a second language.

I’m assuming that you have already read the chapter on building an efficient Scanner. I’m also encouraging you to keep the Glossary open in another tab, as I may accidentally jump ahead in my definitions in an early edition of this piece. (Promise, I’ll review it with a giant whiteboard and lots of draft-mechanics before I declare the series finished; you are always free to scream at me about it in the comments.)

So, as an example, let’s consider basic English. Actually, let’s consider Bad English, because I’m not willing to jot down every single rule for a language with a lexicon three times larger than any other western language. Bad English will have a limited set of rules, and will generate a set of only a few basic sentences. For further simplicity, they will all be declarative, much like a programming language.

The first step, before you even begin to piece together an AST, is to write down the rules of your language in a formal specification; that is, a specification with a distinct set of rules to its grammar, which often must be known well by the reader, and allows for little chance of misinterpretation. An informal specification is usually presented in natural language, like this article; it is understood by a much wider audience, but is subject to considerable misinterpretation in contrast.

I recommend Backus-Naur Form (BNF). BNF has a number of translators written for it, often involving meta-programming which generates usable source code. I won’t be using any here, but they do exist. The chief notion for writing it is for you, particularly as you are creating a new language.

Primary BNF looks something like this.

Sentence ::= Subject Verb Object .
Subject  ::= I
Subject  ::= a Noun
Subject  ::= the Noun
Object   ::= me
Object   ::= you
Object   ::= a Noun
Object   ::= the Noun
Noun     ::= dog
Noun     ::= stick 
Noun     ::= hog
Verb     ::= ate
Verb     ::= caught
Verb     ::= bought

Before I go further, I should discuss the concept of a terminal, and nonterminal, symbol. All of the words in the above definition are some form of symbol; a terminal symbol can be parsed no further. A nonterminal symbol can be broken down. Any program syntax consists of a finite set of nonterminal symbols, and a finite set of terminal symbols. In the example above, “Sentence” is a nonterminal symbol, as it can be inspected to retrieve a Subject, a Verb, and an Object. “I” and “dog” are terminal symbols, as they cannot be broken down further.

Note that the BNF structure isn’t ideal yet. The “::=” is pronounced “may consist of”, and means as much. A sentence is a subject, followed by a verb, followed by an object, followed by a terminal period. (Subject-verb without object is a syntax error in Bad English.) It is thus represented as “Sentence ::= Subject Verb Object .” So what is a Subject? A subject may consist of “I”, may consist of “a Noun”, and may consist of “the Noun”. A Noun may consist of “dog”, “stick”, or “hog”.

A simple AST would be a Sentence tree, referencing a Subject tree, a Verb tree, and an Object tree, with an implicit period. You might already see how this could be parsed to create a sequence of generated functions, regular expressions, and switch-case statements, which would work as a parser and produce a Sentence tree. But remember how I said that the BNF was not complete? While grammatically correct and certainly easier to form, the biggest issue with making this deterministic is the repeated definitions.

You need to take something like the four definitions for “Object” and condense them into one. In BNF syntax, this is usually done with a vertical bar “|”, meaning “or”. Condensing the nonterminal definitions, we have the much more palatable:

Sentence ::= Subject Verb Object .
Subject  ::= a Noun | the Noun | I
Object   ::= me | you | a Noun | the Noun
Noun     ::= dog | stick | hog
Verb     ::= ate | caught | bought

Much nicer, yes? But still not ideal. BNF has been around for a while; it goes back as far as ALGOL 58, in 1959, when John Backus was looking at Noam Chomsky’s work with context-free grammars. The first (arguable) regular expressions were actually developed in 1956 by Stephen Kleene, but they were pretty primitive. Much of the format of BNF served as a major influence on their later development.

The above works fine for such a simple language as Bad English, as read by a human being with our own translating and interpreting internal hardware. For a program, the simpler the better, remembering that the most important factor going into BNF is preservation of syntax. Consider this instead:

Sentence ::= Subject Verb Object .
Subject  ::= (a | the) Noun | I
Object   ::= (a | the) Noun | me | you
Noun     ::= (d|h)og | stick
Verb     ::= ate | (ca|bo)ught

This is perfectly valid, and even more concise. Remove the whitespace, and any one of them can easily be condensed into a working regular expression. More importantly, redundancies become very easy to catch on implementation.
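
As a concrete check, the Noun rule above really does collapse into a working java.util.regex expression once the whitespace is removed:

```java
import java.util.regex.Pattern;

public class BadEnglishRegex {
    public static void main(String[] args) {
        // "Noun ::= (d|h)og | stick", whitespace removed:
        Pattern noun = Pattern.compile("(d|h)og|stick");
        System.out.println(noun.matcher("dog").matches());    // true
        System.out.println(noun.matcher("hog").matches());    // true
        System.out.println(noun.matcher("stick").matches());  // true
        System.out.println(noun.matcher("bog").matches());    // false
    }
}
```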

So How Do I Use It?

When you are constructing a language compiler or interpreter, especially from the ground up, always write these out first. Often, it will be exclusively for your own reference. Everything you’re about to do with the Parser module is encapsulated in that BNF-formatted data.

I have an abbreviated format, which I usually use, which basically has the type name, “::=”, and a regular expression for the sequence of elements that this type contains. There are ups and downs to using regular expressions directly; the biggest issue that I run into is single-member may-consist-ofs (such as “Identifier ::= Numeric | Nominal”); these can easily be implemented by simply having the classes for Numeric and Nominal extend a class for Identifier. Then it’s just a question of «expected class».isAssignableFrom(«provided class»). Ultimately, if you can comfortably implement it, then yes, you can do it. I’ll be showing you another way, which I feel is often a cleaner implementation.
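
The isAssignableFrom trick, in miniature; the class names are just for illustration:

```java
public class AssignableDemo {
    static class Identifier {}
    static class Numeric extends Identifier {}
    static class Nominal extends Identifier {}

    public static void main(String[] args) {
        // "Identifier ::= Numeric | Nominal" via subclassing:
        // any Numeric or Nominal satisfies a slot expecting an Identifier.
        System.out.println(Identifier.class.isAssignableFrom(Numeric.class));  // true
        System.out.println(Identifier.class.isAssignableFrom(Nominal.class)); // true
        System.out.println(Numeric.class.isAssignableFrom(Identifier.class)); // false
    }
}
```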

Does This Have Something to Do With Your Caveat On the Scanner?

Why yes. Yes it does.

It’s much more convenient for a Parser to simply handle ASTs. Having it handle items which may be either ASTs, or Strings, is extremely difficult in Java. Give me one more chapter to explain the details of “Terminal” and “NonTerminal” ASTs, and then I’ll show you what needs to happen to your Scanner.

I promise it isn’t big.

[To Be Completed]

 
 
