
Monthly Archives: November 2014

The Easy Way to Import from Guava or Apache Commons

In 1991, Sun Microsystems (specifically, James Gosling) set out to answer a long-standing question. The Java programming language, at the time called Oak, was designed to cut development times to a small fraction of what they used to be. The word was “write once, run anywhere”, which for the most part was true. A program could be compiled on one machine and run on any other that supported the same virtual machine.

Don’t get me wrong; Oak was a disaster for efficiency. However, it proved that productive, virtual-machine-based software wasn’t just an idea; it was actually possible. That was huge. So Oak became Java (a word already very familiar to programmers), and Sun got to work on expanding the API and improving the compiler. Honestly, Java 1.0 was also crap; but it was very exciting crap. Java 2, in my humble opinion, was where it really took off.

During this time (Oak to now), a lot of features were added: just-in-time compilation, regular expressions, enumerations, more recently lambda expressions, and, most importantly, a gazillion bazillion classes in the JDK. All kinds of solutions to what became a very broad class of problems, shortening development time quite a bit for programmers. Surprisingly, the internet, and the population of Java programmers, grew faster than Sun could keep up with. Sun had the added pressure of improving the compiler, too, which didn’t help.

As a response to this frustrating lag, the Open Source community (you may picture it in a hero cape) took off and created a vast assortment of additional libraries. Many of them, such as LWJGL and JOAL, were domain-specific; some of them weren’t. Apache Commons was the first big guy to come in. It’s actually a collection of libraries, the most important of which (at least for me) was the math (now Math3) library. It offered tried and tested methods for handling complex numbers, Fourier transforms, tuples, and all sorts of awesome stuff. That meant that people previously stuck with the vanilla JDK didn’t have to write it all themselves, which saved a boatload of time.

Later, Google came up with Guava, its own contribution to the community (fully compatible with Apache Commons). Guava has neat features like bidirectional maps and very handy byte-conversion methods. Much like Apache Commons, it’s expanding all the time.
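To make that a little more concrete, here is a tiny sketch of my own (not lifted from Guava’s documentation) showing the two features I just mentioned. The class names are Guava’s; the snippet itself is made up and assumes a reasonably recent Guava version on the classpath:

import com.google.common.collect.BiMap;
import com.google.common.collect.HashBiMap;
import com.google.common.primitives.Ints;

public class GuavaTaste {
    public static void main(String[] args) {
        //A bidirectional map can be queried by key or, through inverse(), by value
        BiMap<String, Integer> ports = HashBiMap.create();
        ports.put("http", 80);
        System.out.println(ports.inverse().get(80));    //prints "http"
        
        //Handy byte conversion: an int to its big-endian byte representation
        byte[] bytes = Ints.toByteArray(80);
        System.out.println(bytes.length);    //prints 4
    }
}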

In the olden days (the 1990s), it was often necessary to have the entire library as a local resource, that is, on disk. This could be an issue when you only needed a few methods out of something as large as Math3, which is an enormous library with a lot of binary data. Then came Apache Maven. I don’t intend to describe how to use Maven manually here; it isn’t something I’m an expert at, it often isn’t necessary, and there are plenty of wonderful tutorials on the internet. I’m going to describe the quick way to use it.

Maven allows libraries to be declared by name and pulled in from a remote repository, without the need to track down and manage each one by hand. As more and more computers were online 24/7, this became increasingly practical. Through a feature of the Maven build tool, a file called pom.xml, the dependencies of a project can be described and retrieved as needed. The “POM” in pom.xml stands for “Project Object Model”, which is very accurate.

So How Can I Use Maven to Import these Libraries?

My IDE of choice is Eclipse (which is not to say that there aren’t other good ones out there); there’s almost always a comparable utility native to your environment, which should work similarly to this. Eclipse has a plugin which handles Maven directly. To get it, go to the Help menu and select “Eclipse Marketplace”. Under the Eclipse.org marketplace, look up the keyword “Maven”. You will probably get quite a few “m2e” entries; the central one usually starts with “Maven Integration for Eclipse…”, and the rest generally depend on your Eclipse version.

Install it, and restart Eclipse. Next up, assuming that your project already exists, you need to turn it into a Maven project. Right-click it, select “Configure…”, and click “Convert to Maven Project”. (If it nags you about the group ID or the artifact ID, it’s because of a naïve algorithm for generating the identifiers from the project name; just remove any spaces and funny characters and try again. What these identifiers actually mean is better left to more detailed tutorials on Maven.) This will set up your Eclipse project as a Maven project as well; specifically, an M2Eclipse project.

You will now have a new file called “pom.xml” in your project directory. There are other ways to do this, but the dependency information is typically provided as raw XML, and copy/pasting it is usually fastest. Switch to the XML view of the document (currently by clicking the last lower tab, labelled “pom.xml”), and find the end of the “<build>” entries.

Right below the “</build>” tag, enter “<dependencies>”. Eclipse will often fill in the terminating tag for you. Between these two, you may enter the dependency information typically found on the web for the library you are using; such as, for LWJGL:

<dependency>
    <groupId>org.lwjgl.lwjgl</groupId>
    <artifactId>lwjgl</artifactId>
    <version>2.8.4</version>
</dependency>

Or, for Apache Commons Math3:

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-math3</artifactId>
    <version>3.0</version>
</dependency>

Or for Google Guava:

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>12.0</version>
</dependency>

Then, clean your project by going to the Project menu and clicking the “Clean” option. It’s generally good to recompile a project completely from the ground up, often called cleaning, after making a major change to its dependencies; this is often done automatically, but not always. You’ll now find that you can import any of the packages in Guava, Commons, or whichever library you have declared, without having to hunt down the API yourself.
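As a quick sanity check, a throwaway class along these lines (my own snippet, not part of either library) should now compile and run, with both dependencies resolved by Maven:

import org.apache.commons.math3.complex.Complex;
import com.google.common.collect.HashBiMap;

public class DependencyCheck {
    public static void main(String[] args) {
        //From commons-math3: complex arithmetic without writing it yourself
        Complex z = new Complex(3, 4);
        System.out.println(z.abs());    //prints 5.0
        
        //From guava: a bidirectional map, empty, but happily on the classpath
        System.out.println(HashBiMap.create().isEmpty());    //prints true
    }
}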

How Does This Change Things?

The JDK is not a small library as it is; it’s actually quite enormous. But if you have ever found yourself struggling to write an extension of a collection, or a math utility that could be used in a wide variety of projects, those efforts will now be fewer and further between. You may check the javadocs for these APIs as readily as the javadocs for the JDK, and you need not worry about increasing the disk footprint of your development environment or (much worse) your project.

 

Posted on November 22, 2014 in Programming

 


Software Language Engineering: Establishing a Parser, Part One (Early Edition, Refurbished)

The Introductory Ramble

I suppose I’ve been building up to this thing for quite a while now.

Note that this is part one of the Parser. Were this a singular project, I might be long since finished with it; tragically, I’m not that much of a jerk, and I’m trying to write a useful tutorial too. This means I need to get it right, or as right as possible, the first time; so please pardon any delays in bringing it to you. I’m also avoiding tools like JLex, so that I can convey exactly how a build tool is meant to be structured, for the sake of education.

You see, back in the days when it was a responsible thing to call oneself, I used to be a journalist. Specifically, I was a field journalist and interviewer for something called a “newspaper”, which was an antiquated piece of low-grade paper with news printed on it. It was a strange and inconvenient system, but it was what we had before the internet. Anyway, both language and ethics are important to me out of habit, which is also perhaps why I’m not a news journalist anymore.

As a small foreword, there are a lot of things which we must keep in mind while developing a general-purpose parser, as the operations of a Parser on a List<AST> are a bit of a juggle. In spite of the “One Class, One Responsibility” principle, the environment that a Parser is typically injected into requires a great deal of attention. As such, I will be splitting this section into two parts, with a blurb in the middle on the visitor pattern (which you are welcome to skip if you are already familiar with it), and a blurb on using JavaFX to render graphical ASTs. The second piece will be a completion of the parser design. (Point being, once you start reading, expect to be reading for quite a bit.)

The Parser

To begin, we have a List<AST> full of all Terminals in the program, in order. NonTerminals are where a lot of the magic happens, and each of them contains several nodes which may be either Terminals, or further NonTerminals. The biggest concern with any translator is efficiency in building this tree.

I have attempted several methods of constructing this tree. One of them involved repurposing java.util.regex, but having tried it, I now recommend against it. There are too many temptations to use forward references and a number of other tools, and they can have serious caveats. As much as I usually praise regexes for their speed, machine-generated ones are only usually fast. Relying on the structure of the code to manage the structure of the regular expression is an invitation to disaster, by which I mean, consumer remorse.

In the end, the visitor pattern won out; but if you would like to attempt it in another way, I encourage it. I’ll be documenting the visitor pattern here, as it assures a single pass through the target code.

Before I go further, let’s begin with the source code for Parser.

import java.util.List;

import oberlin.builder.parser.ast.AST;
import oberlin.builder.parser.ast.EOT;
import oberlin.builder.visitor.Visitor;

/**
 * @author © Michael Eric Oberlin Dec 15, 2014
 *
 * @param <V> the visitor type that the parser
 * uses for creating nonterminal nodes
 */
public abstract class Parser<V extends Visitor> {
    
    private final List<AST> astList;
    private ErrorReporter reporter = new ErrorReporter();
    private AST currentToken;
    private SourcePosition currentTokenPosition = new SourcePosition();
    private SourcePosition previousTokenPosition = new SourcePosition();
    protected V visitor;
    
    public Parser(V visitor, List<AST> astList, ErrorReporter reporter) {
        this.visitor = visitor;
        
        //Do a little defensive programming
        if(astList.isEmpty())
            throw new RuntimeException("AST list cannot begin at zero size");
        
        //Scan for a specific reporter, or use the default error reporter. Of
        //course error reporter can't really be null.
        if(reporter != null)
            this.reporter = reporter;
        this.astList = astList;
        
        this.currentToken = astList.get(0);
    }
    
    /**
     * Checks whether the current node is of the expected type; if so,
     * increments the token; otherwise, throws a syntactic error.
     * 
     * @param astExpected the currently anticipated node type in the list
     */
    public void accept(Class<? extends AST> astExpected) {
        if(astExpected.isAssignableFrom(currentToken.getClass())) {
            forceAccept();
        } else {
            reporter.error(new SyntaxException("Expected " +
                astExpected + ", got " + currentToken + " instead; "));
        }
    }
    
    public void forceAccept() {
        previousTokenPosition = currentTokenPosition;
        currentTokenPosition = currentTokenPosition.increment();
        try {
            currentToken = astList.get(currentTokenPosition.getStart());
        } catch(IndexOutOfBoundsException ex) {
            currentToken = new EOT();    //end of tree
        }
    }
    
    /**
     * Records the position of the beginning of a phrase.
     * This is the position of first constituent AST.
     * @param position element to record the begin index into.
     */
    public void start(SourcePosition position) {
        position.setStart(currentTokenPosition.getStart());
    }
    
    /**
     * Finish records the position of the end of a phrase.
     * This is the position of the last constituent AST.
     * @param position element to record the end index into.
     */
    public void finish(SourcePosition position) {
        position.setFinish(currentTokenPosition.getFinish());
    }
    
    /** utility method for reporting syntax errors */
    public void syntacticError(String messageTemplate,
        String tokenQuoted) {
        SourcePosition pos = currentTokenPosition;
        reporter.error(new SyntaxException(
                tokenQuoted + " " + messageTemplate + ": " +
                pos.getStart() + ".." + pos.getFinish()));
    }
    
    /**
     * Begin parsing, aiming to create the provided class
     * as a root class for the abstract syntax tree.
     * 
     * @param rootClass Class of object which should, provided
     * no exceptions, be a tree root.
     * @return complete tree, stemming from class rootClass,
     * expressing program.
     */
    public AST parse(Class<? extends AST> rootClass) {
        return visitor.visit(rootClass, this, currentTokenPosition);
    }

    public SourcePosition getPreviousTokenPosition() {
        return this.previousTokenPosition;
    }
    
    public SourcePosition getCurrentTokenPosition() {
        return this.currentTokenPosition;
    }

    public AST getCurrentToken() {
        return currentToken;
    }
    
    public V getVisitor() {
        return visitor;
    }
    
    public ErrorReporter getErrorReporter() {
        return reporter;
    }

}

It all begins with a call to parse(…), passing in the class that represents a completed program. parse(…) hands the visitor a reference to the kind of node it needs to build: a properly formed rootClass. I’m going to abstain, for now, from explaining how the visitor works; that’s covered in another section. For the moment, consider the various utility methods in Parser.

It contains the original list of ASTs, presumably all Terminals, and a number of pointer fields. The List<AST> is, strictly speaking, outside data, passed in so that a tree can be formed from it. In the class’s current state, it is very important to remember that a single Parser cannot be used on more than one chunk of code at once.
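To give a feel for how those utility methods get used, here is a purely illustrative sketch, not the actual phrase structure code (which arrives in part two), of a method recognizing a two-term sum. The Numeric and Operator node classes come from the algebra language used throughout this series, and the package locations here are my assumptions:

import oberlin.builder.parser.Parser;
import oberlin.builder.parser.SourcePosition;
import oberlin.algebra.builder.nodes.Numeric;
import oberlin.algebra.builder.nodes.Operator;

public class SumSketch {
    public void parseSum(Parser<?> parser) {
        SourcePosition position = new SourcePosition();
        parser.start(position);            //remember where the phrase began
        parser.accept(Numeric.class);      //e.g. "2"
        parser.accept(Operator.class);     //e.g. "+"
        parser.accept(Numeric.class);      //e.g. "2"
        parser.finish(position);           //record where it ended
    }
}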

An Algebraic Parser

The intention, just as with a scanner, is to make life as simple for the end programmer as possible. So, AlgebraicParser isn’t that big a deal to implement.

package oberlin.algebra.builder.parser;

import oberlin.builder.parser.Parser;

public class AlgebraicParser extends
        Parser<AlgebraicPhraseStructure> {

    @Override
    public Class<AlgebraicPhraseStructure> getPhraseStructure() {
        return AlgebraicPhraseStructure.class;
    }

}

Most of the work is done by AlgebraicPhraseStructure, an extension of an interface I’ve named PhraseStructure, which extends Visitor. Accordingly, I’ll be addressing it in full in part two. The take-away is that PhraseStructure encapsulates all of the constraints on how a command is properly formed.

Well, it forms no critique of ethics, but it at least worries about its readability.

The Take-Away

One, but hardly the only, major confusion in Parser construction is the encapsulation of the act of parsing, versus the encapsulation of the rules of parsing. They are, ultimately, two different things requiring two different programs (or in Java’s language, classes).

Parser returns a singular AST, which is formed from a number of others. Each of those is formed from a list of still more unique nodes. We have a specific advantage provided to us here, that being the enforced uniqueness of a Java object. Once the tree is completed, the remaining operations on it, such as translating it to another kind of tree (varying in language), are fascinating but relatively trivial.

Next, read about the visitor pattern, and consider reading about the GUI interface built for this; then we’ll discuss the workhorse of the scanning and parsing utility—the phrase structure.

 
 


Software Language Engineering: And Again We’re Building a Scanner Again (Early Edition)

Remember how in part one, I said that there was one final change to make to the Scanner? Here it is. Now that you know about BNF, ASTs, Nonterminals, and Terminals, I can comfortably show you my completed Scanner.

The Big Difference

The difference here is that this Scanner will not return a List<String>, so much as a List<AST> full of Terminals. Each Terminal will contain one token, that is, the String that used to be in its place. It really is a minor structural change, but it is an important one.

The only other option is to scan through the list of Strings again, match each of them, and produce a List<AST>; and that’s just inefficient. We should not need to iterate through the list twice when we can get the job done with a single pass. I have rules about efficient code, and so should you.

The Scanner (for real this time)

package oberlin.builder.scanner;

import java.util.*;
import java.util.function.*;
import java.util.logging.*;

import oberlin.builder.*;
import oberlin.builder.parser.ast.*;

/**
 * New design for scanners. Requires an enumeration of type
 * TerminalSpelling to iterate over, and a code string.
 * Returns a list of tokens.
 * 
 * @author © Michael Eric Oberlin Nov 2, 2014
 *
 */
public interface Scanner<E extends Enum<E> & TerminalSpelling> extends Function<AST, List<AST>> {
    @Override
    public default List<AST> apply(AST code) {
        Logger logger = Logger.getLogger("Scanner");
        
        //Start with a list of tokens, with the singular pre-token of the bulk of code
        List<AST> tokens = new LinkedList<>();
        tokens.add(code);
        
        //For each item found in the code, parse it, and replace the contents of tokens
        //with the parsed contents
        for(int index = 0; index < tokens.size(); index++) {
            //maintain for check
            AST origToken = tokens.get(index);
            
            List<AST> newTokens = new ArrayList<>();
            for(TerminalSpelling terminalSpelling : getSpelling().getEnumConstants()) {
                
                try {
                    newTokens.addAll(terminalSpelling.matchToken(tokens.get(index)));
                    
                    replaceItemWithCollection(tokens, index, newTokens);
                    break;
                } catch (MismatchException e) {
                    //didn't match, so continue to the next item
                    continue;
                }
            }
            
            //Final defensive check: if one token was received, and one token provided, and
            //the old is not the same as the new, then the only thing that happened was the
            //removal of irrelevant data. Thus, index should be decremented so that this new
            //item may be scanned again.
            if(newTokens.size() == 1 && !newTokens.get(0).equals(origToken)) index--;
        }
        
        //This algorithm will always terminate the token list with an empty token, the one item
        //which cannot have semantic value. So, remove it here.
        tokens.remove(tokens.size() - 1);
        
        return tokens;
    }
    
    /**
     * Internally used method which replaces an entry in a provided list with a collection.
     * 
     * @param list list to be altered
     * @param entry numeric index of replaced entry
     * @param replacement item to insert in place of current data
     */
    static <E> void replaceItemWithCollection(List<E> list, int entry, Collection<E> replacement) {
        list.remove(entry);
        list.addAll(entry, replacement);
    }
    
    /**
     * 
     * @return TerminalSpelling allocated to token recognition.
     */
    public Class<E> getSpelling();
}

So that’s your Scanner, almost exactly the same, but now it uses an enumeration of a type called TerminalSpelling. Its structure might also look a little familiar.

package oberlin.builder;

import java.util.LinkedList;
import java.util.List;
import java.util.regex.*;

import oberlin.builder.parser.ast.*;

public interface TerminalSpelling {
    
    public default List<AST> matchToken(AST ast) throws MismatchException {
        List<AST> returnable = new LinkedList<>();
        
        Pattern pattern = getPattern();
        Matcher matcher = pattern.matcher(ast.toString());
        if(matcher.find()) {
            returnable.addAll(manageToken(matcher));
            returnable.add(new Terminal(ast.toString().substring(matcher.end())));
            return returnable;
        } else throw new MismatchException("String \"" + ast +
            "\" does not match grammar pattern \"" + pattern + "\"");
    }
    
    public Pattern getPattern();
    
    /**
     * Determines, often through enumeration self-inspection, what should be done
     * with a passed token.
     * 
     * @param token the original token removed from the complete String by matchToken
     * @return appropriate value given the circumstances and implementation of TerminalSpelling.
     */
    public List<AST> manageToken(Matcher matcher);
}

And how does this new structure relate to our pilot programming language, the Algebraic Simplifier?

package oberlin.algebra.builder.scanner;

import java.util.List;

import oberlin.algebra.builder.AlgebraicSpelling;
import oberlin.builder.scanner.Scanner;

public class AlgebraicScanner implements Scanner<AlgebraicSpelling> {
    
    @Override
    public Class<AlgebraicSpelling> getSpelling() {
        return AlgebraicSpelling.class;
    }

}

Cake.

As for the TerminalSpelling enumeration?

package oberlin.algebra.builder;

import java.util.LinkedList;
import java.util.List;
import java.util.logging.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import oberlin.builder.TerminalSpelling;
import oberlin.builder.TerminalSpellingHandler;
import oberlin.builder.parser.ast.AST;
import oberlin.algebra.builder.nodes.*;

/*
 * Major changes:
 * For starters, if a chunk of the string matches a pattern, then that group
 * needs to be returned as the spelling of a token.
 * <Specific Atomic Type> ← Token ← Terminal ← AST
 * 
 * All tokens should contain a reference to their regular expression/grammar.
 * 
 */
public enum AlgebraicSpelling implements TerminalSpelling {
        //COMMENTS
        WHITESPACE(Pattern.compile("^\\s+"), GrammarType.COMMENT, new TerminalSpellingHandler<Whitespace>(){

            @Override
            public Whitespace getTerminal(String spelling) {
                return new Whitespace(spelling);
            }
            
        }),
        BLOCK_COMMENT(Pattern.compile("^/\\*.*?\\*/"), GrammarType.COMMENT, new TerminalSpellingHandler<BlockComment>(){

            @Override
            public BlockComment getTerminal(String spelling) {
                return new BlockComment(spelling);
            }
            
        }),
        LINE_COMMENT(Pattern.compile("^//.*+$"), GrammarType.COMMENT, new TerminalSpellingHandler<LineComment>(){

            @Override
            public LineComment getTerminal(String spelling) {
                return new LineComment(spelling);
            }
            
        }),
        
        //VALIDS
        NOMINAL(Pattern.compile("^[\\D&&\\w]\\w+"), GrammarType.KEEP, new TerminalSpellingHandler<Nominal>(){

            @Override
            public Nominal getTerminal(String spelling) {
                return new Nominal(spelling);
            }
            
        }),
        NUMERIC(Pattern.compile("^\\d+"), GrammarType.KEEP, new TerminalSpellingHandler<Numeric>(){

            @Override
            public Numeric getTerminal(String spelling) {
                return new Numeric(spelling);
            }
            
        }),
        OPERATOR(Pattern.compile("^[-+/\\\\÷\\*×\\^]"), GrammarType.KEEP, new TerminalSpellingHandler<Operator>(){

            @Override
            public Operator getTerminal(String spelling) {
                return new Operator(spelling);
            }
            
        }),
        EQUATOR(Pattern.compile("^!?=?[=><]"), GrammarType.KEEP, new TerminalSpellingHandler<Equator>(){

            @Override
            public Equator getTerminal(String spelling) {
                return new Equator(spelling);
            }
            
        }),
        
        //DELIMITERS
        LPAREN(Pattern.compile("^\\("), GrammarType.KEEP, new TerminalSpellingHandler<LParen>(){

            @Override
            public LParen getTerminal(String spelling) {
                return new LParen(spelling);
            }
            
        }),
        RPAREN(Pattern.compile("^\\)"), GrammarType.KEEP, new TerminalSpellingHandler<RParen>(){

            @Override
            public RParen getTerminal(String spelling) {
                return new RParen(spelling);
            }
            
        })
    ;
    
    //PRIVATE FIELDS
    private Logger logger = Logger.getLogger("AlgebraicSpelling");
    
    //ENUMERATIONS
    private enum GrammarType {
        KEEP,
        COMMENT;
    }
    
    //FIELDS
    private final Pattern pattern;
    private final GrammarType type;
    private final TerminalSpellingHandler<?> handler;
    
    //CONSTRUCTORS
    /**
     * The two underlying details for any syntax pattern are its regular expression, and its semantic meaning.
     * The pattern is simply a compiled regular expression for matching the item. The type is a clue to its
     * semantic meaning, which is checked in a switch-case statement when called.
     * 
     * @param pattern compiled regular expression identifying the token
     * @param type clue as to what should be done with the token once found
     */
    private AlgebraicSpelling(Pattern pattern, GrammarType type, TerminalSpellingHandler<?> handler) {
        this.pattern = pattern;
        this.type = type;
        this.handler = handler;
    }
    
    //GETTERS/SETTERS
    @Override
    public Pattern getPattern() {
        return pattern;
    }

    @Override
    public List<AST> manageToken(Matcher matcher) {
        List<AST> ret = new LinkedList<>();
        switch(this.type) {
        case KEEP:
            ret.add(handler.getTerminal(matcher.group()));
            break;
        case COMMENT:
            //Just ignore it
            break;
        }
        return ret;
    }

}

A bit longer, but as you can see, still very simple. Before I address the minor details reserved for a language implementation, allow me to give you the trivial definition of TerminalSpellingHandler:

package oberlin.builder;

public interface TerminalSpellingHandler<E extends Terminal> {
    public E getTerminal(String spelling);
}

Barely even worth a class, really. It could easily just be a java.util.function.Function<String, E extends Terminal>. It remains a candidate for expansion, should you need it; and unlike Function, can easily be altered to throw a non-runtime exception. (Should you find yourself needing to throw such a thing.)

So, by my recommended structure, each enumeration constant contains three simple things: a java.util.regex.Pattern describing how the token appears (always beginning with the metacharacter “^” to anchor it to the beginning of the String), an enumeration describing what to do with the token once it’s found, and a functional interface that produces a Terminal from the token. I will concede that some of these items are candidates for defaults in the interface; but as you know, not all languages are structured the same, and for the moment I leave this to you.
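Before moving on, here is a minimal usage sketch of my own: feed the raw source in as a single Terminal, and get back the list of Terminal tokens. The expected output assumes the grammar above behaves as described:

import java.util.List;

import oberlin.builder.Terminal;
import oberlin.builder.parser.ast.AST;
import oberlin.algebra.builder.scanner.AlgebraicScanner;

public class ScannerDemo {
    public static void main(String[] args) {
        //The whole expression goes in as one big pre-token
        List<AST> tokens = new AlgebraicScanner().apply(new Terminal("ab = 2 + 2"));
        
        //Whitespace is discarded, so this should print: ab, =, 2, +, 2
        for(AST token : tokens) {
            System.out.println(token);
        }
    }
}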

Our End Result

Ultimately, we have a class that returns a List<AST> full of Terminals, which is exactly what our Parser works with. It avoids redundant calls and strips away extra levels of processing, which allows for a much cleaner Θ-notation.

Additionally, the class captures almost every recurrent portion of Scanner that I can think of, and is easily and quickly extended for whatever your target language might be. This is the greater service of a properly designed class—it takes work and stress off of the programmer. It is likewise going to make the next step much easier.

Now we build us a Parser.

 
 


Software Language Engineering: Terminals, Nonterminals, and Bears (Early Edition)

The Notion of a Terminal AST

So, in the last chapter, I explained what an AST was structurally. There are formally two kinds of extensions to it. I usually implement them in their own classes, extending the base class of AST.

The first is the terminal. If you’ve programmed in Java for even a month, you know that having a method which accepts two different kinds of unrelated classes in place of one another is a bad idea for all kinds of reasons.

It is, actually, possible.

That just doesn’t mean that you should do it.

ASTs are formally called trees, but what they are is nodes on a tree. A program is a single AST, with an extension typically called “Program” as the overarching root node. The branch nodes, or nodes with child nodes, are called non-terminal nodes; the end points are called terminal nodes.

Each of those tokens that your Scanner kicked out? Yeah, that’s a terminal node, disguised as a String.

Let me offer you some of my code for ASTs, Terminals, and NonTerminals. (As before, there are major issues that I’m leaving out until later on. See if you can catch them.)

package oberlin.builder.parser.ast;

import oberlin.builder.codegenerator.RuntimeEntity;
import oberlin.builder.visitor.*;

/**
 * Abstract Syntax Tree, capable of representing any sequence of 
 * statements or the entire program.
 * 
 * @author © Michael Eric Oberlin Nov 3, 2014
 *
 */
public interface AST {
    /**
     * @return number of sub-elements contained in this tree node.
     */
    public int getElementCount();
}

package oberlin.builder;

import oberlin.builder.parser.ast.AST;

/**
 * Basis of all complete abstract syntax trees. Terminals are basically isolated tokens known only by their spellings.
 * 
 * @author © Michael Eric Oberlin Nov 5, 2014
 *
 */
public class Terminal implements AST {
    private final String spelling;
    
    public Terminal(String spelling) {
        this.spelling = spelling;
    }
    
    public final String getSpelling() {
        return this.spelling;
    }

    @Override
    public String toString() {
        return getSpelling();
    }

    @Override
    public final int getElementCount() {
        return 1;
    }
}

Let those soak in a little. Note that each Terminal has a length of one, meaning that it is the only member of its immediate tree. That will be important when we develop our Parser.

A terminal is an instance of an AST, and can be created by simply passing its token to it. The token is stored in the field “spelling”. Terminal is also fully extensible; even though its token is consistently its only member, there is a significant difference between an equals operator, a binary operator, and numerical data, and nonterminal nodes take that difference quite seriously.

The Notion of a Nonterminal AST

A nonterminal AST is an AST built not from characters in a String, but from a sequence of other ASTs. The constituent ASTs can be terminals, or nonterminals. Remember BNF? Every item listed before may-consist-of (::=) was a nonterminal. In instances of single-member representation, such as:

Noun ::= ProperNoun

“Noun” is still a nonterminal, as the implication is that (at least in theory) multiple types of items can be its constituents.

The “Program” node is of course always a nonterminal. I’ve written a nice slice of code for them, too.

package oberlin.builder;

import java.util.*;

import oberlin.builder.parser.ast.AST;

public abstract class NonTerminal implements AST {
    private final List<AST> astList;
    
    public NonTerminal(AST... astList) throws MismatchException {
        if(!checkTypes(astList)) throw new MismatchException("Nonterminal class " + this.getClass() + " does not match " +
                "expression.");
        
        List<AST> list = new ArrayList<>();
        for(AST ast : astList) {
            list.add(ast);
        }
        this.astList = list; 
    }
    
    public NonTerminal(List<AST> astList) throws MismatchException {
        try {
            this.astList = resolveTypes(astList);
        } catch(BuilderException ex) {
            throw new MismatchException(ex);
        }
    }
    
    public abstract List<Class<? extends AST>> getExpectedASTTypes();
    
    /**
     * Check to see that all provided ASTs are some extension of the expected class of AST,
     * and create the internal list of ASTs from it if possible.
     * 
     * @param astList current list of program ASTs
     * @return the constituent ASTs, in order, provided they match the expected types
     */
    private List<AST> resolveTypes(List<AST> astList) throws BuilderException {
        List<AST> ownASTs = new ArrayList<>();
        List<Class<? extends AST>> astTypes = getExpectedASTTypes();
        
        for(int i = 0; i < astTypes.size(); i++) {
            Class<? extends AST> provided, expected;
            provided = astList.get(i).getClass();
            expected = astTypes.get(i);
            if(!expected.isAssignableFrom(provided)) {
                throw new BuilderException("Cannot get " + expected + " from " + provided);
            }
            ownASTs.add(astList.get(i));
        }
        return ownASTs;
    }
    
    /**
     * Check to see that all provided ASTs are some extension of the expected class of AST.
     * 
     * @param astList current array of program ASTs 
     * @return true if the first ASTs match the expected ones, false otherwise
     */
    private boolean checkTypes(AST... astList) {
        List<Class<? extends AST>> astTypes = getExpectedASTTypes();
        
        for(int i = 0; i < astList.length; i++) {
            Class<? extends AST> provided, expected;
            provided = astList[i].getClass();
            expected = astTypes.get(i);
            if(!provided.equals(expected)) {
                return false;
            }
        }
        return true;
    }
    
    @Override
    public int getElementCount() {
        return astList.size();
    }
}

This is where the critical detail is left out of my code. If you already see it, then congratulations! In any case, I’ll let you all in on it at the end of the chapter. It goes back to the advantage of regular expressions over manually checking lists.

In “Programming Language Processors in Java”, David A. Watt and Deryck F. Brown use a visitor pattern to scan for these. That’s a perfectly valid approach. I get the same advantages through my constructor-or-exception pattern, which you have seen before. In fact, if you’re careful, you may notice that it’s another format of the same general pattern; without the traditional Visitor and Element classes. In my github repo, you might notice that I have the outline of a Visitor pattern implemented on these. Pay it no mind, it is proving unnecessary.

Still a good idea. Just unnecessary.

What Do They Mean to a Parser?

A Parser’s job is thus. Iterate through a list of Terminal syntax tree nodes. Compare the beginning of the list to established members of a list of nonterminal nodes. Find a match.

Truncate the matched beginning of the list of ASTs; that is, remove every element that is a member of the new NonTerminal. Now that they are removed, insert that NonTerminal at the beginning of the list, in their place. Repeat this process until the list of ASTs is of size one, and return that singular AST.

And of course, if the program cannot be parsed down to a singular AST, throw a Parser Error.
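In other words, and this is only an outline of the loop just described rather than the finished Parser, the reduction looks something like this, with matchNonTerminal standing in for the rule-matching that the phrase structure will eventually perform:

import java.util.List;

import oberlin.builder.NonTerminal;
import oberlin.builder.parser.ast.AST;

public class ReductionSketch {
    public static AST reduce(List<AST> asts) {
        while(asts.size() > 1) {
            //Compare the head of the list against the known nonterminal rules
            NonTerminal match = matchNonTerminal(asts);
            if(match == null)
                throw new RuntimeException("Parser error: no rule matches " + asts.get(0));
            
            //Remove every element that became a member of the new NonTerminal...
            asts.subList(0, match.getElementCount()).clear();
            //...and put that NonTerminal at the beginning of the list, in their place
            asts.add(0, match);
        }
        return asts.get(0);
    }
    
    //Placeholder: in the real builder, matching is the job of the phrase structure
    private static NonTerminal matchNonTerminal(List<AST> asts) {
        return null;
    }
}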

Caveats

These are meant to be identified through members of an enumeration; just as tokens were. However, you might remember BNF items such as:

Command ::= Command+

That is, a variable number of Command nodes can be condensed into the immediate tree of a single Command node. This is an issue for a simple list of ASTs: as provided, NonTerminal simply attempts to match a fixed sequence of AST types against the provided list of ASTs, one by one. This is almost always inadequate for a programming language.

I’m going to step around writing a whole new series on how regular expressions work. (For now.) The important detail is that a minimum and maximum number of items, extending a specific AST class, need to be matched. I was tempted to call these “Clauses”, but that just barely relates to the actual definition of “clause”; so instead, we’ll borrow from regular expression terminology and call them Groups.

The down side? I’m still implementing Groups. They will have their own chapter though.

Now, we learn how to build a Scanner. (Again. Sort of.)

 


A Mixed Blessing

Git reads me my sins
I pull from ancient repos
Code no man should see


		
 

Posted on November 3, 2014 in Poetry, Programming

 
