Remember how in part one, I said that there was one final change to make to the Scanner? Here it is. Now that you know about BNF, ASTs, Nonterminals, and Terminals, I can comfortably show you my completed Scanner.
The Big Difference
The difference here is that this Scanner will not return a List<String>, so much as a List<AST> full of Terminals. Each Terminal will contain one token, that is, the String that used to be in its place. It really is a minor structural change, but it is an important one.
The only other option is to scan through the list of Strings again, match each of them, and produce a List<AST>; and that’s just inefficient. We should not need to iterate through the list twice when we can get the job done with a single pass. I have rules about efficient code, and so should you.
The Scanner (for real this time)
package oberlin.builder.scanner; import java.util.*; import java.util.function.*; import java.util.logging.*; import oberlin.builder.*; import oberlin.builder.parser.ast.*; /** * New design for scanners. Requires an enumeration of type * TerminalSpelling to iterate over, and a code string. * Returns a list of tokens. * * @author © Michael Eric Oberlin Nov 2, 2014 * */ public interface Scanner<E extends Enum<E> & TerminalSpelling> extends Function<AST, List<AST>> { @Override public default List<AST> apply(AST code) { Logger logger = Logger.getLogger("Scanner"); //Start with a list of tokens, with the singular pre-token of the bulk of code List<AST> tokens = new LinkedList<>(); tokens.add(code); //For each item found in the code, parse it, and replace the contents of tokens //with the parsed contents for(int index = 0; index < tokens.size(); index++) { //maintain for check AST origToken = tokens.get(index); List<AST> newTokens = new ArrayList<>(); for(TerminalSpelling terminalSpelling : getSpelling().getEnumConstants()) { try { newTokens.addAll(terminalSpelling.matchToken(tokens.get(index))); replaceItemWithCollection(tokens, index, newTokens); break; } catch (MismatchException e) { //didn't match, so continue to the next item continue; } } //Final defensive check: if one token was received, and one token provided, and //the old is not the same as the new, then the only thing that happened was the //removal of irrelevant data. Thus, index should be decremented so that this new //item may be scanned again. if(newTokens.size() == 1 && !newTokens.get(0).equals(origToken)) index--; } //This algorithm will always terminate the token list with an empty token, the one item //which cannot have semantic value. So, remove it here. tokens.remove(tokens.size() - 1); return tokens; } /** * Internally used method which replaces an entry in a provided list with a collection. * * @param list list to be altered * @param entry numeric index of replaced entry * @param replacement item to insert in place of current data */ static <E> void replaceItemWithCollection(List<E> list, int entry, Collection<E> replacement) { list.remove(entry); list.addAll(entry, replacement); } /** * * @return TerminalSpelling allocated to token recognition. */ public Class<E> getSpelling(); }
So that’s your Scanner, almost exactly the same, but now it uses an enumeration of a type called TerminalSpelling. Its structure might also look a little familiar.
package oberlin.builder; import java.util.LinkedList; import java.util.List; import java.util.regex.*; import oberlin.builder.parser.ast.*; public interface TerminalSpelling { public default List<AST> matchToken(AST ast) throws MismatchException { List<AST> returnable = new LinkedList<>(); Pattern pattern = getPattern(); Matcher matcher = pattern.matcher(ast.toString()); if(matcher.find()) { returnable.addAll(manageToken(matcher)); returnable.add(new Terminal(ast.toString().substring(matcher.end()))); return returnable; } else throw new MismatchException("String \"" + ast + "\" does not match grammar pattern + \"" + pattern + "\""); } public Pattern getPattern(); /** * Determines, often through enumeration self-inspection, what should be done * with a passed token. * * @param token the original token removed from the complete String by matchToken * @return appropriate value given the circumstances and implementation of TerminalSpelling. */ public List<AST> manageToken(Matcher matcher); }
And how does this new structure related to our pilot programming language, the Algebraic Simplifier?
package oberlin.algebra.builder.scanner; import java.util.List; import oberlin.algebra.builder.AlgebraicSpelling; import oberlin.builder.scanner.Scanner; public class AlgebraicScanner implements Scanner<AlgebraicSpelling> { @Override public Class<AlgebraicSpelling> getSpelling() { return AlgebraicSpelling.class; } }
Cake.
As for the TerminalSpelling enumeration?
package oberlin.algebra.builder; import java.util.LinkedList; import java.util.List; import java.util.logging.*; import java.util.regex.Matcher; import java.util.regex.Pattern; import oberlin.builder.TerminalSpelling; import oberlin.builder.TerminalSpellingHandler; import oberlin.builder.parser.ast.AST; import oberlin.algebra.builder.nodes.*; /* * Major changes: * For starters, if a chunk of the string matches a pattern, then that group * needs to be returned as the spelling of a token. * <Specific Atomic Type> ← Token ← Terminal ← AST * * All tokens should contain a reference to their regular expression/grammar. * */ public enum AlgebraicSpelling implements TerminalSpelling { //COMMENTS WHITESPACE(Pattern.compile("^\\s+"), GrammarType.COMMENT, new TerminalSpellingHandler<Whitespace>(){ @Override public Whitespace getTerminal(String spelling) { return new Whitespace(spelling); } }), BLOCK_COMMENT(Pattern.compile("^/\\*.*?\\*/"), GrammarType.COMMENT, new TerminalSpellingHandler<BlockComment>(){ @Override public BlockComment getTerminal(String spelling) { return new BlockComment(spelling); } }), LINE_COMMENT(Pattern.compile("^//.*+$"), GrammarType.COMMENT, new TerminalSpellingHandler<LineComment>(){ @Override public LineComment getTerminal(String spelling) { return new LineComment(spelling); } }), //VALIDS NOMINAL(Pattern.compile("^[\\D&&\\w]\\w+"), GrammarType.KEEP, new TerminalSpellingHandler<Nominal>(){ @Override public Nominal getTerminal(String spelling) { return new Nominal(spelling); } }), NUMERIC(Pattern.compile("^\\d+"), GrammarType.KEEP, new TerminalSpellingHandler<Numeric>(){ @Override public Numeric getTerminal(String spelling) { return new Numeric(spelling); } }), OPERATOR(Pattern.compile("^[+-/\\\\÷\\*×\\^]"), GrammarType.KEEP, new TerminalSpellingHandler<Operator>(){ @Override public Operator getTerminal(String spelling) { return new Operator(spelling); } }), EQUATOR(Pattern.compile("^!?=?[=><]"), GrammarType.KEEP, new TerminalSpellingHandler<Equator>(){ @Override public Equator getTerminal(String spelling) { return new Equator(spelling); } }), //DELIMITERS LPAREN(Pattern.compile("^\\("), GrammarType.KEEP, new TerminalSpellingHandler<LParen>(){ @Override public LParen getTerminal(String spelling) { return new LParen(spelling); } }), RPAREN(Pattern.compile("^\\)"), GrammarType.KEEP, new TerminalSpellingHandler<RParen>(){ @Override public RParen getTerminal(String spelling) { return new RParen(spelling); } }) ; //PRIVATE FIELDS private Logger logger = Logger.getLogger("AlgebraicSpelling"); //ENUMERATIONS private enum GrammarType { KEEP, COMMENT; } //FIELDS private final Pattern pattern; private final GrammarType type; private final TerminalSpellingHandler<?> handler; //CONSTRUCTORS /** * The two underlying details for any syntax pattern are its regular expression, and its semantic meaning. * The pattern is simply a compiled regular expression for matching the item. The type is a clue to its * semantic meaning, which is checked in a switch-case statement when called. * * @param pattern compiled regular expression identifying the token * @param type clue as to what should be done with the token once found */ private AlgebraicSpelling(Pattern pattern, GrammarType type, TerminalSpellingHandler<?> handler) { this.pattern = pattern; this.type = type; this.handler = handler; } //GETTERS/SETTERS @Override public Pattern getPattern() { return pattern; } @Override public List<AST> manageToken(Matcher matcher) { List<AST> ret = new LinkedList<>(); switch(this.type) { case KEEP: ret.add(handler.getTerminal(matcher.group())); break; case COMMENT: //Just ignore it break; } return ret; } }
A bit longer, but as you can see, still very simple. Before I address the minor details reserved for a language implementation, allow me to give you the trivial definition of TerminalSpellingHandler:
package oberlin.builder; public interface TerminalSpellingHandler<E extends Terminal> { public E getTerminal(String spelling); }
Barely even worth a class, really. It could easily just be a java.util.function.Function<String, E extends Terminal>. It remains a candidate for expansion, should you need it; and unlike Function, can easily be altered to throw a non-runtime exception. (Should you find yourself needing to throw such a thing.)
So, by my recommended structure, the enumerations contain three simple things. Specific java.util.regex.Patterns describing how they appear (always beginning with the metacharacter “^” to denote the beginning of the String), an enumeration describing what to do with the token once it’s found, and a functional interface that acquires a Terminal from the token. I will concede that some of these items are candidates for defaults in the interface, but as you know, not all languages are structured the same and for the moment, I leave this to you.
Our End Result
Ultimately, we have a class that returns a List<AST> full of Terminals, which is exactly what our Parser works with. This allows for redundant calls and stripped levels of processing that allow for a much cleaner Θ-notation.
Additionally, the class captures almost every recurrent portion of Scanner that I can think of, and is easily and quickly extended for whatever your target language might be. This is the greater service of a properly designed class—it takes work and stress off of the programmer. It is likewise going to make the next step much easier.
Now we build us a Parser.