RSS

Software Language Engineering: Intro and the Scanner

12 Oct

(Early Edition)

[My project has evolved a little since I started this. Allow me to make a few modifications for clarity.]

Introduction

So, back to Software Language Engineering. Well, not really; I literally just wrote the preface, and I’m a little burnt out. However, I think it’s only fair that I walk as much as I talk, so allow me to begin this by explaining the core concept of a scanner, and providing you with a little example code. This entry will be fair game for revision for a very long time; but that’s for the sake of humility, not inaccuracy. I might even separate it into two distinct parts; one on theory, the other on implementation. If anyone’s voting for that discretion, please mention it in the comments. Also note that I am reserving the right to back-propagate any major change I make to any part of this program later on; thus even after completion this article may never be static. However, I do promise to compile and run the code myself before putting it up. You may quite liberally scream at me in the comments if I still muck something up.

The Scanner

The notion of a build tool is to take code, often but not always human-written, and convert it to a set of native software structures that can be understood and executed by a CPU. The two initial phases of that, one of which I will be addressing here, are scanning, and parsing. Scanning is the earliest phase, an operation which is far too easy to accidentally amalgamate with parsing. Scanning is separating the code into tokens with atomic meaning to the language. They are not organized or checked for semantic sensibility, they are simply separated and cleaned up a little.

For the human end, it’s often easier to allow as much whitespace as we care to use, wherever we should care to put it. It’s easy to do, it’s streamlined, and it isn’t hard to read. For the later stages of a typical build tool, unless otherwise specified, whitespace is noise. It is syntax without semantic, ignorable data that serves no purpose. It stops at the scanner. Likewise, a build tool is not typically concerned with the content of a structure like a text string. It is concerned with where it begins, and where it ends; its literal contents are ultimately irrelevant.

Additionally, comments are treated much like whitespace. Whether your comment character is a double forward slash “//” or a hash sign “#”, you don’t want your program worrying about what’s between it and the end of the line. These are also responsibilities of the scanner. If you’re smart, you’ll be setting up abstract classes to contain these. Most languages (every one of them that I can think of off of the top of my head) can be interpreted in this manner, so you are likely to use them many times over.

We’re going to begin with an example builder for basic algebra, with the end goal of simplifying the equation. I suggest that you create four packages in your project, as I have: oberlin.builder, oberlin.builder.scanner, oberlin.algebra.builder, and oberlin.algebra.builder.scanner. (My own code, actively under development, can be found on Github, but please note that I have pushed this to Github only sporadically of late; given the nature of the tutorial, I am more concerned with pushing working code to it than whatever I’m experimenting with.) In builder, create a launch class called Main. This will be your testing facility.

All code © Michael Eric Oberlin, October 11th 2014; and subject to the terms of the GNU General Public License Agreement.

package oberlin.builder;

/*
 * © Michael Eric Oberlin, October 11, 2014
 * 
 * This program is free for educational usage, and free to be modified in any manner; however
 * its creator cannot be held responsible for any damage done to end-users' computers from it
 * or any alteration of it.
 */
import java.io.*;
import java.util.*;

import oberlin.algebra.builder.*;
import oberlin.algebra.builder.scanner.*;

public class Main {
    
//    private Builder builder = new NullaryBuilder();
    private Builder builder = new AlgebraicBuilder();
    
    public static void main(String...args) throws IOException, BuilderException {
        new Main();
    }
    
    public Main() throws IOException, BuilderException {
        try(BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
                BufferedWriter out = new BufferedWriter(new OutputStreamWriter(System.out));) {
            
            out.write("Type \"!exit\" to exit the program\n");
            String input = "";
            Object built = null;
            while(!input.trim().toLowerCase().equals("!exit")) {
                if(input.length() != 0) {
                    built = builder.build(input);
                    
                    out.write("Builds to: " + built);
                    out.newLine();
                    out.flush();
                }
                
                out.write("Enter text: ");
                out.flush();
                input = in.readLine();
            }
        }
    }
}

The notion here is to have the least likely token that I can think of, bang-exit, be used to exit the loop. You could remove that tag, if you wanted to rely on Control-C or Eclipse’s red square to exit; but generally it’s safe to assume that not every end user has, or knows, an easy way out of an arbitrary program.

I am using BufferedReader and BufferedWriter because they’re easier to work with than System.in and System.out directly; and in all but the most desperate debugging situations, System.out.print* is fairly deprecated code—a one-line anti-pattern, really. It’s 2014, people! Learn to use loggers!

You may also note that you have two Scanner objects declared, one of which is commented out. I strongly recommend creating a Nullary* version of every interface, in place of using null. Null is more a natural state than an idea, and the inclination to use it to express any other meaning tends to lead to predicaments. (In Java 8, if you must use null, or contend with old code that uses null, consider wrapping each instance in Optionals.)

So, on running this program, you are continually presented with an inquiry for a line of text. Our build target is that line. On entry, it is sent to a Builder, which returns an object. The program reports that object, and you’re good to go for another line entry. Generally this structure is programmer-directed, but it isn’t a bad basis for something along the lines of an inline interpreter, such as the one Python uses. All of our experiments will stem from this program.

So, what is Builder?

Builder is a class that encapsulates the build process in entirety. It will reference the scanner, the parser, and every other major component. For end users, it is a “black box”; something identified entirely by its function. The programs “javac”, “rhino”, and “g++” could all stand in as a Builder. We’ll begin with an abstraction of it, like so.

package oberlin.builder;

import java.util.List;
import java.util.logging.*;

import javax.naming.OperationNotSupportedException;

import oberlin.algebra.builder.nodes.Program;
import oberlin.builder.scanner.*;
import oberlin.builder.parser.*;
import oberlin.builder.parser.ast.*;

public abstract class Builder {
    private Logger logger = Logger.getLogger("Builder");
    private Scanner<?> scanner = new NullaryScanner();
    private Parser<?> parser;
    
    public Object build(String code) throws BuilderException {
        List<AST> tokens = (List<AST>) scanner.apply(new Terminal(code, new SourcePosition()));
        
        parser = createParser(tokens);
        
        //We now have a list of Terminals, ready to be constructed
        //into an AST via a Parser.
        
        //NOTE: This is technically an exclusive part of the algebra
        //builder. It should be abstracted out.
        AST program = parser.parse(Program.class);
        
        //DEBUG
        analyzeTree(program);
        
        //TODO: Generate Object Program with code generator
        //TODO: Return Object Program
        
        //TODO: At this point, objectProgram should be our completed and
        //runnable binary. The following code is more for
        //embedded testing than anything else, and should be removed 
        //before release.
        
        StringBuilder szBuilder = new StringBuilder();
        for(AST token : tokens) {
            szBuilder.append(token).append("; ");
        }
        return szBuilder.toString();
        
    }
    
    //TODO: MICK: Remove redundancies before release.
    public AST getParseTree(String code) {
        List<AST> tokens = (List<AST>) scanner.apply(
            new Terminal(code, new SourcePosition()));
        
        parser = createParser(tokens);
        
        //We now have a list of Terminals, ready to be constructed
        //into an AST via a Parser.

        //NOTE: This is technically an exclusive part of the algebra
        //builder. It should be abstracted out.
        AST program = parser.parse(Program.class);
        
        return program;
    }
    
    public Parser<?> createParser(List<AST> tokens) {
        return new NullaryParser(tokens);
    }

    public Scanner<?> getScanner() {
        return scanner;
    }

    public void setScanner(Scanner<?> scanner) {
        this.scanner = scanner;
    }

    protected Parser<?> getParser() {
        return parser;
    }

    protected void setParser(Parser<?> parser) {
        this.parser = parser;
    }
    
    //DEBUG
    private void analyzeTree(AST ast) {
        try {
            System.out.println(ast.getContainedNodeNames());
            
            for(AST node : ast.getContainedNodes()) {
                analyzeTree(node);
            }
        } catch(OperationNotSupportedException ex) {
            System.out.println(ex.getMessage());
        }
    }
    
}

[I’m sure you see the skips and bumps characteristic of a work in progress.]

Calling the build(…) method will in turn run each component, beginning with the implemented scanner, and derive a program object from the code. So, given the rough outline of a scanner’s function, this is my recommendation for your Scanner class. A scanner’s job is not to interpret code. Keep that in the back of your mind. It is, succinctly, to divide it into a sequence of specialized tokens usually called “lexemes”. Its job is also not to determine the meaning of those lexemes, that task belongs to a parser.

Confession

One of the first complete textbooks I ever read, on the direct subject of building a build tool, was “Programming Language Processors in Java — Compilers and Interpreters”, by David A. Watt and Deryck F. Brown, Prentice Hall. It’s an excellent introduction to the subject, and I had some familiarity before I first picked it up; but it was written in 2000.

That was back in the Java 2 days. Those were dark times. A lot of wonderful and exciting things have happened in the past ten years or so, and a decent part of my compulsion to write this tutorial is to take my old methods and, well, refurbish them for the modern era. My initial draft of this code had some embarrassing parts. It scanned not only for comments and valid tokens, but for delimiters (separately), whitespace (in spite of its identical treatment to a comment), and a lot of other nonsense.

Please forgive any relic of a less experienced time that slips into my new material; Java 2 was before we even had enumerations, let alone generics, and if you were going to pick up any bad habits, it was a great time for it. I actually wrote a haiku about that. It was a long time ago, and the longer the better. Java 8, or even Java 6, is much better suited to writing such material.

End Confession

Scanners aren’t ultimately that different from Parsers, so if I wanted to, I could jump to Parser, have it back reference a specialization of itself, and write out the rules in Backus-Naur Form. But if I did that, you guys would be really mad at me and no better off, so let’s go through it logically.

Every language differs by grammar, and Scanner’s interest in that grammar is key spellings for the language lexicon. The best way to represent these spellings is through an enumeration of self-contained checks against a regular expression. It is also possible to store them in a map, but I find that to be extraneous here.

Let’s start with the outline of the abstracts.

package oberlin.builder.scanner;

import java.util.*;
import java.util.function.*;
import java.util.logging.*;

import oberlin.builder.*;
import oberlin.builder.parser.ast.*;

/**
 * New design for scanners. Requires an enumeration of type
 * TerminalSpelling to iterate over, and a code string.
 * Returns a list of tokens.
 * 
 * @author © Michael Eric Oberlin Nov 2, 2014
 *
 */
public interface Scanner<E extends Enum<E> & TerminalSpelling>
    extends Function<AST, List<AST>> {
    @Override
    public default List<AST> apply(AST code) {
        Logger logger = Logger.getLogger("Scanner");
        
        //Start with a list of tokens, with the singular pre-token of the bulk of code
        List<AST> tokens = new LinkedList<>();
        tokens.add(code);
        
        //For each item found in the code, parse it, and replace
        //the contents of tokens with the parsed contents
        for(int index = 0; index < tokens.size(); index++) {
            //maintain for check
            AST origToken = tokens.get(index);
            
            List<AST> newTokens = new ArrayList<>();
            for(TerminalSpelling terminalSpelling : getSpelling().getEnumConstants()) {
                
                try {
                    newTokens.addAll(terminalSpelling.matchToken(tokens.get(index)));
                    
                    replaceItemWithCollection(tokens, index, newTokens);
                    break;
                } catch (MismatchException e) {
                    //didn't match, so continue to the next item
                    continue;
                }
            }
            
            //Final defensive check: if one token was received, and one token provided, and
            //the old is not the same as the new, then the only thing that happened was the
            //removal of irrelevant data. Thus, index should be decremented so that this new
            //item may be scanned again.
            if(newTokens.size() == 1 && !newTokens.get(0).equals(origToken)) index--;
        }
        
        //This algorithm will always terminate the token list with an empty token, the one item
        //which cannot have semantic value. So, remove it here.
        tokens.remove(tokens.size() - 1);
        
        return tokens;
    }
    
    /**
     * Internally used method which replaces an entry in a
     * provided list with a collection.
     * 
     * @param list list to be altered
     * @param entry numeric index of replaced entry
     * @param replacement item to insert in place of current data
     */
    static <E> void replaceItemWithCollection(List<E> list,
        int entry, Collection<E> replacement) {
        list.remove(entry);
        list.addAll(entry, replacement);
    }
    
    /**
     * 
     * @return TerminalSpelling allocated to token recognition.
     */
    public Class<E> getSpelling();
}
package oberlin.builder;

import java.util.LinkedList;
import java.util.List;
import java.util.regex.*;

import oberlin.builder.parser.SourcePosition;
import oberlin.builder.parser.ast.*;

public interface TerminalSpelling {
    
    public default List<AST> matchToken(AST ast) throws MismatchException {
        List<AST> returnable = new LinkedList<>();
        
        Pattern pattern = getPattern();
        Matcher matcher = pattern.matcher(ast.toString());
        if(matcher.find()) {
            returnable.addAll(manageToken(matcher, ast.getPosition()));
            returnable.add(new Terminal(ast.toString().substring(matcher.end()), ast.getPosition()));
            return returnable;
        } else throw new MismatchException("String \"" + ast + "\" does not match grammar pattern + \"" + pattern + "\"");
    }
    
    public Pattern getPattern();
    
    /**
     * Determines, often through enumeration self-inspection, what should be done with a passed token.
     * 
     * @param token the original token removed from the complete String by matchToken
     * @return appropriate value given the circumstances and implementation of TerminalSpelling.
     */
    public List<AST> manageToken(Matcher matcher, SourcePosition position);
}

There is one item here that I might spend a few lines of English on. In Scanner, there is one chunk that you might not recognize, unless you’re some kind of an enumerations ninja, or an experimenter like me.

for(TerminalSpelling terminalSpelling : getSpelling().getEnumConstants())

The important detail is to see to it that for every implementation, getSpelling() always returns the class of the enumeration. This class has a method, getEnumConstants(), which returns a list of all members of the enumeration (provided that the class is an enumeration). Therefore, this line allows for the paging through of every single type of TerminalSpelling in the enumeration, one by one, via external iteration.

So what is TerminalSpelling in detail? Each spelling is an example of a valid or ignorable token, and how it should look. In a properly formed program in your target language, every single chunk, regardless of whether it ends up in a token or not, should match these TerminalSpelling members. Scanner is easy enough, all you really need to do to override it (in this instance, at least) is set a spelling rule for it to follow. Your language might have a self-modification feature, so making it final may not be a good idea, but the default will work for most languages. As for TerminalSpelling, that’s where the interesting stuff happens. Take, as an example, our implementation for algebraic simplification:

package oberlin.algebra.builder;

import java.util.LinkedList;
import java.util.List;
import java.util.logging.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import oberlin.builder.TerminalSpelling;
import oberlin.builder.TerminalSpellingHandler;
import oberlin.builder.parser.SourcePosition;
import oberlin.builder.parser.ast.AST;
import oberlin.algebra.builder.nodes.*;

/*
 * Major changes:
 * For starters, if a chunk of the string matches a pattern, then that group
 * needs to be returned as the spelling of a token.
 * <Specific Atomic Type> ← Token ← Terminal ← AST
 * 
 * All tokens should contain a reference to their regular expression/grammar.
 * 
 */
public enum AlgebraicSpelling implements TerminalSpelling {
        //COMMENTS
        WHITESPACE(Pattern.compile("[\\s&&[^\\r\\n]]+"/*"^\\s+"*/), GrammarType.COMMENT, new TerminalSpellingHandler<Whitespace>(){
            //NOTE: We are looking for all whitespace other than newlines! Thus, \r and \n must be ignored.
            //Hence, the strange regex above. (To look for the same item in two places has undefined and
            //generally unpleasant behavior.)
            @Override
            public Whitespace getTerminal(String spelling, SourcePosition position) {
                return new Whitespace(spelling, position);
            }
            
        }),
        BLOCK_COMMENT(Pattern.compile("^/\\*.*?\\*/"), GrammarType.COMMENT, new TerminalSpellingHandler<BlockComment>(){

            @Override
            public BlockComment getTerminal(String spelling, SourcePosition position) {
                return new BlockComment(spelling, position);
            }
            
        }),
        LINE_COMMENT(Pattern.compile("^//.*+$"), GrammarType.COMMENT, new TerminalSpellingHandler<LineComment>(){

            @Override
            public LineComment getTerminal(String spelling, SourcePosition position) {
                return new LineComment(spelling, position);
            }
            
        }),
        
        //VALIDS
        NOMINAL(Pattern.compile("^[\\D&&\\w]\\w+"), GrammarType.KEEP, new TerminalSpellingHandler<Nominal>(){

            @Override
            public Nominal getTerminal(String spelling, SourcePosition position) {
                return new Nominal(spelling, position);
            }
            
        }),
        NUMERIC(Pattern.compile("^\\d+"), GrammarType.KEEP, new TerminalSpellingHandler<Numeric>(){

            @Override
            public Numeric getTerminal(String spelling, SourcePosition position) {
                return new Numeric(spelling, position);
            }
            
        }),
        OPERATOR(Pattern.compile("^[+-/\\\\÷\\*×\\^]"), GrammarType.KEEP, new TerminalSpellingHandler<Operator>(){

            @Override
            public Operator getTerminal(String spelling, SourcePosition position) {
                return new Operator(spelling, position);
            }
            
        }),
        EQUATOR(Pattern.compile("^!?=?[=><]"), GrammarType.KEEP, new TerminalSpellingHandler<Equator>(){

            @Override
            public Equator getTerminal(String spelling, SourcePosition position) {
                return new Equator(spelling, position);
            }
            
        }),
        
        //DELIMITERS
        LPAREN(Pattern.compile("^\\("), GrammarType.KEEP, new TerminalSpellingHandler<LParen>(){

            @Override
            public LParen getTerminal(String spelling, SourcePosition position) {
                return new LParen(spelling, position);
            }
            
        }),
        RPAREN(Pattern.compile("^\\)"), GrammarType.KEEP, new TerminalSpellingHandler<RParen>(){

            @Override
            public RParen getTerminal(String spelling, SourcePosition position) {
                return new RParen(spelling, position);
            }
            
        }),
        //NOTE: "\\r?\\n" also qualifies as whitespace! Remove it from the whitespace search!
        NEWLINE(Pattern.compile("\\n"), GrammarType.KEEP, new TerminalSpellingHandler<NewLine>(){
            @Override
            public NewLine getTerminal(String spelling, SourcePosition position) {
                return new NewLine(spelling, position);
            }
        })
    ;
    
    //PRIVATE FIELDS
    private Logger logger = Logger.getLogger("AlgebraicSpelling");
    
    //ENUMERATIONS
    private enum GrammarType {
        KEEP,
        COMMENT;
    }
    
    //FIELDS
    private final Pattern pattern;
    private final GrammarType type;
    private final TerminalSpellingHandler<?> handler;
    
    //CONSTRUCTORS
    /**
     * The two underlying details for any syntax pattern are its regular expression, and its semantic meaning.
     * The pattern is simply a compiled regular expression for matching the item. The type is a clue to its
     * semantic meaning, which is checked in a switch-case statement when called.
     * 
     * @param pattern compiled regular expression identifying the token
     * @param type clue as to what should be done with the token once found
     */
    private AlgebraicSpelling(Pattern pattern, GrammarType type, TerminalSpellingHandler<?> handler) {
        this.pattern = pattern;
        this.type = type;
        this.handler = handler;
    }
    
    //GETTERS/SETTERS
    @Override
    public Pattern getPattern() {
        return pattern;
    }

    @Override
    public List<AST> manageToken(Matcher matcher, SourcePosition position) {
        List<AST> ret = new LinkedList<>();
        switch(this.type) {
        case KEEP:
            ret.add(handler.getTerminal(matcher.group(), position));
            break;
        case COMMENT:
            //Just ignore it
            break;
        }
        return ret;
    }

}

So, every member is initialized to a specific regex pattern, and a notion of whether it is meant to be kept, or thrown out. You may also note that pure whitespace, aside from any new line character, is treated as a comment; this may not be the case for your language, but it is for most. In Java, as an example, whitespace only serves to delimit one token from another before compilation, unless it’s surrounded by other delimiters like quotation marks.

Scanner has a List, initially containing the bulk code. This code is passed, one token at a time, to the each member of TerminalSpelling; when one of them matches it, they return a new list of internal tokens. Those new tokens are spliced into the original list where the old token was.

Unless the recognizing spelling returned a single token that is, somehow, different from the token passed in (such as when whitespace or a comment is removed), index is incremented; otherwise, measures are in place to keep the index the same, and the token is scanned again until it is a single token in and of itself.

In the end, of course, the last psuedo-token (the remaining bulk of code) will be (or should be) an empty string. It serves no further purpose, so it is removed. At this point, we have a stream of tokens, ready to be parsed.

Afterthoughts

See that? Cheesecake. My original code, based on tripe I coughed out long ago, involved a lot of unnecessary steps. Simple code is functional code; code with more effort than strictly necessary results in at least two things: programmer cognitive fatigue, and often-unrepeatable or nigh-untraceable errors. If I have any further mistakes that I need to correct, please inform me in the comments. I’ll be updating the Github repo with the new code soon. Enums are a beautiful thing.

So Am I Finished?

Maybe. If you’re just looking for a text scanner, and no further formatting needs to be done to the token, then yeah, you’re ready to go. If, however, you are looking at attaching this thing to a parser, then there’s one more thing we need to add. It’s not big, but it makes life easier. It’s Terminal AST formatting for each token. I’ll explain that in a later chapter, after I’ve justified the need for it. It’s written into the Scanner class, but it’s conceptually part of the Parser module. Essentially, you’re at least pretty much done.

Advertisements
 
5 Comments

Posted by on October 12, 2014 in Software Language Engineering

 

Tags: , , , , , , , , ,

5 responses to “Software Language Engineering: Intro and the Scanner

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: