RSS

Category Archives: Software Language Engineering

Software Language Engineering: Analysis

(Early Edition)

So, at this point we have a clearly defined Backus-Naur form for our grammar, a working scanner for terminal tokens, and a working parser for non-terminal generation and the construction of the abstract syntax tree.

The biggest hurdles are over. However, they weren’t quite the last. One thing that must be done before any compiler or interpreter can be built is the construction of a utility for analysis. In this section, I’ll be describing the basic contract for a contextual analyzer. In the next section, I’ll be showing you some example code.

Introduction

Some of the analysis was already done. The core concept to keep in mind, when building a build tool, is the fewer the passes, the quicker your tool will run. Basic syntax errors have already been detected; as they prevent the construction of the AST for our language. However, you might notice that there are a few obtuse things that you can still do.

Enter “3 + 4 = 2” into the algebra tool, and you’ll notice that it will gobble it down just fine; even though it is concretely incorrect. This is where the second phase of analysis comes in.

Can we sweep for these while we generate the abstract syntax tree? Does your processor have more than one core? Then we absolutely can. Even if you are using a single-core, the penalty would be rather mild. However, it is important to recognize that code analysis is the work of a separate module.

Types of Analysis

There are two major forms of analysis to worry about: syntactic analysis, and contextual analysis.

Syntactic Analysis

Syntactic analysis is almost always the responsibility of the parser. Contextual analysis depends on it. Syntactic analysis is the process that generates the phrase structure of the program; without it, further phases are obliquely impossible. It’s more commonly known as parsing, and if you’re following this tutorial in sequence, you’ve already done it. If not, there are four preceding chapters dedicated to explaining it in detail.

Generally, I recommend, on the establishment of a syntax error during syntactic analysis, simply skipping the node and checking for what might be next. This is not an issue for small programs, much less one-line programs; but for larger utilities and libraries it is vanishingly uncommon for the number of bugs to be limited to one. Often, knowledge of the effect on another, later, mistake is critical to the creation of a satisfactory solution,

As a side effect of continuing the scan, the error reporter may have a hundred additional syntax errors to report, even though they all reference the same mistake. This can explode exponentially. Accordingly, for a final edition of a builder, it is best to limit the number of reported errors before the program calls it quits. For Javac, the limit is a hundred errors, unless the -Xmaxerrs and -Xmaxwarns flags are set to a higher value.

On the completion of syntactic analysis, without error, we have a singular tree with a single root node, most commonly called Program. If syntactic analysis does not complete properly, it is still possible to proceed to contextual analysis, but no further, as erroneous code has an arbitrary interpretation. Computers require determinism.

Contextual Analysis

So, as of contextual analysis, we have a complete abstract syntax tree. The remaining question is, does the correctly formed code also conform to the controls of the language? Is a symbol used before declaration, when the language demands that it not be? Is a variable used outside of its proper scope? Are there duplicate declarations, without any rule for how to handle those declarations? The general rule is that if you cannot establish the analysis rule in BNF, then it is contextual.

After the contextual analyzer has completed its task, given that there are no show-stopping errors, it returns an AST as well. In this case, it is what’s known as a decorated syntax tree. Every user-defined symbol will maintain a node reference to its declaration in the AST, Every expression, for a language concerned about type, is demarcated with its result type.

You may remember, from the introduction to Backus-Naur Form, that it was designed for “context-free grammars”. The term “contextual analysis” more literally means analyzing extensions to the grammar that supersede the domain of BNF.

The best way to think of a proper decorated syntax tree is as an abstract syntax tree, with which any node can be taken at random and read from beginning to end, which forms a complete, definite, and concrete statement.

Procedure of Analysis

Like every class, we must begin with a concrete description of its contract. This includes its responsibilities, and the resources made available to it. Its responsibility, in broad summary, is to find every occurrence of a contextual unknown and link it to its definition. Resources include the code itself as an abstract syntax tree, and a concrete error reporter.

Every analysis tool, the parser included, must be initialized with an error reporter. It is not recommended to make the error reporting functionality ingrained to the class, as it is often best the same error reporter used by parser (your syntax analyzer), and functionally, it has a very different contract—one class, for one responsibility.

We again apply the visitor pattern, much as we do for syntax analysis. Is it possible to use the same visitor pattern for both syntax analysis and context analysis? Technically, yes, but it is discouraged, as syntactical analysis and contextual analysis are two separate contracts. It is possible to feed the incomplete abstract syntax tree to a waiting context analyzer, but this is a tactic more sophisticated than we are ready for at this juncture. I’ll probably return to it in the final section.

To my knowledge, there is not yet a BNF-equivalent for non-context-free grammars that can easily be used for context analysis. This is not to say that there are none; if you insist on following the same pattern that you did for syntax analysis, you may consider Noam Chomsky‘s formal grammar. It uses a lot of unconventional symbols, so you may also consider getting accustomed to using a compose key.

As formal grammars, unless you are working with a set of people who are fully informed on their usage, go well outside of the bounds of this tutorial, I suggest considering the depth of complexity of your contextual grammar before resorting to them. What you will definitely need is a clear and inarguable description of what these rules are, even if it is in plain English.

The context analyzer will also, for most languages, be creating an identification table as it works. Perhaps your target language does not use variables, and has no need for one; I am assuming that it does. It is also possible that your target language does not mind late definitions, as long as there are eventually definitions. It would not be the first. For my algebra solver, I am currently assuming that it does mind; but later on, perhaps I’ll reformat it so that it doesn’t. Subsequent definitions, or even a loosely related concept called “late binding”, It isn’t as hard to do as you might initially think.

Summary Abstractions

We’ll need an abstraction of the core context analyzer. While I chose to call the syntax analyzer “Parser”, a more common term, there is no equivalent that I am aware of for the context analyzer. Thus, we’ll call it “ContextAnalyzer”. I propose a single method in ContextAnalyzer, called check(AST ast). This will initiate the visitor pattern.

Once I complete the code, I’ll highlight it to you in the next lesson.

Advertisements
 

Tags: , , , , , , , , , , , , , ,

Software Language Engineering: Establishing a Parser, Part Two (Early Edition)

Introductory

So, you’ve ready part one, and you’re at least familiar with a visitor pattern, right? If not, I strongly encourage reading the two injected sections first.

A parser delegates the vast majority of its work to a Visitor. More appropriately stated, it depends upon the Visitor in order to do its work, as the Visitor is responsible for creating the requested nodes.

PhraseStructure classes

I have a simple extension of Visitor which I have created purely for the sake of future modifications. It’s called PhraseStructure. At the moment, it looks like this:

package oberlin.builder.parser;

import oberlin.builder.*;
import oberlin.builder.parser.ast.AST;
import oberlin.builder.visitor.Visitor;

import java.util.*;

public interface PhraseStructure extends Visitor {
}

….which makes it a marker interface. However, should you or I choose to add specific behavior to the Visitor which strictly relates to this program, it’s an excellent low-footprint stand-in.

The point where it, and by that I also mean Visitor, shows its worth is in AlgebraicPhraseStructure.

package oberlin.algebra.builder.parser;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

import oberlin.builder.parser.Parser;
import oberlin.builder.parser.PhraseStructure;
import oberlin.builder.parser.SourcePosition;
import oberlin.builder.parser.ast.AST;
import oberlin.builder.parser.ast.EOT;
import oberlin.algebra.builder.nodes.*;

public class AlgebraicPhraseStructure implements PhraseStructure {
    
    private Map<Class<? extends AST>, BiFunction<Parser<?>, 
        SourcePosition, ? extends AST>> map = new HashMap<>();
    {
        map.put(Program.class, new BiFunction<Parser<?>,
                SourcePosition, AST>() {
            @Override
            public Program apply(Parser<?> parser,
                    SourcePosition position) {
                Program program = null;
                SourcePosition previous =
                    parser.getPreviousTokenPosition();
                AST currentToken = parser.getCurrentToken();
                
                Equality equality = (Equality) parser.getVisitor()
                        .visit(Equality.class, parser, previous);
                program = new Program(previous, equality);
                
                if(!(currentToken instanceof EOT)) {
                    parser.syntacticError("Expected end of program",
                            currentToken.getClass().toString());
                }
                
                return program;
            }
        });
        map.put(Equality.class, new BiFunction<Parser<?>,
            SourcePosition, AST>() {

            @Override
            public AST apply(Parser<?> parser,
                    SourcePosition position) {
                Equality equality = null;
                List<AST> nodes = new ArrayList<>();
                SourcePosition operationPosition =
                    new SourcePosition();
                
                parser.start(operationPosition);
                //parse operation
                AST operation = parser.getVisitor().visit(
                        Operation.class, parser, operationPosition);
                nodes.add(operation);
                if(parser.getCurrentToken() instanceof Equator) {
                    nodes.add(parser.getCurrentToken());
                    parser.forceAccept();
                    nodes.add(parser.getVisitor().visit(
                            Operation.class, parser,
                            operationPosition));
                } else {
                    parser.syntacticError("Expected: equator",
                            Integer.toString(
                                parser.getCurrentToken().getPosition()
                                .getStart()));
                }
                parser.finish(operationPosition);
                
                equality = new Equality(operationPosition, nodes);
                return equality;
            }
            
        });
        map.put(Operation.class, new BiFunction<Parser<?>,
            SourcePosition, AST>() {

            @Override
            public AST apply(Parser<?> parser,
                SourcePosition position) {
                
                Operation operation = null;
                List<AST> nodes = new ArrayList<>();
                SourcePosition operationPosition =
                    new SourcePosition();
                
                parser.start(operationPosition);
                //parse identifier
                AST identifier = parser.getVisitor().visit(
                        Identifier.class,
                        parser, operationPosition);
                nodes.add(identifier);
                //look for operator
                if(parser.getCurrentToken() instanceof Operator) {
                    nodes.add(parser.getCurrentToken());
                    parser.forceAccept();
                    nodes.add(parser.getVisitor().visit(
                            Operation.class,
                            parser, operationPosition));
                }
                parser.finish(operationPosition);
                
                operation = new Operation(operationPosition, nodes);
                return operation;
            }
            
        });
        map.put(Identifier.class, new BiFunction<Parser<?>,
            SourcePosition, AST>() {

            @Override
            public AST apply(Parser<?> parser,
                SourcePosition position) {
                
                Identifier identifier = null;
                List<AST> nodes = new ArrayList<>();
                SourcePosition identifierPosition = new SourcePosition();
                
                parser.start(identifierPosition);
                if(parser.getCurrentToken() instanceof LParen) {
                    nodes.add(parser.getCurrentToken());
                    parser.forceAccept();
                    
                    nodes.add(getHandlerMap().get(Operation.class)
                            .apply(parser, identifierPosition));
                    parser.accept(Operation.class);
                    
                    nodes.add(parser.getCurrentToken());
                    parser.accept(RParen.class);
                } else if(parser.getCurrentToken()
                    instanceof Nominal) {
                    nodes.add(parser.getCurrentToken());
                    parser.forceAccept();
                } else if(parser.getCurrentToken()
                    instanceof Numeric) {
                    nodes.add(parser.getCurrentToken());
                    parser.forceAccept();
                } else {
                    parser.syntacticError(
                            "Nominal or numeric token expected",
                            parser.getCurrentToken().getClass()
                                 .toString());
                }
                parser.finish(identifierPosition);
                identifier =
                    new Identifier(identifierPosition, nodes);
                
                return identifier;
            }
            
        });
    }
    
    @Override
    public Map<Class<? extends AST>, BiFunction<Parser<?>,
        SourcePosition, ? extends AST>> getHandlerMap() {
        // TODO Auto-generated method stub
        return map;
    }
    
}

For all of the code, you’ll note that there’s only one method. getHandlerMap() returns a map, intrinsic to the PhraseStructure, which maps classes (of any extension of AST) to functions which return them. These functions, specifically BiFunctions, accept only a Parser, with all of its delicious utility methods, and a SourcePosition so that they have an idea where they’re looking. All necessary data is in those two items alone.

A Note on Source Position

If you’ve been paying very close attention, you may have noticed that SourcePosition isn’t strictly necessary to translate. You’re right, mostly; but when something goes wrong, it is SourcePosition which tells you where the problem showed up, and what you need to tinker with in order to properly format the program.

It wasn’t always like this. Early compilers would simply indicate that the program was misformatted. More likely, just print “ERROR”, as the notion of software development (which didn’t involve punching holes in punch cards) was relatively young, and ethics weren’t really a thing yet.

This wasn’t a big deal, while programs were generally only a few lines and had exceedingly small lexicons of keywords. When Grace Murray Hopper put together A-0, the idea of adding sophisticated error reporting would have seemed like over-programming; mostly because it would have been over-programming.

As time went on, and machines got more sophisticated, having an error in your code could take days to find. If you had more than one error, then you were really in trouble. So, eventually, a team came up with the idea of reporting the exact point where the format failed, and history was made. (I’m not sure who that was, so if anyone knows, please inform me through the comments.)

Today, every well-designed AST is aware of exactly where it, or its constituents, begin and end. If you want to be especially sophisticated, you can have it remember line number, and even character number, too.

Our current edition of our algebra-like language is generally one-line-only and relatively domain specific, but memorization of where the ASTs go wrong provides room for growth.

Visit Handlers

If you don’t remember specifically, Visitor’s visit is not a complicated method.

public default AST visit(Class<? extends AST> element,
        Parser<?> parser, SourcePosition position) {
    AST ast = getHandlerMap().get(element).apply(parser, position);
    return ast;
}

It simply retrieves the map, grabs the BiFunction associated with the provided element, and applies it to the parser and an initial source position. From there, all work goes on in the map.

The visit handlers themselves can get pretty messy, if you aren’t careful. They begin by initializing their specific brand of AST to null. A NullaryAST or an Optional might be better here, as I have a serious aversion to methods that can return null, but I haven’t made that change yet. This AST is the item which will be initialized through the context of local nodes.

Next, a SourcePosition is initialized. This will be the element passed to the constructor for our AST. When Parser.start(SourcePosition) is called, it updates the starting point of SourcePosition. When Parser.finish(SourcePosition) is called, it updates the end point. These are set to Parser’s currently known coordinate in the code. Thus, before anything else is done, Parser.start(…) is called.

After the SourcePosition has been started, the class of each token is checked against allowed conditions. As such, the bulk of these methods are conditionals. It’s here that I must explain the usage of Parser.accept(…) and Parser.forceAccept().

Parser.accept(…) checks the class of the current token against the provided one, and if they match, increments the internal pointers. If not, it reports a syntactic error, and leaves the pointers alone. Since the pointer is left alone, additional nodes can still be parsed, and multiple errors can be caught, even in the sake of a token simply being missing or skipped. Parser.forceAccept() always accepts the current node, regardless of its type, and increments the pointers. (In fact, it is called from within accept(…) after the conditional checks are completed.

Once all possibilities have been checked for, the AST is initialized and returned. If at any point no possibilities remain for this token, a syntax error is thrown, and the program continues to parse (even though it cannot complete the tree).

Is There Another Way to Do This?

There’s always another way, but that doesn’t mean that it’s necessarily better. One method might be catching customized exceptions on a distinct schedule, which also works pretty well; the down side is that it only allows for the detection of a single error at a time. Another would be the construction of a string representing the AST types, and usage of a regular expression on it; but as I’ve said before, the construction of improper code, even if it compiles, can create devastatingly slow regular expressions at seemingly arbitrary times.

I’ve experimented with both on the way to this code, which is precisely why writing this section took so much longer than the others. There are probably dozens of other readily available methods which I haven’t even thought of yet. One of them, somewhere, might even be faster or sufficiently more effective than the visitor pattern.

This is not me saying that the visitor pattern is perfect, either. This implementation of visitor has a lot of marks against it. It is extremely tightly coupled, for starters, as loose as the interface alone may be. It uses “instanceof” all over the place, which begs for the implementation of further methods to keep to an OOP standard. It has many anonymous classes around, which substantially increase the memory footprint. The slightest of mistakes in the layout of the visitor functions will result in an unbounded recursion, which will quickly and almost silently crash your program, so it is not recommended for avant garde programming—always start with a properly reduced Backus Naur Form of your language. I could go on, such as with the many potential issues with secondary delegation, which the visitor pattern survives on, but this more than covers it.

My advice? Use ethical method names, comment until your fingers bleed, trigger exceptions everywhere the program pointer shouldn’t be, and benchmark benchmark benchmark. In select cases, the Visitor is still your friend, provided that you treat it like a sacred relic wired to a bomb.

Final Notes

You may notice that this is awfully similar to the enumeration used for scanning. You can, in fact, create a Scanner from a Parser, by treating every character as an individual token. However, this has not been done in a long time, as regular expressions are quite reliable for cases like this. I may yet develop a Scanner from a Parser, but only as an example, this does not mean that I recommend it.

You can think of the individual differences between one language and another as the impulse behind these enumerations and mappings. Parser will always be Parser, PhraseStructure will always be PhraseStructure. However, when you need to compile a specific language into an AST tree, the features that make that language what it is can all be stored in the enumerations and maps. Because of this, this API allows for rapid construction of builders.

Next, we talk about identification tables.

 

Tags: , , , , , , , , , , , , , ,

Visual Feedback on an Abstract Parsing Tree with JavaFX

I honestly didn’t expect to be writing this, but it seems fair.

In the past few editions, I’ve been discussing the AST. It can be overwhelmingly complicated for a complete program; so I’ve been using a simple, single line equation as the sample. Unfortunately, that isn’t very realistic; and it would be very helpful to have a procedurally generated visual tree available. That tree is what this lesson is all about.

At first I considered using a graphical tree style, like javax.swing.JTree; but that can be painfully over-simplistic in itself. I would prefer to outline the material the same way I would draw it on a white board (which, if you’re wondering, I do). The best way to do this? JavaFX.

JavaFX whiteboard abstract syntax tree

Graphical AST tree rendering, through JavaFX/F3

If you aren’t familiar with JavaFX, please do me a favor and tolerate the name. It was originally F3, for Form-Follows-Function. I kind of liked F3, until some marketer decided that “JavaFX” sounded better. Functionally speaking, it’s an excellent revision on how user interfaces are designed in Java. I fully stand behind it. It allows for XML structuring and CSS styling, just like a web page, to more hard-coded controls. This is much, much faster; and it allows for significant beauty in user interfaces. However, it works very differently from things like Swing and AWT; and while I’m certain that it isn’t the first API to do so, it takes some getting used to.

I fully intend to write a true tutorial on all of JavaFX on some point. Do you need to understand it to understand translators? Absolutely not. However, this code does work. It is not part of the Github repository, as it is technically a tangential project; but the same license (GNU GPL) applies to it and you are welcome to copy it token for token. I’ll put it up on Github as I get the chance. I’ll make a few minor comments along the way to help you follow it.

1. The Basic Application

We have exceedingly few needs for our app. It simply reads a program from a stream, parses it, and feeds the parse tree to a custom node, which displays it graphically. Accordingly, the program code is rather small. I’ll begin by displaying it, then I’ll spend a moment piecing it together in English for you.

package oberlin.builder.gui;

import oberlin.algebra.builder.AlgebraicBuilder;
import oberlin.builder.parser.ast.AST;
import javafx.application.*;
import javafx.scene.*;
import javafx.scene.layout.*;
import javafx.stage.*;

public class GUIMain extends Application {

    public static void main(String...args) {
        launch(args);
    }

    @Override
    public void start(Stage primaryStage) throws Exception {
        Pane root = new Pane();
        root.getStyleClass().add("backing");
        
        Scene scene = new Scene(root);
        scene.getStylesheets().add(GUIMain.class.getResource("tree.css").toExternalForm());
        
        primaryStage.setScene(scene);
        root.setMinWidth(640.0);
        root.setMinHeight(480.0);
        
        populate(root);
        
        primaryStage.show();
        primaryStage.centerOnScreen();
    }
    
    private void populate(Pane p) {
        /*
         * This is simply an example, so I've ignored input for now.
         * In theory, you would replace the line below (containing
         * hard code) with an input loop.
         */
        AST program = (new AlgebraicBuilder()).getParseTree("1+2");
        
        p.getChildren().add(new GUITree(program));
    }
}

All JavaFX/F3 programs begin with Application.launch(String…args). JavaFX programs run in what is effectively their own thread, and more so than with Swing-based programs. Launch parses arguments and stores them in their own object, appropriately called Parameters. They can be accessed, at any point later on, via Application.getParameters(). Our available overloads and customizations cut out for a moment, then come back in in the start(Stage) method.

Stage is basically where Frame would be; but it’s a little more complicated than that. Unlike Swing and AWT, which were designed to be platform independent, JavaFX is designed to be hardware context independent. What you are writing here will work equally well on a PC, tablet, and smart phone; as well as anything else built (now or later) that maintains a JavaFX compatibility standard. Thus, what might otherwise be called a frame or window is referred to as a stage, as it might be neither of those things.

You’ll notice that the instantiated Pane is given a style class. If you aren’t familiar with CSS, a style class is what’s used to differentiate between one element and any number of others which, otherwise, would look exactly like it. Thus, it allows CSS to pick and choose which elements of the layout it is styling at a given moment. I’ve chosen “backing” as the name for this element, as it is the backboard of our tree. You will also note that, two lines later, the CSS file itself is loaded.

Next, a Scene is created. Scenes are critically important, and distinct from stages. While a stage represents the context that the layout is drawn in, the scene represents the actual controls and constraints within that space. Thus, while many aspects of Stage are immutable (and unknowable), Scene allows for greater flexibility. JavaFX sees to it that they correspond, so don’t worry about that.

Scene is styled through its root element, which in this case is our pane. You’ll notice that instead of the stricter setWidth() and setHeight() that you might be familiar with from Swing, we are setting a minimum on these bounds. That minimum is not guaranteed, as the display may not be capable of it, but it is treated as a general rule to be followed if at all possible. In this case, I’m going for classic analog low-def TV resolution, 640 width by 480 height. (Looking back, those numbers might be inadequate, but for now they’re quite functional.) If this is too small for you, the frame—if it is a frame, anyway—is easily resizable.

Populate() is a method I wrote to add the paraphernalia to the scene; but note that afterwards we call show(). This is very important, as otherwise our stage will be constructed in memory, but never displayed to the screen. Additionally, there will be no way to kill the JavaFX thread save for a hard interrupt. Once shown, the closing of the primary stage will flag the program to terminate.

1.1. Populate

It’s a generally good habit, but not a necessary one, to populate your frame in a separate and dedicated method. This is what I do here, even though for the moment, I only have one control to add.

The AST method should be old news; it’s a stand-in, for the moment, for an actual code-reading portion. (I’m assuming that you’re looking to compile more than just “1+2”.) GUITree is a custom JavaFX node, which I will explain next. Note that to add a node to a program, you must take some structure (not yet visible) in the scene graph (stemming from your chosen root), and get its Children as a modifiable list. Then, you must add that node to the list.

Note that after a stage is visible, precious little of the scene can be changed save for through the constraints built into it. I’m not going to touch on Expressions and Bindings here, but know that if you pull something that doesn’t play by JavaFX’s rulebook, it will throw an ApplicationException and your program will not launch. Thankfully, while exceedingly picky, that rulebook is small. If you call show() and then try and add a child, you will have problems; it must be the other way around.

If you’re curious, hiding a rendered stage does not count for making it modifiable. You must give it your entire concept first, then make it visible. If you’re familiar with OpenGL, you’ll already understand why.

2. The Tree Itself

The tree is a custom JavaFX node, which I admit is rarely necessary. Still, most of the entities that make it work are core to the API.

package oberlin.builder.gui;

import oberlin.builder.parser.ast.AST;
import javafx.geometry.BoundingBox;
import javafx.geometry.Bounds;
import javafx.geometry.Point2D;
import javafx.scene.control.Tooltip;
import javafx.scene.layout.*;
import javafx.scene.paint.Color;
import javafx.scene.shape.CubicCurve;

import java.util.function.IntSupplier;

public class GUITree extends AnchorPane {
    private Bounds bounds = new BoundingBox(0, 0, 640, 480);
    private AnchorPane framing = new AnchorPane();
    private double edgeSize = 0.10;    //ten percent additional length beyond edges of framing
    
    public GUITree(AST ast) {
        this.setMinWidth(bounds.getWidth() * (1 + edgeSize));
        this.setMinHeight(bounds.getHeight() * (1 + edgeSize));
        
        configureFraming();
        
        addNode(ast);
    }
    
    private void configureFraming() {
        framing.setLayoutX(edgeSize * (bounds.getWidth() / 2.0));
        framing.setLayoutY(edgeSize * (bounds.getHeight() / 2.0));
        framing.setMinWidth(bounds.getWidth());
        framing.setMinHeight(bounds.getHeight());
        
        this.getChildren().add(framing);
    }
    
    private ASTNode addNode(AST ast) {
        return this.addNode(ast, new Marker(0), new Counter(), 0, null);
    }
    
    private ASTNode addNode(AST ast, IntSupplier stepsDown, IntSupplier stepsAcross, int index, ASTNode parent) {
        ASTNode node = new ASTNode(ast, stepsDown.getAsInt(), stepsAcross.getAsInt());
        
        //AnchorPane stuff
        calculateAnchoring(node, parent);
        
        framing.getChildren().add(index ++, node);
        
        final StringBuilder tooltipText = new StringBuilder();
        IntSupplier across = new Counter();
        for(AST kid : ast.getContainedNodes()) {
            tooltipText.append(kid.getClass().getSimpleName()).append(" ");
            ASTNode child = addNode(kid,
                    new Marker(stepsDown.getAsInt() + 1),
                    across,
                    index,
                    node);
            CubicCurve cubic = createCubicCurve(node.getNoodleRoot(), child.getTopCenter());
            framing.getChildren().add(cubic);
        }
        node.getType().setTooltip(new Tooltip(tooltipText.toString()));
        
        return node;
    }
    
    private CubicCurve createCubicCurve(Point2D p1, Point2D p2) {
        CubicCurve curve = new CubicCurve();
        
        curve.setStartX(p1.getX());
        curve.setStartY(p1.getY());
        
        curve.setEndX(p2.getX());
        curve.setEndY(p2.getY());
        
        curve.setControlX1(p1.getX());
        curve.setControlY1(p2.getY());
        
        curve.setControlX2(p2.getX());
        curve.setControlY2(p1.getY());
        
        curve.setStroke(Color.BLACK);
        curve.setStrokeWidth(2.0);
        curve.setFill(Color.TRANSPARENT);
        
        return curve;
    }
    
    private void calculateAnchoring(ASTNode node, ASTNode parent) {
        node.setOrigin(new Point2D(parent == null ? (bounds.getWidth() - node.getBounds().getWidth())/2.0 :
            justifyX(node, parent), justifyY(node)));
        AnchorPane.setTopAnchor(node, node.getOrigin().getY());
        AnchorPane.setLeftAnchor(node, node.getOrigin().getX());
    }
    
    private Double justifyX(ASTNode node, ASTNode parent) {
        final double parentCenter = (parent.getOrigin().getX() + (parent.getBounds().getWidth() / 2.0)
                + parent.getNoodleRoot().getX()) / 2.0;
        final double center = parentCenter
                - node.getBounds().getWidth()
                        * (parent.getAST().getElementCount()) / 2.0; 
        return center + node.getOrigin().getX();
    }
    
    private Double justifyY(ASTNode node) {
        return node.getOrigin().getY();
    }
}

That was a bit much at once, I know. The central pane, called “framing”, is 640 by 480. Framing is offset in each direction by a 5% inset, via the convenient features of AnchorPane.

AnchorPane is one of the few prepared ways to control where a node is rendered, with precision, in JavaFX. You may often need to keep your own tabs on where it is rendered, as getMinX() and getMaxX() will return zero more often than you will believe. However, through direct layout control, you can still manage them.

The method addNode(…) adds a custom object called ASTNode. I’ll cite it for you here.

package oberlin.builder.gui;

import javafx.collections.FXCollections;
import javafx.collections.ObservableList;
import javafx.geometry.BoundingBox;
import javafx.geometry.Bounds;
import javafx.geometry.Point2D;
import javafx.geometry.Pos;
import javafx.scene.control.Label;
import javafx.scene.layout.StackPane;
import javafx.scene.layout.VBox;
import javafx.scene.text.TextAlignment;
import oberlin.builder.parser.ast.AST;

class ASTNode extends VBox {
    private Bounds bounds = new BoundingBox(0, 0, 100, 40);
    private Point2D origin = new Point2D(0, 0);
    private final double expanse = 1.10;
    
    private final AST ast;
    
    private Label type;
    private Label hash;
    
    private ObservableList<ASTNode> kids = FXCollections.observableArrayList();
    
    public ASTNode(AST ast) {
        this.ast = ast;
        type = new Label(ast.getClass().getSimpleName().toString());
        type.setTextAlignment(TextAlignment.CENTER);
        type.setAlignment(Pos.CENTER);
        
        hash = new Label(Long.toHexString(ast.hashCode()).toUpperCase());
        hash.setTextAlignment(TextAlignment.CENTER);
        hash.setAlignment(Pos.CENTER);
        
        VBox vbox = new VBox(new StackPane(type), new StackPane(hash));
        vbox.getStyleClass().add("node");
        vbox.setMinWidth(bounds.getWidth());
        vbox.setMinHeight(bounds.getHeight());
        this.getChildren().add(vbox);
        
        for(AST kid : ast.getContainedNodes()) {
            addKid(new ASTNode(kid));
        }
        
    }
    
    public Point2D getNoodleRoot() {
        return new Point2D(getOrigin().getX() + (getBounds().getWidth() / 2),
                getOrigin().getY() + getBounds().getHeight());
    }

    public ASTNode(AST ast, int level) {
        this(ast);
        
        origin = new Point2D(0, level * bounds.getHeight() * expanse);
    }
    
    public ASTNode(AST ast, int levelDown, int levelAcross) {
        this(ast);
        
        origin = new Point2D(getStepAcrossSize(levelAcross), getStepDownSize(levelDown));
    }
    
    public double getStepDownSize(int steps) {
        return steps * bounds.getHeight() * expanse;
    }
    
    public double getStepAcrossSize(int steps) {
        return steps * bounds.getWidth() * expanse;
    }
    public void addKid(ASTNode astNode) {
        this.kids.add(astNode);
    }
    
    public ObservableList<ASTNode> getKids() {
        return kids;
    }
    
    public Bounds getBounds() {
        return bounds;
    }
    
    public Point2D getOrigin() {
        return origin;
    }
    
    public Point2D getTopCenter() {
        return new Point2D(
                getOrigin().getX() + (getBounds().getWidth()/2),
                getOrigin().getY());
    }
    
    public Label getType() {
        return type;
    }
    
    public void setOrigin(Point2D p) {
        this.origin = p;
    }
    
    public AST getAST() {
        return this.ast;
    }
}

ASTNode is a JavaFX Node as well. It simply maintains a reference to the AST itself, and the general presentation of that AST on the tree. There isn’t a lot here. If you’re wondering what VBox is, it’s an abbreviation for “vertical box”. (Naming a class after an abbreviation is bad practice, but it’s long since done by powers above me; I tolerate it as much as I do “AST”.)

Speaking of bad practice, this would ideally actually use Bindings, but I wrote this in a bit of a rush today and will have to correct that in the future. It is also bad practice to repeat data, which is exactly what this program is doing by re-storing the label text in a separate field. All the same…

I’m going to gloss over a lot of the configuration of the labels, as it’s relatively standard. Know that like any other pane in JavaFX, a VBox can be initialized with a list of its bounded nodes; also, a StackPane has the default behavior of centering its own bounded nodes.

The last thing done in the constructor is the creation of additional ASTNodes for each child node of the abstract syntax tree.  Each of them, in turn, renders their own children. This is not perfect, there is a substantial chance that two lists of nodes will overlap one another; however, it is already excellent for debugging visitor pattern based content. In the end, the GUITree renders each node in an assigned place, with a curved cubic line (technically called a “noodle”) connecting it to its parent and its children.

How does it do that? With IntSuppliers.

3. The IntSuppliers

There are only two of these.

package oberlin.builder.gui;

import java.util.function.IntSupplier;

/**
 * For downward counts; always returns provided number.
 * 
 * @author © Michael Eric Oberlin Dec 23, 2014
 *
 */
class Marker implements IntSupplier {
    private int fix;
    
    public Marker(int fix) {
        this.fix = fix;
    }
    
    @Override
    public int getAsInt() {
        return fix;
    }
}


package oberlin.builder.gui;

import java.util.function.IntSupplier;

/**
 * For counts across; always returns next consecutive number.
 * 
 * @author © Michael Eric Oberlin Dec 23, 2014
 *
 */
class Counter implements IntSupplier {
    private int count;
    
    @Override
    public int getAsInt() {
        return count++;
    }
    
}

IntSuppliers (and really all Suppliers) are part of the java.util.function package, new to Java 8. The great advantage of this package is that a function, or any functional interface, allows you to specify a method that serves as a primitive with conditionally defined values. I know that’s a leap, but I’ve been doing it since long before it was formally adopted into the language and it’s a central totem of functional languages.

We could, in theory and practice, use incrementing and decrementing integers in place of either of these. The problem is that the code gets a lot longer and a lot more cluttered when you do. I prefer the sublime simplicity of packing such behavior into an interface.

Of course, these are not everything. There is one, final, issue.

4. What was that that you said about “CSS”?

The CSS is specific to JavaFX; a complete listing of all of the properties is available here. If you are unfamiliar with the syntax of CSS, you can find an excellent tutorial on it (for HTML, at least) at W3Schools. It isn’t as versatile as Java or C, but its creators pulled many of its properties from C-like languages.

tree.css:

.backing {
    -fx-background-color: lightyellow;
    -fx-insets: 0;
    -fx-padding: 15;
    -fx-spacing: 10;
}

.node {
    -fx-background-color: lightblue;
    -fx-background-radius: 5.0;
    -fx-border-color: black;
    -fx-border-radius: 5.0; 
}

Keep this in the same folder as GUIMain, and it will find it as written.

The CSS styling of JavaFX controls is capable of everything HTML 5 is and then some. It’s an excellent fusion of programming and markup. I encourage you to play with the layout of GUIMain’s scene, and the actual program fed to the builder.

 

Tags: , , , , , , , , , , , , , , , ,

Software Language Engineering: The Visitor Pattern

This may feel like a slight detour; but believe me, it’s a necessary one. If you are already fully familiar with the Visitor pattern, you are free to skip this section.

The Visitor pattern is hardly the only way to handle grammar parsing; but I’ve been trying to find a better one for a couple of weeks now, and without much success. Thus, I must recommend it.

The Visitor pattern functions as a way to operate on a structure of objects, like our Abstract Syntax Tree, while evading the encapsulation of the algorithm within those objects. It is an approach from the outside, it visits; thus, the name. Unlike most design patterns, you can actually implement the approach, for most purposes, in a single package. I’ll explain how to do than in the second part of this addition.

1. The Recursive Nature of the Visitor Pattern

First, a word on the unusual structure.

Every visitable object is referred to as an “element”; and every element has an inherent method which receives a single visitor object as its parameter. Additionally, every visitor object has a method which receives an element as its singular parameter.

This may sound rather circular, but the meat of the work is done by the visitor’s method. When visit is called, with a specific instance of the element as its parameter type, a window opens to perform specific operations on the available materials (fields and methods) in the element. This window does not require any specific change to the element’s code.

So when is this visit method called? Every time the element’s inherent visitor-receiving method is called. Is this roundabout?

No. The visitor-receiving method may be called again, on other, related, objects, by the visitor’s element-receiving method. You can exit the circle any time you would like.

For the sake of clarity, imagine that the element’s method is called receive(Visitor visitor), and the visitor’s method is called visit(Element element). I’ll give you a more tangible example.

1.1. The Visitor Pattern at Work

Suppose we have an object structure describing a bicycle. Each bicycle part is an extension of BicycleElement, which implements our accept method. The parts include a number of tires, two pedals, a handbrake, a chain, a seat, and a frame. Each of these elements is interchangeable and has a score of properties that belong to it alone.

Putting together such a class structure is trivial for a Java programmer; but what if we want to inventory the parts of a bike? Display their features on an output stream (for simplicity, System.out), as an example? At the same time, this analysis software must remain separate from the software describing the part, in the name of the open/closed principle and the single responsibility principle. At least, if you’re a principled programmer, with the least bit of concern for the next person to touch your code, you want to follow those.

The problem is solved through double-dispatch. Your analysis class, let’s call it Analyzer, might have a set of Visitors that expect to receive each category of BicycleElement. Once your CompleteBikeElement is prepared, Analyzer would call its accept method with a class of Visitor. The accept method would call the visit method with itself (and all properties exposed), and the visit method would display the properties of the BicycleElement.

Naturally, you might be wondering how you would get the properties of every BicycleElement when only the CompleteBikeElement was passed to the Visitor. That’s where receive comes in. CompleteBikeElement has total access to each of its parts, and can easily pass a visitor (maybe the same one) to each of them.

2. Implementing a Visitor Pattern in a single package

I’ve implemented this in the SLE git under oberlin.builder.visitor.

You only need three classes; Element, Visitor, and VisitHandler. In fact, you arguably only strictly need Element and Visitor. Their code is fairly straightforward, remembering that they are all abstract.

Element:

package oberlin.builder.visitor;

public interface Element {
    public void accept(Visitor visitor);
}

Visitor (feel free to be a little creative with this one):

package oberlin.builder.visitor;

import java.util.Map;

public interface Visitor {
    public default void visit(Element element) {
        getHandlerMap().get(element.getClass()).handle(element);
    }
    
    public Map<Class<? extends Element>, VisitHandler> getHandlerMap();
    
    public default void addVisitHandler(Class<? extends Element> elementClass, VisitHandler handler) {
        getHandlerMap().put(elementClass, handler);
    }
    
    public default VisitHandler getVisitHandler(Class<? extends Element> elementClass) {
        return getHandlerMap().get(elementClass);
    }
}

VisitHandler:

package oberlin.builder.visitor;

public abstract class VisitHandler {
    public abstract void handle(Element element);
}

Done. Now, as an aside to get the point across, let’s build that little bicycle project I was talking about.

2.1. Using the Visitor Package

Let’s start with our bike parts. For the sake of brevity, I’m leaving out package and import statements as they’re fairly self-explanatory.

public class ChainElement implements Element {

    @Override
    public void accept(Visitor visitor) {
        visitor.visit(this);
    }

}

public class FrameElement implements Element {

    @Override
    public void accept(Visitor visitor) {
        visitor.visit(this);
    }

}

public class HandlebarElement implements Element {

    @Override
    public void accept(Visitor visitor) {
        visitor.visit(this);
    }

}

public class PedalElement implements Element {

    @Override
    public void accept(Visitor visitor) {
        visitor.visit(this);
    }

}

public class SeatElement implements Element {

    @Override
    public void accept(Visitor visitor) {
        visitor.visit(this);
    }

}

public class TireElement implements Element {

    @Override
    public void accept(Visitor visitor) {
        visitor.visit(this);
    }

}

They could be anything, of course; in this instance, we’re leaving the actual work to Visitor, and how it gets there to Element. It might even be wise to make it a Java 8 Default Method.

Lastly, the complete bike:

package bicycle;

import oberlin.builder.visitor.Element;
import oberlin.builder.visitor.Visitor;

import java.util.*;

public class CompleteBikeElement implements Element {
    
    private FrameElement frame = new FrameElement();
    private SeatElement seat = new SeatElement();
    private PedalElement leftPedal = new PedalElement();
    private PedalElement rightPedal = new PedalElement();
    private TireElement frontTire = new TireElement();
    private TireElement backTire = new TireElement();
    private HandlebarElement handle = new HandlebarElement();
    private ChainElement chain = new ChainElement();
    
    @Override
    public void accept(Visitor visitor) {
        frame.accept(visitor);
        seat.accept(visitor);
        leftPedal.accept(visitor);
        rightPedal.accept(visitor);
        frontTire.accept(visitor);
        backTire.accept(visitor);
        handle.accept(visitor);
        chain.accept(visitor);
    }

}

So, what does our visitor look like?

package bicycle;

import java.util.HashMap;
import java.util.Map;

import oberlin.builder.visitor.Element;
import oberlin.builder.visitor.VisitHandler;
import oberlin.builder.visitor.Visitor;

public class BikeVisitor implements Visitor {
    @SuppressWarnings("serial")
    Map<Class<? extends Element>, VisitHandler> map = new HashMap<Class<? extends Element>, VisitHandler>();
    {
        map.put(TireElement.class, new VisitHandler() {
            public void handle(Element element) {
                System.out.println("[Tire qualities]");
            }
        });
        map.put(FrameElement.class, new VisitHandler() {
            public void handle(Element element) {
                System.out.println("[Frame properties]");
            }
        });
        map.put(ChainElement.class, new VisitHandler() {
            public void handle(Element element) {
                System.out.println("[Chain manufacturing data]");
            }
        });
        map.put(SeatElement.class, new VisitHandler() {
            public void handle(Element element) {
                System.out.println("[Seat qualities]");
            }
        });
        map.put(PedalElement.class, new VisitHandler() {
            public void handle(Element element) {
                System.out.println("[Pedal qualities]");
            }
        });
        map.put(HandlebarElement.class, new VisitHandler() {
            public void handle(Element element) {
                System.out.println("[Handlebar qualities]");
            };
        });
    }

    @Override
    public Map<Class<? extends Element>, VisitHandler> getHandlerMap() {
        return map;
    }
}

And lastly, our analysis program, which doesn’t even need to be in the same package. (Well, it is, but it doesn’t have to be.)

package bicycle;

public class Analysis {

    public static void main(String[] args) {
        CompleteBikeElement bike = new CompleteBikeElement();
        bike.accept(new BikeVisitor());
    }

}

As you can see, the analysis itself doesn’t have to do much. In the past, a different visitor was implemented for each element, which is still occasionally useful. In this instance, to cut down on the number of classes, I simply cross-reference the class of the element on a map, and retrieve the functional interface with the material to operate on that specific type of element. It’s better practice than using instanceof, not because instanceof is slow, but because it’s usually a red light that you’re overlooking a feature of object-oriented languages. This is also referred to as a “bad code smell“. (Map may also be slightly faster, working it out in my head; but I haven’t done any benchmarking.)

The result?

[Frame properties]
[Seat qualities]
[Pedal qualities]
[Pedal qualities]
[Tire qualities]
[Tire qualities]
[Handlebar qualities]
[Chain manufacturing data]

Of course, any element could have broken itself down further, and may continue to. This is a very flexible pattern, and understanding double-dispatching can lead to some very efficient architectures. When you understand how it works, you’re ready to continue!

A Final Note

Speaking of bad code smell, it is also worth noting that double-dispatching is never to be used unless it is truly necessary. Extensive double-dispatching (or triple-dispatching, or—no. It hurts to think about.) can be very bad code smell in itself. If you’re even a little speculative, you may have noticed that excessively open Visitor patterns are asking for trouble.

Uncontrolled dynamic dispatching is like having your fingers tangled in string, or trying to untie a massive wad of Christmas lights before the holiday because someone put them in a box instead of twist-tying them properly. Dynamic dispatching is the box; used improperly, it promises that a year later, you will be dealing with a massive wad of tied up Labyrinthian code, and you’ll have zero fun untying it. BECAUSE REALLY, WHY DID YOU PUT THE CHRISTMAS LIGHTS IN AN EFFING BOX, MOM!?!?

 
 

Tags: , , , , , , , , , ,

Software Language Engineering: Establishing a Parser, Part One (Early Edition, Refurbished)

The Introductory Ramble

I suppose I’ve been building up to this thing for quite a while now.

Note that this is part one of the Parser. Were this a singular project, I might be long-since finished with it; tragically I’m not that much of a jerk and I’m trying to write a useful tutorial too. This means I need to get it right, or as right as possible, the first time; so please pardon delays in bringing it to you. I’m also avoiding items like JLex so that I can convey exactly how a build tool is meant to be structured, for the sake of education.

You see, back in the days when it was a responsible thing to call oneself, I used to be a journalist. Specifically, I was a field journalist and interviewer for something called a “newspaper”, which was an antiquated piece of low-grade paper with news printed on it. It was a strange and inconvenient system, but it was what we had before the internet. Anyway, both language and ethics are important to me out of habit, which is also perhaps why I’m not a news journalist anymore.

As a small foreword, there are a lot of things which we must keep in mind while developing a general-purpose parser, as the operations of a Parser on a List<AST> are a bit of a juggle. In spite of the “One Class, One Responsibility” principle, the environment that a Parser is typically injected into requires a great deal of attention. As such, I will be splitting this section into two parts, with a blurb in the middle on the visitor pattern (which you are welcome to skip if you are already familiar with it), and a blurb on using JavaFX to render graphical ASTs. The second piece will be a completion of the parser design. (Point being, once you start reading, expect to be reading for quite a bit.)

The Parser

To begin, we have a List<AST> full of all Terminals in the program, in order. NonTerminals are where a lot of the magic happens, and each of them contains several nodes which may be either Terminals, or further NonTerminals. The biggest concern with any translator is efficiency in building this tree.

I have attempted several methods in the construction of this tree. One of them involved a repurposing of java.util.regex, but I recommend against it now that it has been practiced. There are too many temptations to use forward references and a number of other tools, and they can have serious caveats. As much as I usually praise regexes for their speed, machine generated ones are only usually fast. Depending on the structure of the code to manage the structure of the regular expression is an invitation to disaster, by which I mean, consumer remorse.

In the end, the visitor pattern won out; but if you would like to attempt it in another way, I encourage it. I’ll be documenting the visitor pattern here, as it assures a single pass through the target code.

Before I go further, let’s begin with the source code to parser.

import java.util.List;

import oberlin.builder.parser.ast.AST;
import oberlin.builder.parser.ast.EOT;
import oberlin.builder.visitor.Visitor;

/**
 * @author © Michael Eric Oberlin Dec 15, 2014
 *
 * @param <V> the visitor type that the parser
 * uses for creating nonterminal nodes
 * @param <P> the target class for the parsing, intended
 * to be the root of the produced syntax tree
 */
public abstract class Parser<V extends Visitor> {
    
    private final List<AST> astList;
    private ErrorReporter reporter = new ErrorReporter();
    private AST currentToken;
    private SourcePosition currentTokenPosition = new SourcePosition();
    private SourcePosition previousTokenPosition = new SourcePosition();
    protected V visitor;
    
    public Parser(V visitor, List<AST> astList, ErrorReporter reporter) {
        this.visitor = visitor;
        
        //Do a little defensive programming
        if(astList.isEmpty())
            throw new RuntimeException("AST list cannot begin at zero size");
        
        //Scan for a specific reporter, or use the default error reporter. Of
        //course error reporter can't really be null.
        if(reporter != null)
            this.reporter = reporter;
        this.astList = astList;
        
        this.currentToken = astList.get(0);
    }
    
    /**
     * Checks whether the current node is of the expected type; if so,
     * increments the token; otherwise, throws a syntactic error.
     * 
     * @param astExpected the currently anticipated node type in the list
     */
    public void accept(Class<? extends AST> astExpected) {
        if(astExpected.isAssignableFrom(currentToken.getClass())) {
            forceAccept();
        } else {
            reporter.error(new SyntaxException("Expected " +
                astExpected + ", got " + currentToken + " instead; "));
        }
    }
    
    public void forceAccept() {
        previousTokenPosition = currentTokenPosition;
        currentTokenPosition = currentTokenPosition.increment();
        try {
            currentToken = astList.get(currentTokenPosition.getStart());
        } catch(IndexOutOfBoundsException ex) {
            currentToken = new EOT();    //end of tree
        }
    }
    
    /**
     * Records the position of the beginning of a phrase.
     * This is the position of first constituent AST.
     * @param position element to record the begin index into.
     */
    public void start(SourcePosition position) {
        position.setStart(currentTokenPosition.getStart());
    }
    
    /**
     * Finish records the position of the end of a phrase.
     * This is the position of the last constituent AST.
     * @param position element to record the end index into.
     */
    public void finish(SourcePosition position) {
        position.setFinish(currentTokenPosition.getFinish());
    }
    
    /** utility method for reporting syntax errors */
    public void syntacticError(String messageTemplate,
        String tokenQuoted) {
        SourcePosition pos = currentTokenPosition;
        reporter.error(new SyntaxException(
                tokenQuoted + " " + messageTemplate + ": " +
                pos.getStart() + ".." + pos.getFinish()));
    }
    
    /**
     * Begin parsing, aiming to create the provided class
     * as a root class for the abstract syntax tree.
     * 
     * @param rootClass Class of object which should, provided
     * no exceptions, be a tree root.
     * @return complete tree, stemming from class rootClass,
     * expressing program.
     */
    public AST parse(Class<? extends AST> rootClass) {
        return visitor.visit(rootClass, this, currentTokenPosition);
    }

    public SourcePosition getPreviousTokenPosition() {
        return this.previousTokenPosition;
    }
    
    public SourcePosition getCurrentTokenPosition() {
        return this.currentTokenPosition;
    }

    public AST getCurrentToken() {
        return currentToken;
    }
    
    public V getVisitor() {
        return visitor;
    }
    
    public ErrorReporter getErrorReporter() {
        return reporter;
    }

}

It all begins with a call to parse(…), passing in the class that represents a completed program. Parse enters the visitor with a reference to the type of analyzer needed; that to deliver a properly formatted “rootClass”. I’m going to abstain, for now, on explaining how the visitor works; that’s covered in another section. For the moment, consider the variant utility methods in Parser.

It contains the original list of ASTs, presumably all Terminals; and a number of pointer fields. The List<AST> is technically outside data, passed in to form a tree from. In the class’s current state, it is very important to remember that it cannot be used on more than one chunk of code at once.

An Algebraic Parser

The intention, just as with a scanner, is to make life as simple for the end programmer as possible. So, AlgebraicParser isn’t that big a deal to implement.

package oberlin.algebra.builder.parser;

import oberlin.builder.parser.Parser;

public class AlgebraicParser extends
        Parser<AlgebraicPhraseStructure> {

    @Override
    public Class<AlgebraicPhraseStructure> getPhraseStructure() {
        return AlgebraicPhraseStructure.class;
    }

}

Most of the work is done by AlgebraicPhraseStructure, an extension of an interface I’ve named PhraseStructure, which extends Visitor. Accordingly, I’ll be addressing it in full in part two. The take-away is that PhraseStructure encapsulates all of the constraints on how a command is properly formed.

Well, it forms no critique of ethics, but it at least worries about it’s readability.

The Take-Away

One, but hardly the only, major confusion in Parser construction is the encapsulation of the act of parsing, versus the encapsulation of the rules of parsing. They are, ultimately, two different things requiring two different programs (or in Java’s language, classes).

Parser returns a singular AST, which is formed from a number of others. Each of the others is formed for a list of still more unique nodes. We have a specific advantage provided to us here, that being the enforced uniqueness of a Java object. Once the tree is completed, remaining operations on it, such as translating it to another kind of tree (varying in language), are fascinating but relatively trivial.

Next, read about the visitor pattern, and consider reading about the GUI interface built for this; then we’ll discuss the workhorse of the scanning and parsing utility—the phrase structure.

 
 

Tags: , , , , , ,

Software Language Engineering: And Again We’re Building a Scanner Again (Early Edition)

Remember how in part one, I said that there was one final change to make to the Scanner? Here it is. Now that you know about BNF, ASTs, Nonterminals, and Terminals, I can comfortably show you my completed Scanner.

The Big Difference

The difference here is that this Scanner will not return a List<String>, so much as a List<AST> full of Terminals. Each Terminal will contain one token, that is, the String that used to be in its place. It really is a minor structural change, but it is an important one.

The only other option is to scan through the list of Strings again, match each of them, and produce a List<AST>; and that’s just inefficient. We should not need to iterate through the list twice when we can get the job done with a single pass. I have rules about efficient code, and so should you.

The Scanner (for real this time)

package oberlin.builder.scanner;

import java.util.*;
import java.util.function.*;
import java.util.logging.*;

import oberlin.builder.*;
import oberlin.builder.parser.ast.*;

/**
 * New design for scanners. Requires an enumeration of type
 * TerminalSpelling to iterate over, and a code string.
 * Returns a list of tokens.
 * 
 * @author © Michael Eric Oberlin Nov 2, 2014
 *
 */
public interface Scanner<E extends Enum<E> & TerminalSpelling> extends Function<AST, List<AST>> {
    @Override
    public default List<AST> apply(AST code) {
        Logger logger = Logger.getLogger("Scanner");
        
        //Start with a list of tokens, with the singular pre-token of the bulk of code
        List<AST> tokens = new LinkedList<>();
        tokens.add(code);
        
        //For each item found in the code, parse it, and replace the contents of tokens
        //with the parsed contents
        for(int index = 0; index < tokens.size(); index++) {
            //maintain for check
            AST origToken = tokens.get(index);
            
            List<AST> newTokens = new ArrayList<>();
            for(TerminalSpelling terminalSpelling : getSpelling().getEnumConstants()) {
                
                try {
                    newTokens.addAll(terminalSpelling.matchToken(tokens.get(index)));
                    
                    replaceItemWithCollection(tokens, index, newTokens);
                    break;
                } catch (MismatchException e) {
                    //didn't match, so continue to the next item
                    continue;
                }
            }
            
            //Final defensive check: if one token was received, and one token provided, and
            //the old is not the same as the new, then the only thing that happened was the
            //removal of irrelevant data. Thus, index should be decremented so that this new
            //item may be scanned again.
            if(newTokens.size() == 1 && !newTokens.get(0).equals(origToken)) index--;
        }
        
        //This algorithm will always terminate the token list with an empty token, the one item
        //which cannot have semantic value. So, remove it here.
        tokens.remove(tokens.size() - 1);
        
        return tokens;
    }
    
    /**
     * Internally used method which replaces an entry in a provided list with a collection.
     * 
     * @param list list to be altered
     * @param entry numeric index of replaced entry
     * @param replacement item to insert in place of current data
     */
    static <E> void replaceItemWithCollection(List<E> list, int entry, Collection<E> replacement) {
        list.remove(entry);
        list.addAll(entry, replacement);
    }
    
    /**
     * 
     * @return TerminalSpelling allocated to token recognition.
     */
    public Class<E> getSpelling();
}

So that’s your Scanner, almost exactly the same, but now it uses an enumeration of a type called TerminalSpelling. Its structure might also look a little familiar.

package oberlin.builder;

import java.util.LinkedList;
import java.util.List;
import java.util.regex.*;

import oberlin.builder.parser.ast.*;

public interface TerminalSpelling {
    
    public default List<AST> matchToken(AST ast) throws MismatchException {
        List<AST> returnable = new LinkedList<>();
        
        Pattern pattern = getPattern();
        Matcher matcher = pattern.matcher(ast.toString());
        if(matcher.find()) {
            returnable.addAll(manageToken(matcher));
            returnable.add(new Terminal(ast.toString().substring(matcher.end())));
            return returnable;
        } else throw new MismatchException("String \"" + ast +
            "\" does not match grammar pattern + \"" + pattern + "\"");
    }
    
    public Pattern getPattern();
    
    /**
     * Determines, often through enumeration self-inspection, what should be done
     * with a passed token.
     * 
     * @param token the original token removed from the complete String by matchToken
     * @return appropriate value given the circumstances and implementation of TerminalSpelling.
     */
    public List<AST> manageToken(Matcher matcher);
}

And how does this new structure related to our pilot programming language, the Algebraic Simplifier?

package oberlin.algebra.builder.scanner;

import java.util.List;

import oberlin.algebra.builder.AlgebraicSpelling;
import oberlin.builder.scanner.Scanner;

public class AlgebraicScanner implements Scanner<AlgebraicSpelling> {
    
    @Override
    public Class<AlgebraicSpelling> getSpelling() {
        return AlgebraicSpelling.class;
    }

}

Cake.

As for the TerminalSpelling enumeration?

package oberlin.algebra.builder;

import java.util.LinkedList;
import java.util.List;
import java.util.logging.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import oberlin.builder.TerminalSpelling;
import oberlin.builder.TerminalSpellingHandler;
import oberlin.builder.parser.ast.AST;
import oberlin.algebra.builder.nodes.*;

/*
 * Major changes:
 * For starters, if a chunk of the string matches a pattern, then that group
 * needs to be returned as the spelling of a token.
 * <Specific Atomic Type> ← Token ← Terminal ← AST
 * 
 * All tokens should contain a reference to their regular expression/grammar.
 * 
 */
public enum AlgebraicSpelling implements TerminalSpelling {
        //COMMENTS
        WHITESPACE(Pattern.compile("^\\s+"), GrammarType.COMMENT, new TerminalSpellingHandler<Whitespace>(){

            @Override
            public Whitespace getTerminal(String spelling) {
                return new Whitespace(spelling);
            }
            
        }),
        BLOCK_COMMENT(Pattern.compile("^/\\*.*?\\*/"), GrammarType.COMMENT, new TerminalSpellingHandler<BlockComment>(){

            @Override
            public BlockComment getTerminal(String spelling) {
                return new BlockComment(spelling);
            }
            
        }),
        LINE_COMMENT(Pattern.compile("^//.*+$"), GrammarType.COMMENT, new TerminalSpellingHandler<LineComment>(){

            @Override
            public LineComment getTerminal(String spelling) {
                return new LineComment(spelling);
            }
            
        }),
        
        //VALIDS
        NOMINAL(Pattern.compile("^[\\D&&\\w]\\w+"), GrammarType.KEEP, new TerminalSpellingHandler<Nominal>(){

            @Override
            public Nominal getTerminal(String spelling) {
                return new Nominal(spelling);
            }
            
        }),
        NUMERIC(Pattern.compile("^\\d+"), GrammarType.KEEP, new TerminalSpellingHandler<Numeric>(){

            @Override
            public Numeric getTerminal(String spelling) {
                return new Numeric(spelling);
            }
            
        }),
        OPERATOR(Pattern.compile("^[+-/\\\\÷\\*×\\^]"), GrammarType.KEEP, new TerminalSpellingHandler<Operator>(){

            @Override
            public Operator getTerminal(String spelling) {
                return new Operator(spelling);
            }
            
        }),
        EQUATOR(Pattern.compile("^!?=?[=><]"), GrammarType.KEEP, new TerminalSpellingHandler<Equator>(){

            @Override
            public Equator getTerminal(String spelling) {
                return new Equator(spelling);
            }
            
        }),
        
        //DELIMITERS
        LPAREN(Pattern.compile("^\\("), GrammarType.KEEP, new TerminalSpellingHandler<LParen>(){

            @Override
            public LParen getTerminal(String spelling) {
                return new LParen(spelling);
            }
            
        }),
        RPAREN(Pattern.compile("^\\)"), GrammarType.KEEP, new TerminalSpellingHandler<RParen>(){

            @Override
            public RParen getTerminal(String spelling) {
                return new RParen(spelling);
            }
            
        })
    ;
    
    //PRIVATE FIELDS
    private Logger logger = Logger.getLogger("AlgebraicSpelling");
    
    //ENUMERATIONS
    private enum GrammarType {
        KEEP,
        COMMENT;
    }
    
    //FIELDS
    private final Pattern pattern;
    private final GrammarType type;
    private final TerminalSpellingHandler<?> handler;
    
    //CONSTRUCTORS
    /**
     * The two underlying details for any syntax pattern are its regular expression, and its semantic meaning.
     * The pattern is simply a compiled regular expression for matching the item. The type is a clue to its
     * semantic meaning, which is checked in a switch-case statement when called.
     * 
     * @param pattern compiled regular expression identifying the token
     * @param type clue as to what should be done with the token once found
     */
    private AlgebraicSpelling(Pattern pattern, GrammarType type, TerminalSpellingHandler<?> handler) {
        this.pattern = pattern;
        this.type = type;
        this.handler = handler;
    }
    
    //GETTERS/SETTERS
    @Override
    public Pattern getPattern() {
        return pattern;
    }

    @Override
    public List<AST> manageToken(Matcher matcher) {
        List<AST> ret = new LinkedList<>();
        switch(this.type) {
        case KEEP:
            ret.add(handler.getTerminal(matcher.group()));
            break;
        case COMMENT:
            //Just ignore it
            break;
        }
        return ret;
    }

}

A bit longer, but as you can see, still very simple. Before I address the minor details reserved for a language implementation, allow me to give you the trivial definition of TerminalSpellingHandler:

package oberlin.builder;

public interface TerminalSpellingHandler<E extends Terminal> {
    public E getTerminal(String spelling);
}

Barely even worth a class, really. It could easily just be a java.util.function.Function<String, E extends Terminal>. It remains a candidate for expansion, should you need it; and unlike Function, can easily be altered to throw a non-runtime exception. (Should you find yourself needing to throw such a thing.)

So, by my recommended structure, the enumerations contain three simple things. Specific java.util.regex.Patterns describing how they appear (always beginning with the metacharacter “^” to denote the beginning of the String), an enumeration describing what to do with the token once it’s found, and a functional interface that acquires a Terminal from the token. I will concede that some of these items are candidates for defaults in the interface, but as you know, not all languages are structured the same and for the moment, I leave this to you.

Our End Result

Ultimately, we have a class that returns a List<AST> full of Terminals, which is exactly what our Parser works with. This allows for redundant calls and stripped levels of processing that allow for a much cleaner Θ-notation.

Additionally, the class captures almost every recurrent portion of Scanner that I can think of, and is easily and quickly extended for whatever your target language might be. This is the greater service of a properly designed class—it takes work and stress off of the programmer. It is likewise going to make the next step much easier.

Now we build us a Parser.

 
 

Tags: , , , , , , , , ,

Software Language Engineering: Terminals, Nonterminals, and Bears (Early Edition)

The Notion of a Terminal AST

So, in the last chapter, I explained what an AST was structurally. There are formally two kinds of extensions to it. I usually implement them in their own classes, extending the base class of AST.

The first is the terminal. If you’ve programmed in Java for even a month, you know that having a method which accepts two different kinds of unrelated classes in the stead of one another is a bad idea for all kinds of reasons.

It is, actually, possible.

That just doesn’t mean that you should do it.

ASTs are formally called trees, but what they are is nodes on a tree. A program is a single AST, with an extension typically called “Program” as the overarching root node. The branch nodes, or nodes with child nodes, are called non-terminal nodes; the end points are called terminal nodes.

Each of those tokens that your Scanner kicked out? Yeah, that’s a terminal node, disguised as a String.

Let me offer you some of my code for ASTs, Terminals, and NonTerminals. (As before, there are major issues that I’m leaving out until later on. See if you can catch them.)

package oberlin.builder.parser.ast;

import oberlin.builder.codegenerator.RuntimeEntity;
import oberlin.builder.visitor.*;

/**
 * Abstract Syntax Tree, capable of representing any sequence of 
 * statements or the entire program.
 * 
 * @author © Michael Eric Oberlin Nov 3, 2014
 *
 */
public interface AST {
    /**
     * @return number of sub-elements contained in this tree node.
     */
    public int getElementCount();
}
package oberlin.builder;

import oberlin.builder.parser.ast.AST;

/**
 * Basis of all complete abstract syntax trees. Terminals are basically isolated-tokens known only by their spellings.
 * 
 * @author © Michael Eric Oberlin Nov 5, 2014
 *
 */
public class Terminal implements AST {
    private final String spelling;
    
    public Terminal(String spelling) {
        this.spelling = spelling;
    }
    
    public final String getSpelling() {
        return this.spelling;
    }

    @Override
    public String toString() {
        return getSpelling();
    }

    @Override
    public final int getElementCount() {
        return 1;
    }
}

Let those soak in a little. Note that each Terminal has a length of one, meaning that it is the only member of its immediate tree. That will be important when we develop our Parser.

A terminal is an instance of an AST, and can be created by simply passing its token to it. The token is stored in the field “spelling”. Terminal is also fully extensible, as even though their token is consistently their only members, there is a significant difference between an equals operator, a binary operator, and numerical data; and nonterminal nodes take that difference quite seriously.

The Notion of a Nonterminal AST

A nonterminal AST is an AST built not from characters in a String, but from a sequence of other ASTs. The constituent ASTs can be terminals, or nonterminals. Remember BNF? Ever listed item before may-consist-of (::=) was a nonterminal. In instances of single-member representation, such as:

Noun ::= ProperNoun

“Noun” is still a nonterminal, as the implication is that (at least in theory) multiple types of items can be its constituents.

The “Program” node is of course always a nonterminal. I’ve written a nice slice of code for them, too.

package oberlin.builder;

import java.util.*;

import oberlin.builder.parser.ast.AST;

public abstract class NonTerminal implements AST {
    private final List<AST> astList;
    
    public NonTerminal(AST... astList) throws MismatchException {
        if(!checkTypes(astList)) throw new MismatchException("Nonterminal class " + this.getClass() + " does not match " +
                "expression.");
        
        List<AST> list = new ArrayList();
        for(AST ast : astList) {
            list.add(ast);
        }
        this.astList = list; 
    }
    
    public NonTerminal(List<AST> astList) throws MismatchException {
        try {
            this.astList = resolveTypes(astList);
        } catch(BuilderException ex) {
            throw new MismatchException(ex);
        }
    }
    
    public abstract List<Class<? extends AST>> getExpectedASTTypes();
    
    /**
     * Check to see that all provided ASTs are some extension of the expected class of AST,
     * and create the internal list of ASTs from it if possible.
     * 
     * @param astList current list of program ASTs 
     * @return true if the first ASTs match the expected ones, false otherwise
     */
    private List<AST> resolveTypes(List<AST> astList) throws BuilderException {
        List<AST> ownASTs = new ArrayList<>();
        List<Class<? extends AST>> astTypes = getExpectedASTTypes();
        
        for(int i = 0; i < astTypes.size(); i++) {
            Class<? extends AST> provided, expected;
            provided = astList.get(i).getClass();
            expected = astTypes.get(i);
            if(!expected.isAssignableFrom(provided)) {
                throw new BuilderException("Cannot get " + expected + " from " + provided.getClass());
            }
            ownASTs.add(astList.get(i));
        }
        return ownASTs;
    }
    
    /**
     * Check to see that all provided ASTs are some extension of the expected class of AST.
     * 
     * @param astList current array of program ASTs 
     * @return true if the first ASTs match the expected ones, false otherwise
     */
    private boolean checkTypes(AST... astList) {
        List<Class<? extends AST>> astTypes = getExpectedASTTypes();
        
        for(int i = 0; i < astList.length; i++) {
            Class<? extends AST> provided, expected;
            provided = astList[i].getClass();
            expected = astTypes.get(i);
            if(!provided.equals(expected)) {
                return false;
            }
        }
        return true;
    }
    
    @Override
    public int getElementCount() {
        return astList.size();
    }
}

This is where the critical detail is left out of my code. If you already see it, then congratulations! In any case, I’ll let you all in on it at the end of the chapter. It goes back to the advantage of regular expressions over manually checking lists.

In “Programming Language Processors in Java”, David A. Watt and Deryck F. Brown use a visitor pattern to scan for these. That’s a perfectly valid approach. I get the same advantages through my constructor-or-exception pattern, which you have seen before. In fact, if you’re careful, you may notice that it’s another format of the same general pattern; without the traditional Visitor and Element classes. In my github repo, you might notice that I have the outline of a Visitor pattern implemented on these. Pay it no mind, it is proving unnecessary.

Still a good idea. Just unnecessary.

What Do They Mean to a Parser?

A Parser’s job is thus. Iterate through a list of Terminal syntax tree nodes. Compare the beginning of the list to established members of a list of nonterminal nodes. Find a match.

Truncate the matched beginning of the list of ASTs; this is, remove every element that is a member of the new NonTerminal. Now that they are removed, append that NonTerminal to the beginning of the list, in their place. Repeat this process until the list of ASTs is of size one, and return that singular AST.

And of course, if the program cannot be parsed down to a singular AST, throw a Parser Error.

Caveats

These are meant to be identified through members of an enumeration; just as tokens were. However, you might remember BNF items such as:

Command ::= Command+

That is, a variable number of Command nodes can be condensed into the immediate tree of a single Command node. This is an issue for a simple list of ASTs, as in its provided implementation, NonTerminal simply attempts to match a fixed sequence of AST types to a provided list of ASTs, one by one. This is almost always inadequate for a programming language.

I’m going to step around writing a whole new series on how regular expressions work. (For now.) The important detail is that a minimum and maximum number of items, extending a specific AST class, need to be matched. I was tempted to call these “Clauses”, but that just barely relates to the actual definition of “clause”; so instead, we’ll borrow from regular expression terminology and call them Groups.

The down side? I’m still implementing Groups. They will have their own chapter though.

Now, we learn how to build a Scanner. (Again. Sort of.)

 

Tags: , , , , , , , , , , , , ,