The truth is, there are very few things that I can write an original tutorial on when it comes to software, and I have no intention of doing anyone the disservice of insisting on my own tutorial when someone else may have done a perfectly good job, and likely a superior one. But there are some things, things that not every programmer does, for which there are excruciatingly few tutorials on the internet.
If there’s one thing that is eternally true about the internet, it’s that whatever the topic, there are always at least a few people browsing for it. This is for you. This whole article chain is going to be about writing programming languages, be they scripted or natively compiled, and most specifically it will be about writing compilers for them.
[For those interested in downloading my source code for the tutorials (which are being written and tested as I create this), you may do so via the GitHub repository.]
There is, pleasantly, an MIT course available to the public on the subject. Unfortunately, by the nature of a sequence of lectures, it is likely to take you much more time than reading through these articles, and time is not a luxury most of us have. All the same, I do encourage you to at least browse through it sometime; the video is fairly low-resolution, but the material is great. As mentioned in it, there are two programs commonly used for generating the scanner and the parser: JLex and CUP, respectively. I will not be using them, but if you’re in a pinch, you might consider them.
I will be using Java 8, as it has a lot of wonderful features for scanning through collections (even vast, nay, gargantuan ones), and pretty good regular expression tools. If you’re concerned with the double-escaping of regular expressions that Java string literals force on you, I recommend using this regex parser, as it allows you to work with singly-escaped regular expressions and converts them to a Java-comprehensible string. Few of us have a natural knack for regular expression syntax, so a testing environment comes in handy. There are a number of other good ones out there, and you’re of course also welcome to produce your own kit if the fancy strikes you.
If you haven’t caught on, regular expressions are invaluable for this stuff.
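To make the double-escaping point concrete, here is a minimal, self-contained sketch (the class name and sample input are my own illustration, not part of any tutorial code): the regex \d+ has to be written as "\\d+" in Java source, because the string literal consumes one level of backslashes before the regex engine ever sees the pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexEscapeDemo {
    public static void main(String[] args) {
        // The regex \d+ (one or more digits) is written "\\d+" in source:
        // the string literal eats one backslash; the engine receives \d+.
        Pattern digits = Pattern.compile("\\d+");
        Matcher m = digits.matcher("token42 token7");
        while (m.find()) {
            System.out.println(m.group()); // prints 42, then 7
        }
    }
}
```

A singly-escaped testing tool spares you from mentally doubling every backslash while you are still iterating on the pattern itself.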
Every build process has a number of standard components, and which of those components are present describes the nature of the build. They share a few things in common, which thankfully can be modularized into their own abstract classes. I’ll be showing you my style as we work through this.
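As a taste of what I mean by modularizing the common parts, here is a rough sketch, with entirely hypothetical names (Stage, scan, count), of build components sharing one abstract base class: each stage turns one representation of the program into the next, and stages compose into a pipeline.

```java
// A hypothetical base class for build-process components: each stage
// consumes one representation of the program and produces the next.
abstract class Stage<I, O> {
    abstract O run(I input);

    // Stages compose: the output of this stage feeds the next one.
    <N> Stage<I, N> then(Stage<O, N> next) {
        Stage<I, O> self = this;
        return new Stage<I, N>() {
            @Override
            N run(I input) { return next.run(self.run(input)); }
        };
    }
}

public class PipelineDemo {
    public static void main(String[] args) {
        // A toy "scanner": splits source text on whitespace.
        Stage<String, String[]> scan = new Stage<String, String[]>() {
            @Override
            String[] run(String source) { return source.trim().split("\\s+"); }
        };
        // A toy follow-up stage: counts the tokens produced.
        Stage<String[], Integer> count = new Stage<String[], Integer>() {
            @Override
            Integer run(String[] tokens) { return tokens.length; }
        };
        System.out.println(scan.then(count).run("let x = 1 + 2")); // prints 6
    }
}
```

The components we actually build will be considerably richer than this, but the shape, a chain of transformations over the source text, is the same.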
Given that you’re even bothering to read the preface (it’s certainly no insult if you don’t), you may be expecting a little history on compilers. The first one was built in 1952 and simply denoted “A-0”, for “Arithmetic Language Version Zero”. It was created by the late Rear Admiral Grace Hopper of the United States Navy. This is actually a long stride after the first programmable machine, a loom built by Joseph Marie Jacquard and introduced to the world in 1801. Programming, before the compiler, consisted basically of keying in raw machine code, and the notion of “coding” was, well, not really a notion at all. If you are a hardcore programmer (and I’m assuming you are), then memorize her name; we lost her in 1992, but the face of computing has forever been changed by her work.
You might also memorize Jacquard, as he was basically treated like Doctor Frankenstein by the local villagers and had the barn in which he kept his programmable looms burnt down by a bunch of Luddites, while he was off at war fighting for his nation. It didn’t work, of course. Torches and pitchforks aside, he still, also, changed the world, and I’m not just talking about the garment industry.
A-0 went through a number of phases, inclusive of A-0, A-1, A-2, A-3 (also known as ARITH-MATIC), AT-3 (also known as MATH-MATIC), and finally B-0 (FLOW-MATIC). I’m not particularly sure where this chain ended, but I am fairly certain that it did end, and I am quite confident that I’ve never seen a word of any of them in my life. The most bronze-age language still in reasonably common use, C, derives from B, which was created from BCPL (properly the “Basic Combined Programming Language”, since backronymed “Before C Programming Language”), which was created from ALGOL 60 (which came from ALGOL 58), which came from FORTRAN, which came from a loose collection of macros known as Speedcode, which IBM developed in 1953. (Note that that’s still a full year after Rear Admiral Hopper immaculately created the first compiler without any other compiler to build it with. Let that sink in.) C itself has been through three major standards, C89, C99, and most recently C11, but they’re pretty much the same game with keywords and native structures made more appropriate to their time. C led to C++, C++ led to Java, Ruby and Python came out of pretty much nowhere, and then there’s all these other guys over there, and quite a few domain-specific little guys, and more languages than any of us will ever personally be fluent in. (I don’t care what your resumé says.)
They all have a few things in common, though. As an example, they are textual. (Don’t let this imply to you that there’s no such thing as a non-textual language, though; LabVIEW and even HyperCard made quite a few advances in that direction.) They’re all stored in data structures, which are usually what we would think of as files. They’re all, most importantly, read and translated by another program. We will be building that program, and if you keep to a certain code of ethics in your construction, you will find it to be a fairly simple process.
Believe it or not, it’s a bit of a tradition to develop a compiler for a language in the language itself. It’s called “bootstrapping”. Obviously this isn’t so readily done with the first compiler (this isn’t an improbability drive), but the tradition began in 1958 with a language you’ve never heard of called NELIAC (basically another kind of ALGOL, which you may also never have heard of). It was popularized in 1962 by Lisp. Today, it’s commonly done for C, Java, Python, Scheme, and a plethora of others. It offers a few incidental advantages, inclusive of shrinking the compiler’s object code, applying the compiler’s own output improvements to its own code, and the HR benefit of having no need for the programmers to know more than the compiled language. I’m sure I’ll come back to this, as when you can do it, you should do it. You get to catch the wave of every other programmer’s improvements to the compiler’s code, and add your own two bits to it.
Wrapping up, in the dendrites of the great tree of software language tendencies, we’re moving in a lot of different directions right now. Some of them are contradictory, sometimes both choices have advantages, and I can easily say that more people know how to program right now than at any other point in history. Software and hardware advances, like Google’s QScript and the OpenCL Standard maintained by Khronos, are splitting the once monotonous course of digital electronics into a thousand different rays, and it can go literally anywhere we want it to. Remember that what I am describing to you in this series is the standard method of engineering a computer language; it is hardly the only possibility. It’s an honor to help you follow it, and possibly, for quite a few of you, lead it.