Tag Archives: NLP

Software Language Engineering: The Abstract Syntax Tree (Early Edition)

What is an Abstract Syntax Tree?

(Let me preface this by saying that syntax trees are hardly the only way to write a translator; but they’re broadly used because of their simplicity, versatility, and flexibility.)

“Abstract” as in abstraction. An abstract syntax tree (henceforth AST) is the manner in which we will be encoding every command of our source and object programs. Think of it as an internal representation of the semantics, derived from the original syntax, and encodable in the final syntax.

The biggest trick with an AST is to remember not to confuse it for its representation. Syntax and semantics are elemental, in the old-school black-magic Aleister Crowley sense. You can’t even talk about one without comparing it to the other. This is, in my personal analysis, where a lot of the works-like-magic notion about compilers comes from; but there’s nothing magical about it.

To understand this, think about what happens when you’re speaking with another human being. The art of communication is the act of transference of a thought from one mind to another, through some mutually external medium; be it speech, or writing, or hand gestures. When someone begins to speak, and you begin to listen, your brain (operationally) initially considers every single possible thing that the speaker is about to say. As they pronounce the phonemes and syllables and words of their sentence, the possibilities narrow, and different semantic clusters in your mental network go dark. Once they’ve finished their statement, given that it was well-formed, you have a reasonably precise notion of what it was that was in their mind, which they are trying to convey to you. You can duplicate it, within reason. If it was ill-formed, you might only have a vague, or even a holistically incorrect, notion.

This is the concept of an AST; the AST is the thought itself. It is a (reasonable) binary representation of the program data. I used to think of ASTs as being something like a sentence diagram; but this isn’t really the case, and it in fact held me back for some time. My script grammars were limited to only mild variations of the structure of the language they were written in. ASTs are not grammatical structures; they are more implicit than that. This is the greatest reason why you must not confuse what an AST is for what an AST might look like; it’s the difference between using it like one would a first language, and a second language.

I’m assuming that you have already read the chapter on building an efficient Scanner. I’m also encouraging you to keep the Glossary open in another tab, as I may accidentally jump ahead in my definitions in an early edition of this piece. (Promise, I’ll review it with a giant whiteboard and lots of draft-mechanics before I declare the series finished; you are always free to scream at me about it in the comments.)

So, as an example, let’s consider basic English. Actually, let’s consider Bad English, because I’m not willing to jot down every single rule for a language with a lexicon three times larger than any other western language. Bad English will have a limited set of rules, and will generate a set of only a few basic sentences. For further simplicity, they will all be declarative, much like a programming language.

The first step, before you even begin to piece together an AST, is to write down the rules of your language in a formal specification; that is, a specification with a distinct set of rules to its grammar, which often must be known well by the reader, and allows for little chance of misinterpretation. An informal specification is usually presented in natural language, like this article; it is understood by a much wider audience, but is subject to considerable misinterpretation in contrast.

I recommend Backus-Naur Form (BNF). BNF has a number of translators written for it, often involving meta-programming which generates usable source code. I won’t be using any here, but they do exist. The chief reason for writing it out is for your own benefit, particularly as you are creating a new language.

Primary BNF looks something like this.

Sentence ::= Subject Verb Object .
Subject  ::= I
Subject  ::= a Noun
Subject  ::= the Noun
Object   ::= me
Object   ::= you
Object   ::= a Noun
Object   ::= the Noun
Noun     ::= dog
Noun     ::= stick 
Noun     ::= hog
Verb     ::= ate
Verb     ::= caught
Verb     ::= bought

Before I go further, I should discuss the concept of a terminal, and a nonterminal, symbol. All of the words in the above definition are some form of symbol; a terminal symbol can be parsed no further, while a nonterminal symbol can be broken down. Any program syntax consists of a finite set of nonterminal symbols, and a finite set of terminal symbols. In the example above, “Sentence” is a nonterminal symbol, as it can be inspected to retrieve a Subject, a Verb, and an Object. “I” and “dog” are terminal symbols, as they cannot be broken down further.

Note that the BNF structure isn’t ideal yet. The “::=” is pronounced “may consist of”, and means as much. A sentence is a subject, followed by a verb, followed by an object, followed by a terminal period. (Subject-verb without object is a syntax error in Bad English.) It is thus represented as “Sentence ::= Subject Verb Object .” So what is a Subject? A subject may consist of “I”, may consist of “a Noun”, and may consist of “the Noun”. A Noun may consist of “dog”, “stick”, or “hog”.

A simple AST would be a Sentence tree, referencing a Subject tree, a Verb tree, and an Object tree, with an implicit period. You might already see how this could be parsed to create a sequence of generated functions, regular expressions, and switch-case statements, which would work as a parser and produce a Sentence tree. But remember how I said that the BNF structure wasn’t ideal yet? While grammatically correct and certainly easier to form, the biggest issue with making this deterministic is the repeated definitions.
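As a quick sketch of what that Sentence tree might look like in Java (all class names here are my own inventions for illustration, not part of any framework):

```java
// A minimal sketch of Bad English AST nodes; class names are hypothetical.
public class BadEnglishAst {
    abstract static class AST {}

    // Sentence ::= Subject Verb Object .  (the period is implicit in the node)
    static class Sentence extends AST {
        final Subject subject;
        final Verb verb;
        final ObjectPhrase object;
        Sentence(Subject s, Verb v, ObjectPhrase o) {
            subject = s; verb = v; object = o;
        }
    }

    // Leaf-like nodes simply hold the text they matched.
    static class Subject extends AST {
        final String text; // "I", or determiner + noun, e.g. "the dog"
        Subject(String t) { text = t; }
    }
    static class Verb extends AST {
        final String text; // "ate", "caught", or "bought"
        Verb(String t) { text = t; }
    }
    // Named ObjectPhrase because "Object" would clash with java.lang.Object.
    static class ObjectPhrase extends AST {
        final String text; // "me", "you", or determiner + noun
        ObjectPhrase(String t) { text = t; }
    }

    public static void main(String[] args) {
        Sentence s = new Sentence(new Subject("the dog"),
                                  new Verb("ate"),
                                  new ObjectPhrase("a stick"));
        // Walking the tree reproduces the sentence in the object syntax.
        System.out.println(s.subject.text + " " + s.verb.text + " "
                           + s.object.text + " .");
    }
}
```

A real parser would of course build these nodes from scanned tokens rather than string literals; the point is only that the tree mirrors the grammar, not the surface text.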

You need to take something like the four definitions for “Object” and condense them into one. In BNF syntax, this is usually done with a vertical bar “|”, meaning “or”. Condensing the nonterminal definitions, we have the much more palatable:

Sentence ::= Subject Verb Object .
Subject  ::= a Noun | the Noun | I
Object   ::= me | you | a Noun | the Noun
Noun     ::= dog | stick | hog
Verb     ::= ate | caught | bought

Much nicer, yes? But still not ideal. BNF has been around for a while; John Backus introduced it in 1959, in his description of ALGOL 58, drawing on Noam Chomsky’s work with context-free grammars. The first (arguable) regular expressions were actually developed in 1956 by Stephen Kleene, and while they were pretty primitive, they belong to the same formal-language tradition that shaped BNF.

The above works fine for such a simple language as Bad English, as read by a human being with our own translating and interpreting internal hardware. For a program, the simpler the better, remembering that the most important factor going into BNF is preservation of syntax. Consider this instead:

Sentence ::= Subject Verb Object .
Subject  ::= (a | the) Noun | I
Object   ::= (a | the) Noun | me | you
Noun     ::= (d|h)og | stick
Verb     ::= ate | (ca|bo)ught

This is perfectly valid, and even more concise. Remove the whitespace, and any one of them can easily be condensed into a working regular expression. More importantly, redundancies become very easy to catch on implementation.
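To make that concrete, here is a sketch of the whole Bad English grammar collapsed into a single Java regular expression; the pattern strings are my own assembly from the BNF above, not from any published source:

```java
import java.util.regex.Pattern;

public class BadEnglish {
    // Each nonterminal collapses to a regex fragment, mirroring the BNF.
    static final String NOUN    = "(?:(?:d|h)og|stick)";
    static final String SUBJECT = "(?:(?:a|the) " + NOUN + "|I)";
    static final String OBJECT  = "(?:(?:a|the) " + NOUN + "|me|you)";
    static final String VERB    = "(?:ate|(?:ca|bo)ught)";

    // Sentence ::= Subject Verb Object .
    static final Pattern SENTENCE =
        Pattern.compile("^" + SUBJECT + " " + VERB + " " + OBJECT + " \\.$");

    public static boolean isSentence(String s) {
        return SENTENCE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSentence("the dog ate a stick ."));  // true
        System.out.println(isSentence("the dog ate ."));          // false: missing Object
    }
}
```

Note that a recognizer like this only tells you whether a sentence is well-formed; it throws away the structure. A parser proper would keep that structure, as a Sentence tree.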

So How Do I Use It?

When you are constructing a language compiler or interpreter, especially from the ground up, always write these out first. Often, it will be exclusively for your own reference. Everything you’re about to do with the Parser module is encapsulated in that BNF-formatted data.

I have an abbreviated format, which I usually use, which basically has the type name, “::=”, and a regular expression for the sequence of elements that this type contains. There are ups and downs to using regular expressions directly; the biggest issue that I run into is that single-member may-consist-ofs (such as “Identifier ::= Numeric | Nominal”) can easily be implemented by simply having the classes for Numeric and Nominal extend a class for Identifier. Then it’s just a question of «expected class».isAssignableFrom(«provided class»). Ultimately, if you can comfortably implement it, then yes, you can do it. I’ll be showing you another way, which I feel is often a cleaner implementation.
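That subclassing trick fits in a few lines of Java; Identifier, Numeric, and Nominal are hypothetical node classes, named only for this illustration:

```java
// Single-member "may consist of" rules modeled as inheritance:
//   Identifier ::= Numeric | Nominal
abstract class Identifier {}
class Numeric extends Identifier {}
class Nominal extends Identifier {}

public class RuleCheck {
    // "Does the provided node type satisfy the expected rule?"
    static boolean satisfies(Class<?> expected, Class<?> provided) {
        return expected.isAssignableFrom(provided);
    }

    public static void main(String[] args) {
        System.out.println(satisfies(Identifier.class, Numeric.class)); // true
        System.out.println(satisfies(Numeric.class, Nominal.class));    // false
    }
}
```

The upside is that the type system enforces the rule for free; the downside is that Java’s single inheritance means a class can only belong to one such rule this way.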

Does This Have Something to Do With Your Caveat On the Scanner?

Why yes. Yes it does.

It’s much more convenient for a Parser to simply handle ASTs. Having it handle items which may be either ASTs, or Strings, is extremely difficult in Java. Give me one more chapter to explain the details of “Terminal” and “NonTerminal” ASTs, and then I’ll show you what needs to happen to your Scanner.

I promise it isn’t big.

[To Be Completed]



Software Language Engineering: Acknowledgements


I have a lot of people to thank for my knowledge of translator construction, which is always growing, especially lately. My chief interest in putting this tutorial together was not simply helping out other curious minds (as important as that is), but also solidifying my own understanding of the process. As Albert Einstein reportedly said, “You do not really understand something until you can explain it to your grandmother.”

My first translators, though I may not have called them that at the time, were built maybe fifteen years ago, when I first attempted to add executable data to a personal file format with a newly constructed grammar. I made a lot of terrible mistakes, and I learned from them. (You will too.) What is the nature of human inquisitiveness and curiosity if not adaptation? Writing translator engines is a notoriously nonlinear and irrational problem (technical term!).

For starters, there is an excellent manual on the subject, first published in the year 2000 by Pearson Education Limited. It’s called “Programming Language Processors in Java”, by David A. Watt and Deryck F. Brown. Java was very different back then, and I spot new things in the example code which I can optimize all the time; but the core material is perfectly communicated. It’s in the realm of a hundred dollars, but if you’re even knee deep in this, I recommend it. (Tragically it is not yet available via Safari Books.)

Additionally, via MIT’s OpenCourseWare, you have an entire course on the subject. It appears to be partially complete; the videos can be found under “Lecture Notes”. I’m still familiarizing myself with the professors, Prof. Martin Rinard and Prof. Saman Amarasinghe, but they have been very helpful. Like any online course, you should make it a point to find extracurricular work in the subject.

Finally, my general education in linguistics, psycholinguistics, and neurolinguistics has provided me with a grasp of the nature of language that goes far beyond natural language. While I must put emphasis on the more directly related texts above, there are many great books on these subjects, by equivalently great teachers, which are very helpful alongside them. Joseph O’Connor and John Seymour have played a large part in my comprehension of the human end of any language, and believe me, it is important to you.

I have another book in front of me, as yet unopened, on modern compiler design; but I won’t be mentioning it until I actually use it. That wouldn’t be particularly fair. I’ve also received some help, at least in the form of my own constructive criticism, from the work of Andrew Davison, all of the unnamed writers and editors of Oracle’s JavaScript in Java guides, and a world of programmers who came before me. I have a lot to thank each of them for.

I believe that my work is sufficiently original beyond the bindings of actually working; but should any of them determine that my code is too close to theirs, or that I am in violation of a license of some kind, please contact me so that I may make amends. The future of computing lies on the backs of the teachers as much as it does the engineers.



Empathic Cars?


So recently, I ran across this guy.

Sensible enough. It has a few apparent assumptions in it: that we all have equivalently coded expressions corresponding to our emotions (a big leap), that we all drive worse when we’re angry or flustered (a moderate leap), and that the number one reason why we might be lowering our eyelids is that we are tired (moderate).

The concept of determining emotion and mental state by reading body-language cues is not a new one; in fact, it is integral to communication. Humans have an astounding 43 muscles in the face, capable of generating myriad expressions that play a pivotal role in establishing rapport with other people. Some of those muscles are moved consciously, some of them are not. NLP has an entire subfield regarding microexpressions, tying singular twitches of these muscles to internal mind states; it is not pseudoscience to think that a computer can be trained to do the same thing. What I worry about is what controls the experimenters used to establish these foundations.

It’s ultimately moot, though. I think we can all see where this is going.




A Writer’s Voice

"Words have the power to both destroy and heal.
When words are true and kind, they can change our world."
—The Buddha

Or they can, you know, kill somebody.

I’ve gotten a sequence of compliments and critiques on my writing lately, mostly compliments, and it has led me to a mode of metacognition. That’s thought about thought, and in this case an analysis of its mechanisms. The discovery that it has led me to is that, while I might not be capable of it yet, this mechanism can be documented.

There’s an old model, popularized by Malcolm Gladwell in his book “Outliers”, called the 10,000-hour rule. I haven’t personally read that book yet, but the phenomenon has wound its way into much of the text that I have, particularly in the realm of social psychology and cognitive neuroscience. The concept is that in order to acquire a fundamental precision and beauty in any art or technique, you need roughly 10,000 hours of practice. That’s just an estimate; some people reach their goal much sooner than others. However, no one is born knowing how to play a guitar or program a computer, and as a model of talent development and the learning process it’s more or less the case.

I have been writing since I was ten. I’ve indubitably learned a hell of a lot since then. My first independent attempt was to write down an idea I had, which struck me as valid for a novel format. I reached about two pages, from the beginning, before I got tired and had to stop. It was in crayon.

Come on, I was ten years old.

Whether or not that began a trek into literature that has accumulated 10,000 hours I do not know; I’ve had a lot of other things that I’ve been doing. However, I obviously wasn’t committing anything to memory at the time; my focus was on writing something down that could be committed by someone else, to their own memory. It hadn’t occurred to me that the sheer practice of the action could be teaching me something. The more important point, even if a secondary one, is that what I learned I may have learned on a subconscious level. I may not know that I know it.

Every writer has something known as a “writer’s voice”. That is, a specific dialect and personality that finds its way into their writing, by which they can be identified, like a psychological fingerprint. I read a rather enjoyable book once, a collaboration between Neil Gaiman and Terry Pratchett, titled “Good Omens”. There was an afterword, not a part of the story but a story about the story, discussing the beautiful obsession with which Neil and Terry pursued their work, hammering away at every idea they had until they were too physically tired to continue. Neil took nights, Terry took days, on a clockwork schedule. All the same, I could feel the difference in the writing; when Pratchett was writing, I could tell, because he followed a specific pattern, whereas Gaiman followed a rather different one. There was a seam in the book, not a painful one, but to an analytical writer it was noticeable. They were not the same man, their experiences and their base dispositions were different, their subtle lessons of the trade had been given by differing circumstances. The rules they followed were sharply distinct.

Guitarists have been described to me, in casual conversation, as having a “guitar voice” defining the way the guitar sounds in their hands; a difference in picking or the way they hold the string down that ever so subtly changes the timbre. It may not be limited to art. Perhaps programmers have a programmer’s voice, changing the nature of the code on the basis of experiences, good and bad, that they have had in the past. Mathematicians may have a mathematics voice, altering the way they approach problems on the basis of what they have done in the past. This is an innate part of our humanity.

So, as I walked home from my (dreaded but accepted) 4:00 AM job the other day, I began to wonder, in idle meditation, about the metrics of my pursuit of short stories. If semantics could be measured quantitatively, regardless of whether it ever has been, and through it potential story content compared to story length, could a statistical relationship be formed? Could a thermodynamics of the creative mind yield usable results? Could I use it to better my own writing?

One such psychology text, “Introducing NLP”, by Joseph O’Connor and John Seymour, was on the subject of the communication of a thought from one mind to another. It presented me with four levels of technical capacity: unconscious incompetence, conscious incompetence, conscious competence, and finally unconscious competence. If I could roll back the lesson that jumped through raw experience into unconscious competence, the final stage of talent, to conscious competence and have it exist in both states at once, then I would have a manner in which to pursue a more academic study of my work. I could amplify my methods, and document the intricacies of the method for the rest of the world. How this is to be done, remains to be seen.



Posted on May 27, 2013 in Meta-Media

