Sunday, September 13, 2009

Interpreter, Going Open, and Milestones

As I mentioned in my previous post, I've made my compiler available on github: jdfrens/hobbes. I haven't officially put a license on it yet, but it'll be a "don't blame me for bad things, yet give me credit if you use it".

I thought it would be interesting to start working on a interpreter as well. I've written many interpreters the last couple of years (since "Programming Languages" had been my course), but getting it to coexist with my compiler might be interesting. I'm also curious what re-use I might find in my code. (The interpreter could be used for constant folding.) Plus, I want to see how well CIAT can deal with two different sets of processors. (Short answer: quite well! More on this later.)

Finally, I'm trying out Pivotal Tracker to keep track of the stories for my interpreter and compiler. I'm thinking that I might post summaries for each true iteration there (i.e., once a week). I'll have to have another post on how I've abused the term "iteration" here in the past.

Friday, September 11, 2009

Long time, no post!

Since my life has settled down into merely "finding a job" (since I completed "frantically packing and moving"), I thought it would be a good idea to resurrect some of my old projects to keep and improve my programming skills. I've picked up some new ideas for interpreting and compiling since I posted last (nine months ago), so this blog gets resurrected along with my NRHC.

I accomplished two things today: I finished iteration #5 for the NRHC, and I moved my code to github! Check out my hobbes repository. Soon I'll post a blog entry about iteration #5.

Saturday, December 27, 2008

NRHC Story #4: Multiplication

User Story

A Hobbes program can be a single multiplication of integers or command-line arguments.

Analysis

My NRHC compiler could already add two integers, so adding the ability to multiply two numbers should be pretty easy. Just make the addition-generating code more generic, right?

Implementation

Mostly right. In terms of generating PIR code, yes, all I had to do was make the code a bit more generic. I was surprised, though, that the additional token type MULTIPLY complicated the back end. It felt to me like I was worrying about this extra token type more often than I should have to.

Now, granted, I'm using ASTs like (+ 1 2) and (* 3 4). I don't have generic OP ASTs: (OP + 1 2) and (OP * 3 4). I suspect that this would make my code just a little bit simpler. However, I'm sure I'll be better off moving to a tree intermediate representation.

Next Story

My next story will be to allow any number of additions and multiplications. But first, I'm going to refactor my code to use TIRs in the back end. I'll say more about TIRs in a future entry.

Friday, December 26, 2008

CIAT version 0.3.0

I released CIAT 0.3.0 today; it is almost all internal and API changes.

The only output change is displaying standard error messages from a Java-implemented compiler. This is useful (necessary even) for debugging.

The input format didn't change at all.

The biggest internal change is to make each processor responsible for its own processing. (A processor is something like "a compiler implemented in Java" or "the Parrot Virtual Machine".) Each processor is also responsible for reporting whatever it finds important, like source code, compiled code, generated code, execution, error messages, etc. As I expected, this led to some code duplication, but for the time being I'm okay with that. Eventually, I suspect the duplicated code will end up in a Ruby module for a mix-in.

I feel a lot better about the design, and as I add new features, I'll be trimming away the duplicated code and unnecessary abstractions I have now.

Sunday, November 2, 2008

A Parsing DSL in Ruby

I watched about half of a video about a parsing library written in Ruby, entitled (appropriately enough) Grammar: A BNF-like Ruby DSL for Parsing (project on RubyForge). It seems kind of interesting, but it lacks a certain elegance. I'd argue that ANTLR is a bit more elegant, but as I understand it Parsec for Haskell is sheer elegance.

Thursday, October 30, 2008

NRHC Story #3: command-line arguments (again, again!)

So I'm done with Story #3 (started here and continued here). Fixing my "problems" from the last time were not as difficult as I had at first anticipated. (This is most evident since I'm posting this final follow-up the same afternoon!)

Implementation (again)

The Parser. I had to tweak the parser so that ARGV references could appear in an addition expression. The trick here was that I couldn't just make the addition expression recursive since that would allow addition expressions in addition expressions. I don't have a story for that feature yet, and there are too many other issues (associativity for one) that come up, so I couldn't easily add it.

But this lead me down an interesting and good path. A literal integer and an ARGV reference are both simple expressions. It's simple expressions that can appear in an addition expression. So a little bit of refactoring, and the grammar was whipped into shape.

Code Generation. Not much had to change with code generation. I was already putting the value of an ARGV reference into a PIR register, the only trick was getting rid of extra argv declarations. As I described in my previous post, I need this directive in my PIR code when I want to use command-line arguments:

.param pmc argv

Originally, this directive was spit out each time there was an ARGV reference, and it was spit out just before each ARGV reference. So I'd end up with code like this:

.param pmc argv
$I0 = argv[1]
.param pmc argv
$I1 = argv[2]

The first change I made was to make a separate pass on the AST to generate the directive (and in the future any other directives for a function prolog). So I had this:

.param pmc argv
.param pmc argv
$I0 = argv[1]
$I1 = argv[2]

My idea at this point was to leave the extra directive in the code from this phase of the compiler. It's not necessary for me to clean this up (or prevent it) right away; that would be giving the code generator too much responsibility.

Linear-Code Analysis. I broke out the linear-code analysis into a separate phase from the actual writing-code phase, and I suspect this will become an official linear-code analysis and optimization module.

So far, my linear-code analysis and processing consists of inserting tabs when needed and removing unnecessarily duplicated instructions. I've taken the approach that duplicated instructions are by default acceptable (e.g., two instructions that increment $I0). So I remove instructions that I know should not be duplicated, and so far that's just the argv directive.

Other Observations

I'm quite surprised that I haven't felt a burning need to switch to a tree intermediate representation, just processing from the ANTLR AST I get back from the parser. I'm certain, though, that one more language structure that requires a new case in one of my switch statements will put me over the edge. However, I think my next phase might be to add multiplication to the mix. This should not require a switch to a TIR.

I also discovered a slight annoyance with CIAT. Somewhat at the last minute when releasing v0.2.0, I tweaked the way it handles standard error, by dumping it into a file. Unfortunately, this means the standard error does not appear on the console, and I have no reference to it in the report.

NRHC Story #3: command-line arguments (again)

I write this blog entry while I work on this story. If you need to refresh your memory about Story #3, check the analysis post.

I've decided to roughly time my process for this implementation. However, I have many distractions: several of my students are going through mock interviews today, and I have to be on hand to handle fires; other students email me with questions and requests (like getting access to a subversion repository); and I'm trying to rip three versions of The Wall to my computer today. And I'm writing this blog post as I go (although I'm not counting those minutes).

The Implementation

The Lexer. I started with the lexer. To fully parse ARGV[1], I figured I could do it all in the lexer or break it apart in the lexer and parse out integer in the parser. Doing everything in the lexer seems simpler since it's done in one place, but in thinking about it some more, I realized that it puts too much responsibility on the lexer, making it more complicated. It doesn't hurt that this will mostly be handled by the parser anyway in the future!

So I needed to recognize ARGV, [, and ]. (The compiler already recognizes integers.) With a fair number of interruptions and distractions, this took me 40 elapsed minutes. It was probably more like 10 or 15 minutes of real work.

The Parser. So here's a surprise. I did too much work for the lexer. I finished the parser in about 10 minutes, and I didn't need to use the new tokens from my lexer. This suffices in ANTLR:

expression : 'ARGV[' INTEGER ']' -> ^(ARGV INTEGER)

I don't need the ARGV for any reason after the parsing, so why bother to recognize it separately? I did however, create a temporary AST type named "ARGV" as seen in the rewrite rule.

So I took out the changes in my lexer, and I do the work in the parser. I know it will change later, but since I'm not sure how that will play out, I'll toss what I did now.

Compilation. According to the clock, this took me around 40 minutes or so, but part of that time was spent talking to a delusional Packers fan about football. (For the record, anyone not a Bears fan is, by definition, delusional.)

A couple things struck me while working on this. First, my compiler is working with the raw ASTs from ANTLR. They're not the nicest structures to work with since they're so generalized. More importantly, they don't lead to good OO code at all; it's all switch statements. The thing is, I'm not sure when I should switch over to my "tree intermediate representation" that I've cooked over the last few years. These TIR objects implement the visitor pattern, so the various compiler algorithms are nicely implemented in an OO way.

Second, command-line arguments in PIR (that's "Parrot Intermediate Representation") required a compiler directive:

.param pmc argv

The thing is, this should be necessary only once in a PIR subroutine. I'm not sure what the Parrot interpreter will do if I include it more than once. In an addition expression, ARGV[0] + ARGV[1], it matters! I could always include the directive, but that would mean changing all of my previous acceptance tests. I balk at that not because of the work involved (ten minutes, tops) but because it suggests I should be looking for a different solution. So I need to figure a way to add that directive zero or one times as necessary. This is going to require more thought.

And this brings me to the third thing that I've discovered: I'm implementing only half the story! I can use a command-line argument only in place of a single integer; I cannot use them in an addition expression! Ooops.

I had to go back to see how I had phrased Story #3, and it does, in fact, promise the use of a command-line argument in place of any literal integer.

There is this tension in writing a story between making it too trivial and too much. Story #3 is the right size in any practical situation. However, it does not mean I need to implement the whole story. So this is a good place for me to break and to close out this blog entry. I'll finish the rest of Story #3 and post a summary of the results.