Friday, 28 August 2015

A nanopass compiler in Haskell (part 2 of many)

Problem Outline


From the rough introduction in the last post it is clear that we want to define multiple representations for use in different phases of the compiler. By itself this desire does not cause any problems: Haskell lets us declare as many datatypes as we want. So in a naive manner we could jump right in:
data Tree1 = Lit Integer | Var String
           | Add Tree1 Tree1 ...
data Tree2 = Lit2 Integer | Var2 String Visibility
           | Add2 Visibility Tree2 Tree2 ...
data Tree3 = Lit3 Integer | Var3 String Type
           | Cast Type Tree3 ...

Sure enough, Haskell will let us type a lot of code quickly, but that is not necessarily a good way to proceed. Even in this simple example we can see several issues starting to appear.

  • Constructors need unique names, so we end up generating arbitrary members of syntactic families (Lit, Lit2 etc.). We can see a relationship between these constructors (and their types), but that relationship is not expressed anywhere in the program.
  • Both Lit and Var are simple constructor cases - they represent leaves in the program tree. The Add and Cast constructors are more complex: they represent interior nodes, so their fields include recursive occurrences of the tree type.
  • Both Tree2 and Tree3 extend Tree1 in some way: Tree2 adds data-flow properties to nodes in the tree (e.g. for analysing information leakage) while Tree3 adds type information to the tree. Both the Tree2 and Tree3 datatypes include a common core that is similar to Tree1, but this relationship is only expressed syntactically - the compiler knows nothing about it (the sketch after this list makes the point concrete).
  • Tree2 extends Tree1 in a monotonic way: each of the constructors in Tree1 has an equivalent in Tree2. Tree3 extends only part of Tree1: it does not include an equivalent to the Add constructor. The set of nodes in Tree3 overlaps the set of nodes in Tree1, but it is neither a subset nor a superset.
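To make the point about the common core concrete, here is a minimal sketch of the kind of boilerplate this purely syntactic relationship forces on us: a hand-written function (forget2 is a name invented for this sketch) that strips the annotations from a Tree2 to recover its Tree1 core. Nothing in the types checks that this is the relationship we had in mind - it is just another function.

-- A minimal sketch: recovering the Tree1 core of a Tree2 by hand.
-- The correspondence between the constructors exists only here,
-- not in the datatype definitions themselves.
forget2 :: Tree2 -> Tree1
forget2 (Lit2 n)     = Lit n
forget2 (Var2 s _)   = Var s
forget2 (Add2 _ l r) = Add (forget2 l) (forget2 r)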
There are few Haskell programs that consist of datatypes but no functions, so next we can consider what happens as we start to write functions over these types. There will be at least one parser in the compiler: the initial representation will be a parse tree. If any of the later phases can serialise/de-serialise their representation then they will define further parsers.
parse :: String -> Maybe Tree1
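A minimal sketch of such a parser, handling only the leaf cases (a real parser would also need to recognise Add expressions, and would likely be built from a parsing library):

import Data.Char (isAlpha, isDigit)

-- Parse a bare integer literal or a bare variable name.
parse :: String -> Maybe Tree1
parse s
  | not (null s) && all isDigit s = Just (Lit (read s))
  | not (null s) && all isAlpha s = Just (Var s)
  | otherwise                     = Nothing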
Each representation that is not constructed by a parser will be the result of transforming an earlier representation; these transformation functions will be responsible for filling in default values for the fields that the later representation introduces.
build2 :: Tree1 -> Tree2
build3 :: Tree2 -> Tree3
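As a sketch of what the first of these might look like, assuming a hypothetical default value Public for Visibility (not defined above):

-- Fill in a default Visibility at every annotated position.
-- Public is an assumed constructor of Visibility, for illustration.
build2 :: Tree1 -> Tree2
build2 (Lit n)   = Lit2 n
build2 (Var s)   = Var2 s Public
build2 (Add l r) = Add2 Public (build2 l) (build2 r)

Note that build3 cannot be written in the same mechanical way: Tree3 has no equivalent of Add2, so the transformation must eliminate those nodes rather than simply copy them.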
Lastly we will expect functions that transform a representation into another kind of data - the classic example is a pretty printer that converts the datatype into a String containing a human-readable form. Like the constructors, these functions need distinct names:
pretty1 :: Tree1 -> String
pretty2 :: Tree2 -> String
pretty3 :: Tree3 -> String
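Fleshing out the first two of these as a sketch shows how little varies between them - corresponding clauses differ only in the constructor names:

pretty1 :: Tree1 -> String
pretty1 (Lit n)   = show n
pretty1 (Var s)   = s
pretty1 (Add l r) = "(" ++ pretty1 l ++ " + " ++ pretty1 r ++ ")"

pretty2 :: Tree2 -> String
pretty2 (Lit2 n)     = show n
pretty2 (Var2 s _)   = s
pretty2 (Add2 _ l r) = "(" ++ pretty2 l ++ " + " ++ pretty2 r ++ ")"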
Already it is clear that these groups of functions will reimplement the same functionality. Just as the datatype definitions contained a relationship that was only expressed syntactically, so will each group of functions. In the worst case this means that code will be copied from function to function, introducing all of the problems of poor code reuse. It would be better to specify each piece of unique functionality only once, and to express the reuse within the language so that the compiler can see it. This is a fairly standard problem in generic programming.

Wednesday, 26 August 2015

A nanopass compiler in Haskell (part 1 of many)

What is the right way to write a compiler? There are only two cases to consider. In the first case the language that we are compiling is well understood. It's old. It's made of old pieces. Our probability of being surprised by what we need during the implementation is low. It makes some sense to say that the entropy of the language is low. In this case our main concern will be performance: both in generating the highest quality target code, and in doing so as quickly as possible. This case generally corresponds to working commercially.

The second case interests me more. The language that we are compiling from is new. It tries something that we have not seen before. The probability of being surprised by what we need to do is high. For the purpose of this discussion we can say this kind of language has a high entropy. The main concern here is not performance, but clarity and simplicity. We are exploring unknown territory, and to do so we need to isolate the pieces of functionality that we require, with the loosest possible coupling between them. Doing so is the shortest route to finding the combination of them that solves our problem.

Each case indicates a different desired structure in the compiler. If we want to optimise performance using well-understood pieces then the top-down design implied by a monolithic structure is a good fit. A small number of phases in the compiler requires only a small number of intermediate representations, and so fewer conversions between them. More work is done inside each phase, benefiting the spatial and temporal locality of the phase, and thus increasing its performance on a typical modern architecture.

If we want to maximise flexibility and simplicity then each phase should do the smallest unit of work that it can. This implies that the number of intermediate representations will be larger. Roughly speaking, this is the style of implementation that I was taught as an undergraduate. By that time (around 2000) it had become folklore that it was easier to manage the complexity of writing a compiler by writing complete, simple passes over in-memory structures. This was a stark contrast with the accepted wisdom of a previous academic generation: that multi-pass algorithms should be avoided because of the performance characteristics of an earlier memory hierarchy.

That folklore reflected an understanding that the environment a compiler writer operates within had shifted, and that a new approach would yield better results. Here we are 15 years later and the term "Nanopass compiler" has been coined to describe this approach. I would like to describe some of the difficulties that occur when this approach is attempted in Haskell. Various solutions have been proposed for encoding this structure in Haskell; my purpose in writing this is to document how much these solutions cost in implementation effort.