Cool, I didn't know this was a common pattern. I recently saw the same approach implemented in scala.meta [1] - it lets you view the code both as parsed tokens with all syntax intact and as more abstract ASTs that carry only semantic meaning. Someone even built a code formatter called scalafmt [2], like the author mentions.
It's a really cool approach because I think we need to pay much more attention to making more/better structured data from the compiler available to tools.
Back when we started scala.meta, token-level granularity wasn't available in most metaprogramming frameworks. Clang [1] and Roslyn [2] seem to be the first two major industry-grade compilers that use this approach and re-use the compiler as the foundation for extensible tooling APIs.
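As a concrete illustration of what token-level granularity buys you (a sketch using Python's stdlib tokenize module, not scala.meta or Roslyn themselves): the token stream keeps trivia like comments and whitespace, so the original source is fully recoverable from it.

```python
import io
import tokenize

source = "x = 1  # keep this comment\n"

# Tokenize, keeping trivia: COMMENT and NL tokens carry the text
# that an abstract AST would normally throw away.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# Every character of the original is recoverable from the token stream.
round_trip = tokenize.untokenize(tokens)
assert round_trip == source
```

This round-trip property is exactly what a tooling-grade front end needs and what most batch compilers discard.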
From [2]: "Incidentally, these are called 'red/green trees' because those were the whiteboard marker colours we used to draw the data structure in the design meeting. There's no other meaning to the colours."
I love reading tidbits like that from other teams' work.
Cool! Yes this is what I was getting at. I mentioned Clang and Microsoft's IDEs in the "related work" section. I haven't seen them describe their data structures very much.
Clang has some documentation, but it's not very clear. For example, this doc has a FIXME(describe design) on it.
We've not completely solved the problem of language complexity creeping into the metaprogramming toolkit, but we do have quasiquotes [1] to partially address the pain of AST constructors and destructors.
For example, for the code snippet "class C(x: Int)", the Scala compiler creates a tree that roughly resembles the following code:
class C extends scala.AnyRef {
  <paramaccessor> private[this] val x: Int = _;
  def <init>(x: Int) = {
    super.<init>();
    ()
  }
}
This tree looks even more terrifying under the hood in terms of AST constructors. Instead of making people figure out how that works, we provide nice high-level sugar:
q"class C(x: Int)"
Where q is a magic string interpolator that is compiled to generate an equivalent AST. It can be used both for construction of new AST nodes based on older ones (we use $name syntax to substitute things in) and for deconstruction of existing ones into smaller pieces via pattern matching ($name extracts parts out).
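Python has no quasiquotes, but a rough analogue of the same construction/deconstruction split can be sketched with the stdlib ast module (this is an analogy I'm drawing, not how scala.meta works internally): ast.parse plays the role of quotation, and attribute access on the node plays the role of pattern-matching extraction.

```python
import ast

# The "sugar": parse a snippet instead of spelling out constructors by hand.
tree = ast.parse("class C:\n    def __init__(self, x):\n        self.x = x")

# What the sugar hides: the raw constructor form is much noisier.
# ast.dump(tree.body[0]) prints the full ClassDef(...) constructor soup.
cls = tree.body[0]

# Deconstruction: pull pieces back out of the node.
assert isinstance(cls, ast.ClassDef)
assert cls.name == "C"
```

The point mirrors the Scala example: nobody wants to write `ClassDef(name='C', bases=[], ...)` by hand when a quoted snippet can generate it.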
They use :(expr) or quote/end for quotation, and $var for interpolation. Elixir metaprogramming uses quote/end for quotation, and unquote for interpolation. (In fact the entire Elixir language appears to be done with AST metaprogramming, since it's on top of Erlang.)
I basically think of these schemes as "Lisp-like AST metaprogramming, but with Syntax". Thanks for the pointer on Scala.
I saw a video where people asked why Clang source tools generate textual changes rather than AST changes... and this is a good example. People for some reason think that ASTs are "cleaner" or more usable, but they can be a pain.
Have you looked at gofmt at all? I remember reading somewhere that it was a deceptively difficult project. Perhaps they weren't using the best data structure either?
I haven't yet but I plan to (this needs a wiki page). Part of the motivation here is to avoid writing two parsers. I believe that Go had a parser in C for compilation and then a parser in Go for reformatting, but I want to avoid that duplicate work.
I've looked a tiny bit at clang-format, but it's large and uses the large Clang AST.
The difference between gofmt and dartfmt is that Go doesn't impose any line length. And the Dart async style leads to a lot of chaining, which means more line break possibilities to consider.
I want to write an auto-formatter and it will be somewhere in the middle. It will support a max line length, so it probably needs these optimization functions like TeX and dartfmt, but the language doesn't have this async chaining pattern.
And now I have a pointer to scalafmt. Any other pointers are welcome!
"Part of the motivation here is to avoid writing two parsers."
Have you considered implementing a parser generator instead, or taking the higher-order route and constructing a parser from higher-order primitives? Both ways provide a lot of flexibility to produce different trees for different problems, and save a bit of time and a lot of code. Languages are mostly sequences, alternations, repetitions and characters anyway.
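For readers unfamiliar with the higher-order route: here is a minimal parser-combinator sketch in Python (my own toy names, not any particular library). A parser is a function from text to (value, rest) or None, and sequences/alternations/repetitions are functions that combine parsers.

```python
# A parser is a function: str -> (value, remaining) or None on failure.
def char(c):
    return lambda s: (c, s[1:]) if s[:1] == c else None

def seq(*ps):  # sequencing: run each parser in order
    def parse(s):
        out = []
        for p in ps:
            r = p(s)
            if r is None:
                return None
            v, s = r
            out.append(v)
        return out, s
    return parse

def alt(*ps):  # alternation: first parser that succeeds wins
    def parse(s):
        for p in ps:
            r = p(s)
            if r is not None:
                return r
        return None
    return parse

def many(p):  # repetition: zero or more occurrences
    def parse(s):
        out = []
        while True:
            r = p(s)
            if r is None:
                return out, s
            v, s = r
            out.append(v)
    return parse

# The grammar "a b*" built from the primitives:
ab_star = seq(char("a"), many(char("b")))
assert ab_star("abbc") == (["a", ["b", "b"]], "c")
```

The appeal is that the grammar and the parser are the same object, so different tree shapes are just different combinator wirings.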
Yes, I'm thinking of writing my own parser generator / meta-language for the Oil language. Now that I have experience with OSH/bash, I can see what I need to support.
On the one hand, I know how to write the parser by hand for Oil. Now that I've written the parser for OSH, it's not too much work, and not too much code. The thing that might push me over the edge is handling "broken" code like Microsoft's Roslyn, but I may put that off for v2 and just get the shell working.
The first half of my blog is kind of about how nothing off the shelf like ANTLR or yacc will work. I will have to write my own.
There are posts there about parsing in a single pass, undecidable parsing, four mutually recursive sublanguages, lexer modes/lexical state, using two tokens of lookahead vs. arbitrary lookahead, and Pratt parsing for expressions.
Oil won't be as complicated as OSH/bash, but it at least needs lexer modes and Pratt parsing.
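For anyone who hasn't seen Pratt parsing: the core idea is a loop that keeps consuming operators while they bind more tightly than the current context. Here is a minimal sketch in Python (my own toy grammar with +, -, *, / and parentheses, not Oil's actual parser):

```python
import re

def tokenize(s):
    # Integers, operators, and parens; a sentinel marks end of input.
    return re.findall(r"\d+|[-+*/()]", s) + ["<eof>"]

BP = {"+": 10, "-": 10, "*": 20, "/": 20}  # binding powers (precedence)

def parse(tokens, pos=0, min_bp=0):
    tok = tokens[pos]; pos += 1
    if tok == "(":
        lhs, pos = parse(tokens, pos, 0)
        pos += 1  # skip the closing ")"
    else:
        lhs = int(tok)
    # The Pratt loop: consume operators while they bind tightly enough.
    while tokens[pos] in BP and BP[tokens[pos]] > min_bp:
        op = tokens[pos]; pos += 1
        rhs, pos = parse(tokens, pos, BP[op])
        lhs = (op, lhs, rhs)
    return lhs, pos

tree, _ = parse(tokenize("1 + 2 * 3"))
assert tree == ("+", 1, ("*", 2, 3))  # * binds tighter than +
```

A full expression parser adds prefix/postfix operators and error handling, but this loop is the whole trick: precedence lives in a table, not in a tower of grammar rules.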
I actually thought about trying a bounty / crowdsourcing for a meta-language that can express Oil in the fewest lines of code (with no dependencies). My claim is that no existing meta-language can handle it.
IIRC Sean McDirmid deals with non-AST systems, so you interact at the user level when processing code (it's important to minimize changes so the user can track what's happening).
Technique as old as dirt. My thesis advisor used it in the 60s and 70s.
His favored approach had one artificial limitation, perhaps a remnant of the age: he limited the size of source files in bytes to some power of 2. This let him represent each token (incl. white space/new lines/comments) as a pair of fixed-size integers indexing into the file bytes. The tokenized file is an array of those pairs and the concrete syntax tree is a tree where leaf nodes are indices into the array of tokens.
Suitable for syntax-directed code generation, control-flow graph generation, static analysis, linting, pretty-printing. Super memory compact and even has an upper bound on memory footprint. The caveat is that the source language does need to support composing a "module" out of multiple files because of the limit on source file size.
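The scheme described above can be sketched in a few lines (names and layout are my own reconstruction from the description, not McKeeman's code): each token is a fixed-size (start, length) pair indexing into the raw file bytes, so no token text is ever copied, and trivia is just more tokens.

```python
# A token is a (start, length) pair into the raw file bytes; CST leaves
# would be indices into this token array, so no text is ever copied.
source = b"x = 1  # note\n"

# (start, length) pairs covering every byte, trivia included:
# 'x', ' ', '=', ' ', '1', '  ', '# note', '\n'
tokens = [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2), (7, 6), (13, 1)]

def token_text(i):
    start, length = tokens[i]
    return source[start:start + length]

assert token_text(0) == b"x"
assert token_text(6) == b"# note"

# Lossless: concatenating all token texts reproduces the file exactly,
# which is what makes pretty-printing and linting from this structure work.
assert b"".join(token_text(i) for i in range(len(tokens))) == source
```

The power-of-2 file-size cap in the original design would let both integers in the pair be packed into fixed-width fields, giving the hard upper bound on memory mentioned above.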
I'm sure it's as old as dirt, but I don't think there is a good name for it. "Concrete Syntax Tree" is not a good name for the reasons pointed out in the article.
Do you have reference for this? I saw this problem mentioned in the write-up on the ZINC Abstract Machine by Xavier Leroy (author of OCaml). But I don't know of other papers that talk about this.
Also I believe that most open source tools do NOT have this functionality. Look at lib2to3. It's bolted on -- not exactly a clean design. Most open source front ends are not designed for tooling like Clang is.
My source is in-person conversations with a real-life human being, and working on one of his codebases that employed the technique. If you want to look up his work, his name is Bill McKeeman. I personally have never felt compelled to find secondary sources when I had primary sources.
Edit: I'm personally not surprised that open-source codebases don't employ this technique. Lots of great PL work was done for private companies until the 90s and while lots of work was published in papers and books, precious few open-source PL communities historically drew from academia. I'm sure you know the counter-examples.
[1] http://scalameta.org/tutorial/
[2] https://github.com/olafurpg/scalafmt