Cool! Yes this is what I was getting at. I mentioned Clang and Microsoft's IDEs in the "related work" section. I haven't seen them describe their data structures very much.
Clang has some documentation, but it's not very clear. For example, this doc has a FIXME(describe design) on it.
We've not completely solved the problem of language complexity creeping in to the metaprogramming toolkit but we do have quasiquotes [1] to partially address the pain of ast constructors and destructors.
e.g. For code snippet "class C(x: Int)" Scala compiler create a tree that roughly resembles the following code:
class C extends scala.AnyRef {
<paramaccessor> private[this] val x: Int = _;
def <init>(x: Int) = {
super.<init>();
()
}
}
This tree looks even more terryfing under the hood in terms of AST constructors. Instead of making people figure out how that works we support nice high level sugar instead:
q"class C(x: Int)"
Where q is a magic string interpolator that is compiled to generate an equivalent AST. It can be used for both construction of new AST nodes based on the older ones (we use $name syntax to substitute thing in) and also deconstuction of existing ones into smaller pieces via pattern matching ($name extracts parts out.)
They use :(expr) or quote/end for quotation, and $var for interpolation. Elixir metaprogramming uses quote/end for quotation, and unquote for interpolation. (In fact the entire Elixir language appears to be done with AST metaprogramming, since it's on top of Erlang.)
I basically think of these schemes as "Lisp-like AST metaprogramming, but with Syntax". Thanks for the pointer on Scala.
I saw a video where people asked why Clang source tools generate textual changes rather than AST changes... and this is a good example. People for some reason think that ASTs are "cleaner" or more usable, but they can be a pain.
Have you looked at gofmt at all? I remember reading somewhere that it was a deceptively difficult project. Perhaps they weren't using the best data structure either?
I haven't yet but I plan to (this needs a wiki page). Part of the motivation here is to avoid writing two parsers. I believe that Go had a parser in C for compilation and then a parser in Go for reformatting, but I want to avoid that duplicate work.
I've looked a tiny bit at clang-format, but it's large and uses the large Clang AST.
The difference between gofmt and dartfmt is that Go doesn't impose any line length. And the Dart async style leads to a lot of chaining, which means more line break possibilities to consider.
I want to write an auto-formatter and it will be somewhere in the middle. It will support a max line length, so it probably needs these optimization functions like TeX and dartfmt, but the language doesn't have this async chaining pattern.
And now I have a pointer to scalafmt. Any other pointers are welcome!
"Part of the motivation here is to avoid writing two parsers."
Have you considered implementing a parser generator instead or taking the higher-order route and constructing a parser from higher-order primitives? Both ways provide a lot of flexibility to produce different trees for different problems, save a bit of time and a lot of code. Languages are mostly sequences, alternations, repetitions and characters anyway.
Yes, I'm thinking of writing my own parser generator / meta-language for the Oil language. Now that I have experience with OSH/bash, I can see what I need to support.
On the one hand, I know how to write the parser by hand for Oil. Now that I've written the parser for OSH, it's not too much work, and not too much code. The thing that might push me over the edge is handling "broken" code like Microsoft's Roslyn, but I may put that off for v2 and just get the shell working.
The first half of my blog is kind of about how nothing off the shelf like ANTLR or yacc will work. I will have to write my own.
There are posts there about parsing in a single pass, undecidable parsing, four mutually recursive sublanguages, lexer modes/lexical state, using two tokens of lookahead vs. arbitrary lookahead, and Pratt parsing for experssions.
Oil won't be as complicated as OSH/bash, but it at least needs lexer modes and Pratt parsing.
I actually thought about trying a bounty / crowdsourcing for a meta-language that can express Oil in the fewest lines of code (with no dependencies). My claim is that no existing meta-language can handle it.
Cool! Yes this is what I was getting at. I mentioned Clang and Microsoft's IDEs in the "related work" section. I haven't seen them describe their data structures very much.
Clang has some documentation, but it's not very clear. For example, this doc has a FIXME(describe design) on it.
http://clang.llvm.org/docs/LibFormat.html
I also think the Clang AST is absolutely enormous and thus hard to document.