First, you should absolutely write couple of parsers by hand first and then repeat this exercise now and then.
I understand the reasons why author does not use parser generators. However, if you are writing a parser for serious production use I urge you to seriously consider parser generator instead of going manual route. Here is why/
Parser generators are akin to compilers. They require certain constraints to be met but in return they generate extremely efficient parsing code. For classes of languages for which parser generators exist, you cannot beat generator with handwritten code neither in terms of parsing nor in terms of maintainability.
Citing shift-reduce conflicts as one of the reasons to write parser by hand is akin to resorting to assembly being frustrated with C compiler errors.
Yes, there are cases when hand-written parsers are preferred. gcc switched for parsing of C/C++ from flex/bison to handwritten parser during 3.x and clang also has handwritten parser.
But this is because C and C++ are languages with context-dependent grammar and C++ syntax became increasingly arcane over the years. You constantly have to resort to tricks during C++ parsing. For example, to properly parse C++ class definition, you need to pass it two times, first reading declarations and only then both declaration and method bodies. You also need to resort to tricks and heuristics if you want to parse '>>' as part of nested template instead of right shift operator etc etc.
Almost always that kind of complicated, context-dependent grammar makes it possible (and in case of Perl, even very easy) to write WTF code.
Generated parsers are seldom the most efficient parsers; they can't use many tricks that can make hand-written parsers much faster, because they need to cope with the full generality of the language class they're targeting.
Maintainability is a moot point. The more complex your language, the bigger a maintenance benefit you get from a parser generator, providing it's expressive enough. For parsing C++ outside of a commercial compiler, I'd look at a GLR parser, for which the tables would most likely be tool-created. (In a commercial compiler, I'd be back to hand-written again.)
The value of being able to change your grammar and have your parser follow suit instantaneously isn't high past the prototyping stage. Other things will consume the parse tree, and depending on the tool, the parse tree's shape may be driven by the parse rules (ANTLR) or the parser actions may be more or less deeply embedded in the grammar and require refactoring themselves (most other tools). The downstream consumers of the structures almost certainly need modification too, since it's not likely you're just changing syntax sugar. Whereas if you have a hand-written parser, you can minimize the work needed to adjust downstream. You have more latitude for engineering.
It's great to use tools to validate a grammar, to prototype parsing it, and perhaps even for lightweight work like analysis. But when it's essential you have a 100% accurate semantic analysis, great error messages, excellent performance, deep tooling integration (e.g. IDE code completion), the more control you need over the parsing processes. It's closer to the critical path of success for your target market, and generators are too generic.
For me, parser generators work well for a certain range of applications. Given a range of complexity, with 1 being a date format parser and 10 being a commercial compiler with IDE integration, parser generators work well somewhere around 3 to 7. At the lower end, their costs in terms of integration, third-party dependencies etc. outweigh the complexity of the problem they're solving. At the higher end, you need a lot more out of the tool than it is designed to give you, and working around it causes more pain than anything you're saving.
I was a front-end engineer on the Delphi compiler for 6 years. I don't know of any major commercial compiler that uses a parser generator. Almost all use hybrid recursive descent.
Your comment of range of applicability is a very good one, it could be that I found myself within that range more often than not. I did write and maintain C++ parser that supports multiple dialects and self-recovering from syntax errors. It manly was written using flex/bison, but unavoidably used a lot of hand-written tricks.
Interesting, but how can one write a compiler with "hybrid recursive descent" ? I remember learning that most programming languages aren't actually generated by that class of languages (I don't know about Delphi specifically, but C++ and C are both not recursive-descent-parseable).
> context-dependent grammar makes it possible (and in case of Perl, even very easy) to write WTF code.
Do you really think that a context-dependent grammar will make for easier to understand code? Or more generally, do you really think that a language that is easier to parse is also easier for a human to understand?
...the human brain works very differently from your parser or lexer and code that may be very hard to parse for a computer may be very easy for a human to understand, and in reverse, code that is easy for the computer to parse can be virtually incomprehensible to a human.
If what you say were true, then we would all be using some kind of Lisp, as there's nothing easier to parse, but this is not the current reality. And don't tell me that we aren't because Lisp was inefficient or the AI-winter or anything like that. You can easily express a C-like language in a Lisp style notation. Complicated syntaxes that makes it harder to write parsers tend to be much easier for the human brain to understand, imho. Problems arise when some languages like Perl have rules that are "too relaxed" or "with too many exceptions" and this leads to WTF code.
In fact, looking at the kinds of notations that physicists and mathematicians use (you'd be surprised how "context dependent" mathematical language is, and by "context" I don't even mean lexical context, I actually mean "common set of assumptions held in the minds of most mathematicians about what each notation tends to mean in each context"), I'd say that the human brain is actually aggressively optimized for context-dependent languages!
Sure you can write incomprehensible context-free grammar, but yes, in my opinion, context free languages tend to be easier to understand or at least understand unambiguously. Natural language (which is obviously very context-dependent if you can even apply this name to a language without formal grammar) is easy to understand but is not a good language for giving instructions to a computer (cue classic "Time flies like bananas").
Interestingly enough, I personally found understanding Lisp, once I got the idea of paradigm, pretty much instantly. I do not think that the reason for us not using Lisp is syntax, but rather a combination of non-traditional paradigm and difference from the mainstream imperative languages. On top of that you do need to keep a lot of context while looking at Lisp program, but this is execution context, not the grammar context.
Basically, if you can simplify it to a standard, well understood, and high level construct you should. If you can't, you might be doing something wrong and adding unnecessary complexity.
I agree with the main point of the article but there are parser generators like OMeta http://tinlizzie.org/ometa/ that help (more than ANTLR/Lex/Yacc) to think in the grammar from a higher level point of view without paying a lot of attention about ambiguities and grammar restrictions. Sure, OMeta is slow but it offers some solutions to the problems presented in the article.
I was so impressed with OMeta when I read about it the first time that I decided to write my own implementation for Python. I got a first version released[1] but I sort of ran out of steam before I could take it any further. Luckily, there's another OMeta-based parser library for Python called Parsley which seems to be better-maintained (it's already hit v1.0!)
I used to mostly code parsers by hand for many of the same reasons, for years -- this article gives a nice rundown -- but the repetitive code really does kind of annoy, and I've found something else that works for me. Going over the listed problems:
Lexing and context: PEGs don't need a separate scanner; the option of calling an arbitrary function to parse a part usually suffices for unusual context needs.
Shift/reduce and grammar conflicts: PEG again sidesteps the problem, at the cost of sometimes resolving ambiguity in an unexpected way.
Mixed code: Semantic actions are denoted by function names instead of inline code. I've used the same grammar in different languages.
Other limitations: Given a small parsing library -- like one or two pages of non-golfed code -- it's more thinkable to hack it to address whatever particular problem comes up.
So most often these days I use https://github.com/darius/peglet when I have a parsing problem. It's definitely not for coding gcc with.
I use LPeg myself. What I like about LPeg is that you can compose it. Once I have an LPeg expression that parses, say, an IPv6 address, I can reuse that expression in a larger grammar.
I've used parser combinator libraries (usually based on Parsec), and it's super nice being able to write your grammar as code, very quick and easy, but I'm not sure if it helps with the problems listed, since I think you still need to manually write a bunch of code to process the parsed tokens.
Parsec will by default have to be slower, because it allows eg infinite backtracking / lookahead. If you use a parser combinator library that has less power, you can make it faster. E.g. if your library only has to exprose on Applicative interface.
I am working on a Parsec-like library for parsing regular languages (as in theoretical computer sience, not as in Perl). As opposed to grep I want to do something with the results instead of just accepting / rejecting, and I also want to expose more operators under which regular languages are closed, like difference or intersection (i.e. parallel match) or matching all elements of a set once but in any order, or chopping off a regular prefix or suffix.
It sounds to me like he wants a PEG parser - PEG is a formalization of a recursive descent parsers that operates in a single pass and can quite easily handle the "parse different things in different contexts" problem (like the inline assembly the author mentions).
It's not terribly mature yet, but I've been poking away at building a PEG parser generator myself (mostly as an exercise to learn C++11): https://github.com/bruceiv/egg
PEG parsers use backtracking and do not operate in a single pass. They use memoization to avoid the combinatorial explosion, but that trades one problem for another (though of lesser magnitude).
Recursive descent parsers are normally LL(k) and are O(n) in the size of the input - choices are made based on the next k tokens and no backtracking or alternate grammar paths are generally ever tried. Though they are not particularly efficient when rules nest deeply on the left before consuming tokens. It's better to use an operator precedence parser for parsing arithmetic expressions than use LL(1) rules to handle precedence, for example.
Having a separate lexer has other benefits besides eliminating explicit whitespace. It enforces consistency in the token set, which can otherwise seem arbitrary - a keyword has a special meaning in one part of the grammar, but not elsewhere. And this in turn helps with keeping backward compatibility when growing a language, because you can know for a fact that using a keyword elsewhere in the grammar doesn't introduce new problems.
PEG parsers use backtracking and do not operate in a single pass.
Backtracking + memoization is but one implementation technique for PEG parsers. Another such technique is akin to dynamic programming -- fill in a table with possible parsing outcomes as you go. No "backtracking" (which really, is moot in a memoized PEG parser) required.
Either way, the performance is identical to that of a recursive descent parser (linear in the size of the input and number of nonterminals).
Recursive descent parsers are linear in the size of the input, but they're also linear in the nesting depth of the grammar, unlike e.g. PDAs or operator precedence parsers, which is why they're a poor choice for things like math expressions.
And performance is not just time, but also space.
There are niches for most parsing algorithms, but PEGs are a poor fit for most tasks except lightweight ad-hoc use, especially implementation languages for which tool support is poor.
Perhaps, but it also makes it annoying to add new keywords while keeping backwards compatibility even if it can be done entirely unambiguously. For example, C++11 needed to resort to the awkward idea of "identifiers with special meaning" to add override and final.
By "single pass" I meant simply that you don't need a separate lexing phase (though as some of the other commenters have mentioned, it is possible to do a PEG parser in linear time using memoization or dynamic programming)
Off topic: in what context do so many people on here need parsers? Do you actually use them for your jobs or are all of these use case for pet languages? When reading comments for threads like this I cannot help but feel that every day a half a dozen new programming languages get released that I never hear about.
Not every parser is for a programming language. Often you need something to parse a specific file format you need to handle or made up. Although back at university my job was working on a large-ish modelling and simulation framework, and we had all kinds of parsers because each simulator needed some way of storing its initial state or the model configuration. Simply shoe-horning all of those things somehow into XML, JSON, YAML or CSV doesn't always work.
We use three DSLs to construct our application, all of which have custom hand-written parsers for performance reasons. One for extending UI, one for extending the core domain model and one for defining business workflows and events. We have 65,000 separate customisations at the moment so this is a big win for us.
You throw all these into a our metadata store and point the software at it and at runtime it throws out a desktop (WPF), mobile (web) and web version of our app which are all tightly integrated.
We built this because everyone of our clients needs heavy process level customisation yet we want to maintain a SaaS model.
Not toy languages here although you'll never see them on github.
Well, for a personal project I've written a parser for email (https://github.com/spc476/LPeg-Parsers/blob/master/email.lua) but I've also used LPeg for work to parse the logging output of some of our components (as part of the regression test) since they have a set structure (a mixture of "var=333" (example), "var=1/5/22" (minimum, average, maximum) and "time=33/55/88ms" (now with time units!).
Yes, it's nothing that a regex couldn't do, but I find the LPeg code easier to read and modify than a regex, and I don't have to do two passes over the data (one to break the input into multiple strings, then convert the appropriate strings into numerical values).
> I find context-free lexing to be a serious limitation on parsing.
What is this supposed to mean? Context-free and context-sensitive are well-defined terms in formal language theory, but OP seems to be using them in a non-standard way.
In any case, when we talk about lexers it's normally understood we're parsing a regular language, which is simpler than parsing general context-free languages (let alone context-sensitive) [1]. If you're doing something that cannot be expressed with a plain regular expression, then it's probably not lexing in the first place.
Age 37
Group 15-B
Phone +49.(0).123.456
Maybe I'm dense, but I can't see this would be problematic. If OP said what did he try, and why it didn't work, it would be nice.
No, I use many languages that care about whether parens (each one of which is its own token) are balanced. The language of balanced parens (a language that includes "" and "((((()()(()))())))" but not "()())(()" or "(((()))))") is a simpler language that also cares about parens being balanced. It was also the first language I saw in compilers class that was not regular.
Fair enough. I've definitely run into many of those issues too. It sounds like the features you're missing are:
- an ability to switch some blocks of input text to an alternative parser for another language or more free-form mode
- a better model of CFGs that reduces or eliminates unnecessary errors from the tool not understanding your particular normalization of the grammar
- better tools for working with the resulting ASTs
It sounds to me like the real issue is with the tools we have, not necessarily with parser generators as a concept. Maybe a more accurate title would have been "why today's parser generators don't work for me", because at the end of the day, unless your really good at cranking out sensible parsing code on your own, going the ad-hoc route seems to have a huge drawback when it comes to reinventing the wheel and maintenance of the resulting parsing code. The other major advantage of using a parser generator in my mind is that the resulting language is likely to be way more consistent and portable than something that's parsed ad-hoc. But maybe for simple languages, this isn't a big deal.
If the tools are still seriously lacking after several decades, that does point at a problem with the concept IMO. I don't think the core idea of a parser generator is an awful one, but I do think it's become clear that the classic lex/yacc approach is the wrong one, and while PEGs and things like Parsec as a good step in the right direction, I think another conceptual shift will be needed before they're unambiguously better than a handrolled parser.
Parser Generators have a bad reputation because they provide a leaky abstraction. In order to use an LR parser generator effectively you need to understand which languages the algorithm can handle.
Once you do, most of the problems vanish. For instance, in my experience, shift/reduce conflicts work almost like an integrated debugger. They highlight ambiguities in your grammar and most parser generators can produce a detailed report of the conflicts.
If, however, you do not understand the algorithm you are using and are for instance formulating your grammar to be completely right linear... well, you will get spurious shift/reduce conflicts.
The same applies to the lexer. Yes, a lexer can only process regular languages. Anything beyond that can be done in the parser. The parse tree generated by a tool like yacc is often messy and full of details. You can handle this by having an additional, hand written, pass which transforms the yacc output into the actual internal data structure of your compiler. Which, outside of toy examples, had better not be an abstract syntax tree. :)
the clojure project links to https://github.com/epsil/gll, the last section of which has suggestions and useful links for implementing the algorithm in other languages
Most modern parser generators are capable of more than the author gives credit for. Bison/Flex, for example, can handle most of the issues mentioned (feedback from the parser back to the lexer for context-sensitivity, Flex start conditions for grammars within grammars, %prec to explicitly resolve conflicts).
A project would need to have a very simple grammar or very stringent performance requirements to consider writing a parser by hand. "Parser generators can't handle my grammar" is usually a bad reason, although there are the rare exceptions.
> I really don’t want to need a post-processing
> phase which massages the resulting tree.
Weird, it was much easier for me to simplify the tree after the fact. As for anything else this is pretty much how I feel about yacc/bison as well. You have to invest way too much time to understand their internal workings to do anything non-trivial and it's just not worth it.
Are there many (computer) languages which don't use a text file (in some encoding or other) as their 'normal' representation?
If the compiler interpreter wants a tokenized input, perhaps we could/should save in that format?
Obviously the code-entry system (not sure that it's an 'editor' at this point) has to have an efficient way to let you select which tokens to enter (and enter free text in allowed places, i.e. string literals and comments).
That could either be 'one token per key' a la ZX Spectrum Basic (the only system I'm aware of which works this way): http://en.wikipedia.org/wiki/ZX_Spectrum#Firmware or something which looks the same as auto-complete to the end-user.
The typing experience would be much like a modern IDE, but would not allow you to enter or save incorrect text strings where a token was required, and would not require a lexing step (since it would be saved as tokens, or possibly even as an AST).
Interesting, thanks. Note however that the ZX spectrum also did tokenized input. Each key corresponded to a different basic token: http://fms.komkon.org/Speccy/SpeccyKeys.gif.
In the context of lexing and parsing, I was wondering about this idea. The program entry system is kind of like a potentially context-aware parser. No more syntax error are possible...
You may be interested in a parser generator someone at my uni was working on. It generates a possibly-non-deterministic parser which can be disambiguated at the semantic level. Paper here: [0], Source code: [1] It is unfortunately in Java, and I don't know if you want to go down the parser-generator rabbit hole, but if so -- it may be interesting (and it attempts to address the problems you've encountered, so it seems relevant).
Parser generators are useful to explore the problem space. It allows for a higher-level thinking. I always ended up rewriting the parser by hand but I'm not sure I could achieve the same result without the first prototype.
This is the way I usually end up doing things as well. I originally tested out the text grammar for later.js using a parser generator. Then once I worked out the kinks and played around with it a bit, I rewrote the parser so that it was specific to the grammar I needed with some additional flexibility that wasn't available from the generator.
I always used to laugh heartily when installing Slackware (back since the 3.0 days, ~1995?), the package descriptions would come up as the floppy disks were being read, and the most ridiculously inpenetrable one of all read: bison: A parser generator in the style of yacc. Of course, I did try to read the man page but found it mostly unenlightening (ie. being too far from present knowledge) at the time.
I understand the reasons why author does not use parser generators. However, if you are writing a parser for serious production use I urge you to seriously consider parser generator instead of going manual route. Here is why/
Parser generators are akin to compilers. They require certain constraints to be met but in return they generate extremely efficient parsing code. For classes of languages for which parser generators exist, you cannot beat generator with handwritten code neither in terms of parsing nor in terms of maintainability.
Citing shift-reduce conflicts as one of the reasons to write parser by hand is akin to resorting to assembly being frustrated with C compiler errors.
Yes, there are cases when hand-written parsers are preferred. gcc switched for parsing of C/C++ from flex/bison to handwritten parser during 3.x and clang also has handwritten parser.
But this is because C and C++ are languages with context-dependent grammar and C++ syntax became increasingly arcane over the years. You constantly have to resort to tricks during C++ parsing. For example, to properly parse C++ class definition, you need to pass it two times, first reading declarations and only then both declaration and method bodies. You also need to resort to tricks and heuristics if you want to parse '>>' as part of nested template instead of right shift operator etc etc.
Almost always that kind of complicated, context-dependent grammar makes it possible (and in case of Perl, even very easy) to write WTF code.