In a previous lifetime, I worked on the code generator for a compiler for an implementation language. We had a rule that said "A good compiler generates assembly code that an assembly programmer would be fired for writing." Early releases of that compiler would get bug reports from the OS implementation team that "this code is wrong". We would sit down and walk through the generated code, and there would be an "Oh" moment where the programmer would realize that it was simply different, and in all cases, faster.
Similarly, the code generated by good Python, to a CPU, looks like crap because it repeats stuff, allocates and releases memory, makes unnecessary calls, swaps stuff between stack and registers, and a whole host of other sins.
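To make that concrete, here is a rough sketch using the standard `dis` module. Note the assumption: this shows CPython bytecode for its virtual machine, not the machine code the CPU finally runs, and the exact instruction names vary by Python version. The functions `add` and `opnames` are made up for illustration.

```python
import dis


def add(a, b):
    # One readable line of source...
    return a + b


def opnames(func):
    # ...becomes several VM instructions (loads, an operation, a return).
    return [ins.opname for ins in dis.get_instructions(func)]


if __name__ == "__main__":
    # Prints the disassembly; the listing differs between CPython releases.
    dis.dis(add)
```

Even this one-liner turns into a handful of stack-shuffling instructions that nobody would write by hand.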
But we don't care, because what we care about is the end result: the Python or Lua or Ruby code generates lovely computational results. Perhaps 1% of us spend time looking in detail at the actual machine code generated. The purpose of these languages is to 1) let us write good, readable code and 2) generate a useful, sometimes beautiful result.
Similarly, the purpose of this particular Arc program is to generate a useful result, that is, a readable, useful set of html files that our browsers render readably.
So really, "it doesn't matter" if the assembly language is junk, or if the html is not something early versions of Patrick would turn out with notepad.
Think of the generated html as assembly language. It no more degrades our art form than the goofy stuff that modern languages make our CPUs eat.
I'm sorry, but you have an incorrect view of HTML. HTML's sole purpose is not display. If that were the case, it would be a simple matter of identifying what renders fastest and most accurately, but again, that's not HTML's sole purpose.
The goal of HTML is to convey information about the structure and context of the content, not just how it is displayed. So we must care about the HTML generated. That is essential to its function of providing additional machine-readable information.
So what information and structure of the HN site is lost by the way it is currently marked up? What additional machine-readable information is missing in the way that HN (or really any other site) produces its pages? Search?
There are a number of aggregator projects that various HN members have built by scraping this "broken" html and they seem to work quite nicely.
I spent a couple of years deep in the SGML world and am fully cognizant of all the arguments about how content needs to be completely separated from presentation. HTML is really a weak sister in that world. I don't think my view of HTML is incorrect.
In the real world, the ship of requiring grammatically correct HTML left the harbor back in the '90s. If what you say were true, browsers would refuse to render broken HTML.
You clearly know the answer to your own question, but you don't think it's important. I'd ask you to take a step back and have a look at what I'm saying. I'm saying that I understand why pg has set his priorities as he has. All I'm suggesting is that we be honest about it. Let's not say it "doesn't matter".
Currently, aggregator projects work with HN because they know and understand the HN markup specifically. In an ideal world (and one in which we don't live, obviously), a "scraper" library should be able to identify things like comment streams based on contextual information. Think of the power that comes just from having indexes that are able to identify the title and content body. Now, what if we take that a step further and build an indexer that can recognize comments, one that can infer that one comment is made in reference to another based on its nested hierarchy? Are tables the right structure for that?
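As a rough sketch of what I mean, assume a site marked each comment up as a nested `<article class="comment">` element. That markup is hypothetical (it is not what HN emits), and `CommentTreeParser` and `parse_comments` are names I made up. With it, a generic indexer can recover the reply hierarchy from nesting alone, using nothing but Python's standard `html.parser`:

```python
from html.parser import HTMLParser


class CommentTreeParser(HTMLParser):
    """Builds a reply tree from nested <article class="comment"> elements."""

    def __init__(self):
        super().__init__()
        self.roots = []          # top-level comments
        self._stack = []         # comment nodes that are currently open
        self._depths = []        # <article> depth at which each open comment began
        self._article_depth = 0  # how many <article> elements we are inside

    def handle_starttag(self, tag, attrs):
        if tag != "article":
            return
        self._article_depth += 1
        classes = (dict(attrs).get("class") or "").split()
        if "comment" in classes:
            node = {"text": "", "replies": []}
            # A comment opened inside another comment is a reply to it.
            siblings = self._stack[-1]["replies"] if self._stack else self.roots
            siblings.append(node)
            self._stack.append(node)
            self._depths.append(self._article_depth)

    def handle_endtag(self, tag):
        if tag != "article":
            return
        # Close the innermost comment only if this </article> matches it.
        if self._depths and self._depths[-1] == self._article_depth:
            self._stack.pop()
            self._depths.pop()
        self._article_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._stack and text:
            node = self._stack[-1]
            node["text"] = (node["text"] + " " + text).strip()


def parse_comments(html):
    parser = CommentTreeParser()
    parser.feed(html)
    parser.close()
    return parser.roots


# Hypothetical markup, invented for this example.
SAMPLE = """
<article class="comment"><p>HTML is like assembly.</p>
  <article class="comment"><p>HTML also carries structure.</p>
    <article class="comment"><p>What structure is missing?</p></article>
  </article>
</article>
<article class="comment"><p>Unrelated top-level comment.</p></article>
"""

if __name__ == "__main__":
    for root in parse_comments(SAMPLE):
        print(root["text"], "| direct replies:", len(root["replies"]))
```

The parser knows nothing about any particular site; the "who replied to whom" relationship falls out of the nesting. Doing the same against a table layout means hard-coding that site's indentation tricks.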
I'm asking you to dream. I'm asking you not to be complacent with the tools that "work" today. That's all. If you're content to use what you've got, and you don't care if we ever end up with markup that enables these powerful new ways of relating to data, fine, but don't say it doesn't matter. It matters.