In a previous lifetime, I worked on the code generator for a compiler for an imp...

bradleyland · on Dec 19, 2010

I'm sorry, but you have an incorrect view of HTML. HTML's sole purpose is not display. If that were the case, it would be a simple matter of identifying what renders fastest and most accurately, but again, that's not HTML's sole purpose.

The goal of HTML is to convey information about the structure and context of the content, not just how it is displayed. So we must care about the HTML generated. It is requisite to its function in providing additional machine-readable information.

wglb · on Dec 20, 2010

So what information about the content and structure of the HN site is missing from the way it is currently displayed? What additional machine-readable information is missing from the way that HN (or really any other site) produces its pages? Search?

There are a number of aggregator projects that various HN members have built by scraping this "broken" HTML, and they seem to work quite nicely.

I spent a couple of years deep in the SGML world and am fully cognizant of all the arguments about how content needs to be completely separated from presentation. HTML is really a weak sister in that world. I don't think my view of HTML is incorrect.

In the real world, the ship of requiring grammatically correct HTML left the harbor back in the '90s. If what you say were true, browsers would refuse to render broken HTML.

bradleyland · on Dec 20, 2010

You clearly know the answer to your own question, but you don't think it's important. I'd ask you to take a step back and have a look at what I'm saying. I'm saying that I understand why pg has set his priorities as he has. All I'm suggesting is that we be honest about it. Let's not say it "doesn't matter".

Currently, aggregator projects work with HN because they know and understand the HN markup specifically. In an ideal world (one in which we don't live, obviously), a "scraper" library should be able to identify things like comment streams based on contextual information. Think of the power that comes just from having indexes that are able to identify the title and content body. Now, what if we take that a step further and build an indexer that can recognize comments, one that can infer that a comment is made in reference to another based on its position in the nested hierarchy? Are tables the right structure for that?
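
To make that concrete, here is a minimal sketch of such an indexer, under some loud assumptions: the markup below is hypothetical (it is not what HN actually emits), and it assumes each comment is an `<article class="comment">` element with an `id`, with replies nested inside the comment they answer. The names `CommentTreeParser`, `reply_map`, and `SAMPLE` are made up for illustration. Given markup like that, the reply relationships fall out of the nesting alone, with no site-specific knowledge:

```python
from html.parser import HTMLParser


class CommentTreeParser(HTMLParser):
    """Infer who-replied-to-whom purely from element nesting.

    Assumption: comments are <article class="comment" id="..."> elements
    and a reply is nested inside the comment it answers.
    """

    def __init__(self):
        super().__init__()
        self.open_articles = []  # one flag per open <article>: is it a comment?
        self.stack = []          # ids of the comment elements currently open
        self.parent_of = {}      # comment id -> parent comment id (None = top level)

    def handle_starttag(self, tag, attrs):
        if tag != "article":
            return
        attrs = dict(attrs)
        is_comment = "comment" in (attrs.get("class") or "").split()
        self.open_articles.append(is_comment)
        if is_comment:
            comment_id = attrs.get("id")
            # The innermost open comment, if any, is the one being replied to.
            self.parent_of[comment_id] = self.stack[-1] if self.stack else None
            self.stack.append(comment_id)

    def handle_endtag(self, tag):
        # Only pop the comment stack when the closing <article> was a comment.
        if tag == "article" and self.open_articles:
            if self.open_articles.pop():
                self.stack.pop()


def reply_map(html):
    """Return a dict mapping each comment id to the id it replies to."""
    parser = CommentTreeParser()
    parser.feed(html)
    parser.close()
    return parser.parent_of


# Hypothetical markup, invented for this example.
SAMPLE = """
<article class="comment" id="c1"><p>HTML is more than display.</p>
  <article class="comment" id="c2"><p>What is missing, then?</p>
    <article class="comment" id="c3"><p>Machine-readable structure.</p></article>
  </article>
</article>
<article class="comment" id="c4"><p>A separate top-level comment.</p></article>
"""

if __name__ == "__main__":
    for comment_id, parent_id in reply_map(SAMPLE).items():
        print(comment_id, "replies to", parent_id)
```

The same parser would work unchanged on any site that used this convention, which is the point: with table-based layout, the hierarchy has to be reverse-engineered per site from things like indentation spacers.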

I'm asking you to dream. I'm asking you not to be complacent with the tools that "work" today. That's all. If you're content to use what you've got, and you don't care if we ever end up with markup that enables these powerful new ways of relating to data, fine, but don't say it doesn't matter. It matters.