
> The best feature of SuperHTML is that it is based on the official HTML living specification, but it deviates from it when it makes sense to do so.

I think this is actually a decent design decision. Unlike in networking, where application of Postel's Law (specifically, the "be liberal in what you accept" part) can lead to implementation details becoming a de facto part of the spec, SuperHTML isn't outputting data to another system for further processing - it's giving information (validation errors) to a human for their own use. If this information leads to humans writing HTML that's a more-robust strict subset of the formal HTML spec, all the better.



> a more-robust strict subset of the formal HTML spec

I still think we’d be better off just using XHTML. There are some practical problems with XHTML5 (e.g. for some reason there’s no support for named entities), but at least the syntax makes sense.


That was tried 20 years ago and it turns out that humans are not good at writing XML.

XML makes sense if you are authoring HTML by hand in an editor. However, that is not how most HTML is actually produced: it's mostly produced by templating engines. This means that you can't validate the XHTML during development because it's being generated on the fly. You only find out whether it's valid in testing or production, perhaps only for a subset of users in certain situations. With HTML this is OK because there is error recovery. With XHTML you get literal downtime, because in the worst case the entire page shows a WSOD.
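To illustrate the difference (a minimal sketch in Python, not anyone's production setup; html5lib stands in for a browser's spec-compliant HTML parser, and the broken markup is made up):

    import xml.etree.ElementTree as ET
    import html5lib  # third-party; parses per the WHATWG HTML spec

    broken = "<p>Hello <b>world</p>"  # the <b> is never closed

    # HTML parsing always succeeds: the spec defines exactly how to recover,
    # so the reader still gets a usable page.
    tree = html5lib.parse(broken)
    print(tree is not None)  # True

    # XML parsing fails hard; this is what a browser turns into an error
    # page when the same mistake ships as real XHTML.
    try:
        ET.fromstring(broken)
    except ET.ParseError as err:
        print("would be downtime:", err)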

Yes, XHTML is okay as an internal tool: if for some reason your pipeline depends on parsing your own HTML, then switching to XHTML internally could be a win. Just don't ship XHTML to browsers.


Surely a template engine would be able to produce valid (X)HTML?

Strict XHTML failed on the web because older browsers could not show it at all (since it used a different MIME type), so nobody sane would use it. The problem wasn’t the strictness per se; the problem was that it was introduced without concern for backwards compatibility.

JavaScript is strict in the sense that any syntax error will terminate execution. This seems to work fine because there is an incentive to make the syntax valid.

If XHTML was introduced in a backwards compatible way but new features (like canvas) only worked in strict mode, I’m sure it would have caught on. The incentives just have to be there.


IE6’s refusal to display any page served with the XHTML MIME type was certainly the main reason nobody deployed real XHTML, but the overstrictness was not far behind. It was hard enough to justify a complete rewrite of your website’s HTML; even harder when any encoding error or tag imbalance generated by your CMS would display the yellow screen of death rather than a best guess, or even just everything up to the error:

https://commons.wikimedia.org/wiki/File:Yellow_screen_of_dea...


If there were an actual benefit to using XHTML, I’m sure CMSes would be updated to support it. It is not as if producing syntactically valid JSON or SVG, for example, is an impossible problem.
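For what it's worth, "valid by construction" is exactly what every JSON library already gives you; a tiny Python illustration (the data here is invented):

    import json

    # The serializer handles quoting and escaping, so the output is
    # syntactically valid no matter how awkward the input data is.
    comment = {"author": "someone", "body": 'said "hello" & left <b>quickly</b>'}
    print(json.dumps(comment))  # always well-formed JSON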

As “use strict” in JavaScript shows, it is possible to introduce stricter parsing of an existing format, as long as it is an explicit opt-in and existing content is unaffected.


I think the main problem with CMSes supporting XHTML would be that basically every single one uses a template engine that treats HTML as a string of characters.

Is there a templating system that’s easy to use (think Jinja or something Svelte-like), but parses templates as XML instead of just concatenating a bunch of strings?
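Not a direct answer, but the underlying idea is easy to sketch with the Python standard library: build the page as a tree rather than as a string, and well-formedness and escaping come for free (the page content below is invented):

    import xml.etree.ElementTree as ET

    def render_page(title: str, items: list[str]) -> bytes:
        """Build the document as a tree; the serialization cannot be malformed."""
        html = ET.Element("html", xmlns="http://www.w3.org/1999/xhtml")
        head = ET.SubElement(html, "head")
        ET.SubElement(head, "title").text = title
        body = ET.SubElement(html, "body")
        ul = ET.SubElement(body, "ul")
        for item in items:
            ET.SubElement(ul, "li").text = item  # text is escaped automatically
        return ET.tostring(html)

    print(render_page("Demo", ["a < b", "Tom & Jerry"]).decode())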


I think if XHTML were pushed forward, the second problem would be swiftly solved: we'd have a lot more systems that treat webpages as XML documents rather than just templated text. And text-based systems could easily validate their XHTML output and report failures quickly, as opposed to now, where you get a broken page and have to go check whether your HTML is malformed.
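As a sketch of the "report failures quickly" part, assuming the rendered output is meant to be XHTML (the check below is just an illustration, not a specific tool):

    import xml.etree.ElementTree as ET

    def assert_wellformed(rendered: str) -> None:
        """Fail fast with a line/column instead of shipping a broken page."""
        try:
            ET.fromstring(rendered)
        except ET.ParseError as err:
            line, column = err.position
            raise AssertionError(
                f"malformed XHTML at line {line}, column {column}: {err}")

    # e.g. in a test suite, right after rendering a template:
    assert_wellformed(
        "<html xmlns='http://www.w3.org/1999/xhtml'><body><p>ok</p></body></html>")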


For better or worse, XHTML, also known as the XML serialization of HTML, cannot represent all valid HTML documents. HTML and XML are different languages with vastly different rules, and at this point it's fairly moot to consider replacing one with the other.

Many of the "problems" with HTML are still handled adequately simply by using a spec-compliant parser instead of regular expressions, string functions, or attempting to parse HTML with XML parsers like PHP's `DOMDocument`.

Every major browser engine and every spec-compliant parser interprets any given HTML document in the same prescribed, deterministic way. HTML parsers aren't "loose" or "forgiving" - they simply have fully-defined behavior in the presence of errors.

This turned out to be a good thing, because people tend to prefer being able to read _most_ of a document when _some_ errors are present. "Draconian error handling" makes software easier to write, but it largely deals with errors by pretending they can't exist.
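A small illustration of that fully-defined behavior, using Python's html5lib as the spec-compliant parser (the markup is a contrived example): misnested and unclosed tags still produce one well-defined tree, the same tree every browser builds.

    import html5lib  # third-party; implements the WHATWG parsing algorithm

    messy = "<p>one<p>two<b>three"
    doc = html5lib.parse(messy, namespaceHTMLElements=False)

    # The spec prescribes exactly what tree results: two sibling <p>
    # elements, with the <b> inside the second one.
    body = doc.find("body")
    print([child.tag for child in body])     # ['p', 'p']
    print([child.tag for child in body[1]])  # ['b']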


TypeScript+JSX is what XHTML wanted to be.


Clearly not the case: the point of a data language is to free you to pick whatever programming language you like to produce it, and the point of a specification is to allow agreement without a specific implementation in a particular language.


That’s exactly what happened. We write JSX, which gets compiled down to assembly, excuse me, HTML5 or XHTML or whatever. Fine by me, as long as we accept that writing it by hand is not what engineering time should be spent on in the overwhelming majority of cases.

(I’d also like a word with yaml while we’re at it…)


Like what most linters do? Linters are typically stricter about the syntax precisely so that programmers can enforce their own rules.


It seems the deviations only make it more strict? It flags some valid HTML as errors.

But I agree with you in general about Postel's Law.


Yes, you nailed it with "the deviations only make it more strict" - that's why I think that it's reasonable.

I believe that, in general, when your implementation deviates from a spec, you should have a good reason to do so, so that you don't end up multiplying incompatible implementations - at the very least you should think about why you're incompatible. I just think that this is a good reason.


> but it deviates from it when it makes sense to do so

I guess it's OK to take liberties when the "official spec" (WHATWG HTML? which version? a W3C snapshot? an older, superseded W3C spec? or even MDN, as the author is saying elsewhere?) has evolved over the course of ten or more years, but the baseline and the places where the model deviates from it should be documented, shouldn't they?

Doing so might also help with the issues found when this was first discussed [1], such as the bogus hgroup deprecation and the possibly related h1-h6 end-element tags. For context, hgroup is not WHATWG-deprecated, even though it was never part of W3C HTML until 2021 and is marked deprecated in the MDN reference the author consulted (which is not canonical). What has changed (in a backward-incompatible way!) is its content model, because its role (providing a mechanism to prevent HTML outlining in ToC/navigational content) no longer applies now that outlining has been removed altogether.

See the details in [3]; you can use the referenced SGML DTD grammar [2] for checking against that particular (arguably bogus and no longer deserving of the "HTML 5" moniker) HTML version/snapshot, as well as against earlier WHATWG and W3C HTML snapshots/recommendations.

[1]: https://lobste.rs/s/o9khjn/first_html_lsp_reports_syntax_err...

[2]: https://sgmljs.net/docs/html230116.html

[3]: https://sgmljs.net/blog/blog2303.html



