
There are many reasons why UTF-8 is a better encoding but UTF-16 does at least have the benefit of being simpler. Every scalar value is either encoded as a single unit or a pair of units (leading surrogate + trailing surrogate).
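For illustration, here's that rule as a Python sketch (`utf16_units` is a made-up helper, not a real API):

  # Any Unicode scalar value is one 16-bit unit (BMP) or a surrogate pair.
  def utf16_units(cp: int) -> list[int]:
      if cp < 0x10000:       # BMP; scalar values skip the D800-DFFF range
          return [cp]
      v = cp - 0x10000       # 20 bits, split 10/10 across the pair
      return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]

  print([hex(u) for u in utf16_units(ord("😀"))])  # ['0xd83d', '0xde00']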

However, PowerShell (or, more often, the host console) has a lot of issues with handling Unicode. This has been improving in recent years, but it's still a work in progress.



UTF-16 only makes sense if you were sure UCS-2 would be fine, and then, oops, Unicode turned out to need more than 16 bits, so UCS-2 won't work and you have to cope somehow. It makes zero sense to adopt it in greenfield projects today, whereas Java and Windows, which had bought into UCS-2 back in the early-to-mid 1990s, needed UTF-16 or else they would have had to throw all their 16-bit text APIs away and start over.

UTF-32 / UCS-4 is fine but feels very bloated, especially if a lot of your text data is more or less ASCII, which, if it's not literally human prose, it usually will be. And it feels a bit bloated even on a good day: it's always wasting 11 bits per character!
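(The 11-bit figure is just arithmetic: Unicode tops out at U+10FFFF, which fits in 21 bits.)

  print((0x10FFFF).bit_length())       # 21
  print(32 - (0x10FFFF).bit_length())  # 11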

UTF-8 is a little more complicated to handle than UTF-16, and certainly than UTF-32, but it's nice and compact, and it's pretty ASCII-compatible (lots of tools that work with ASCII also work fine with UTF-8, unless you insist on adding a spurious UTF-8 "byte order mark" to the front of the text), so it was a huge success once it was designed.
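Both properties are easy to check, e.g. in Python:

  # ASCII bytes are unchanged under UTF-8, which is why ASCII tools keep working
  print("plain ascii".encode("utf-8") == "plain ascii".encode("ascii"))  # True
  # ...unless something prepends that spurious BOM:
  print("hi".encode("utf-8-sig"))  # b'\xef\xbb\xbfhi'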


As I said, there are many reasons UTF-8 is a better encoding. And indeed, compact, backwards-compatible encoding of ASCII is one of them.


It is less compact than UTF-16 for CJK languages, FWIW.


That's only really true when the entire string is using CJK characters, but as far as I know, that's going to be fairly rare.

For example, this Chinese Wikipedia page is almost twice as big in UTF-16 as in UTF-8, because the vast majority of the content seems to be HTML tags:

  $ curl -s 'https://zh.wikipedia.org/wiki/中华人民共和国' | wc -c
  1877366
  $ curl -s 'https://zh.wikipedia.org/wiki/中华人民共和国' | iconv -f utf-8 -t utf-16 | wc -c
  3345724
If I strip out all the printable ASCII characters, it does indeed become about 33% more compact in UTF-16:

  $ curl -s 'https://zh.wikipedia.org/wiki/中华人民共和国' | tr -d '[:print:]' | wc -c
  311226
  $ curl -s 'https://zh.wikipedia.org/wiki/中华人民共和国' | tr -d '[:print:]' | iconv -f utf-8 -t utf-16 | wc -c
  213444
With gzip compression it is only about 10% more compact:

  $ curl -s 'https://zh.wikipedia.org/wiki/中华人民共和国' | tr -d '[:print:]' | gzip -9c | wc -c
  98373
  $ curl -s 'https://zh.wikipedia.org/wiki/中华人民共和国' | tr -d '[:print:]' | iconv -f utf-8 -t utf-16 | gzip -9c | wc -c
  89031
So I guess if you're archiving pure CJK text, maybe you could get a 10% benefit, though I suspect non-Unicode encodings of that text would be more compact anyway.
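A quick Python check of that last suspicion: for raw (uncompressed) sizes, a legacy encoding like GB2312 comes out the same as UTF-16 at two bytes per character, though compression may shift this:

  s = "中华人民共和国" * 1000
  for enc in ("utf-8", "utf-16-le", "gb2312"):
      print(enc, len(s.encode(enc)))
  # utf-8 21000, utf-16-le 14000, gb2312 14000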


Microsoft did a comparison between UTF-8 and UTF-16 for SQL Server and found pretty much the same results.

I can't find the more detailed article, but this summarizes it: https://techcommunity.microsoft.com/t5/sql-server-blog/intro...

From what I can remember, UTF-8 consumes more CPU as it's more complex to process, has space savings for mostly-ASCII and European code pages, but can significantly bloat storage sizes for character sets that consistently require 3 or 4 bytes per character.
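The per-character numbers behind that tradeoff are easy to reproduce (Python; note the 4-byte case is a wash, since those are surrogate pairs in UTF-16 too):

  for label, ch in [("ASCII", "a"), ("Latin-1", "é"), ("CJK", "中"), ("emoji", "😀")]:
      print(label, len(ch.encode("utf-8")), len(ch.encode("utf-16-le")))
  # ASCII 1 vs 2, Latin-1 2 vs 2, CJK 3 vs 2, emoji 4 vs 4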


> There are many reasons why UTF-8 is a better encoding but UTF-16 does at least have the benefit of being simpler.

Big endian or little endian?
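Python makes the ambiguity visible; the generic codec has to prepend a BOM to say which order it picked (native order, so little-endian on most machines):

  print("中".encode("utf-16-be"))  # b'N-'  (0x4E 0x2D)
  print("中".encode("utf-16-le"))  # b'-N'  (0x2D 0x4E)
  print("中".encode("utf-16"))     # b'\xff\xfe-N' on a little-endian machine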


And did they handle surrogate pairs correctly?

My team managed a system that read user data and did input validation on it. One day we got a smart-quote character that happened to be above U+10000. But because the validation happened in chunks, we only got half of it, which was an invalid character, so input validation failed.

In UTF-8, partial characters happen so often that they're likely to get tested. In UTF-16 they're seen more rarely, so things work until someone pastes in an emoji, and then it all falls apart.
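The failure mode is easy to reproduce in miniature (Python, using UTF-8 here, but a surrogate pair split across chunks fails the same way); the fix is a decoder that keeps state across chunks:

  data = "x😀".encode("utf-8")  # b'x\xf0\x9f\x98\x80'
  a, b = data[:3], data[3:]     # the 4-byte character straddles the chunk boundary
  try:
      a.decode("utf-8")         # naive per-chunk decode blows up
  except UnicodeDecodeError as e:
      print("chunk 1 invalid:", e.reason)

  import codecs                 # an incremental decoder carries state across chunks
  dec = codecs.getincrementaldecoder("utf-8")()
  print(dec.decode(a) + dec.decode(b, final=True))  # x😀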


LOL


> There are many reasons why UTF-8 is a better encoding but UTF-16 does at least have the benefit of being simpler. Every scalar value is either encoded as a single unit or a pair of units (leading surrogate + trailing surrogate).

UTF-16 is really not noticeably simpler. Decoding UTF-8 is really rather straightforward in any language which has even minimal bit-twiddling abilities.

And that’s assuming you need to write your own encoder or decoder, which seems unlikely.
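For what it's worth, the core of a UTF-8 decoder really is just a few lines. This sketch skips the validation a real one needs (overlong forms, stray continuation bytes, truncation):

  def decode_utf8(bs: bytes) -> str:
      i, out = 0, []
      while i < len(bs):
          b = bs[i]
          if b < 0x80:            n, cp = 1, b
          elif b >> 5 == 0b110:   n, cp = 2, b & 0x1F
          elif b >> 4 == 0b1110:  n, cp = 3, b & 0x0F
          elif b >> 3 == 0b11110: n, cp = 4, b & 0x07
          else: raise ValueError("invalid leading byte")
          for c in bs[i + 1 : i + n]:  # continuation bytes: 10xxxxxx
              cp = (cp << 6) | (c & 0x3F)
          out.append(chr(cp))
          i += n
      return "".join(out)

  print(decode_utf8("héllo 😀".encode("utf-8")))  # héllo 😀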


I have a MySQL database with a large amount of Japanese text data. When I convert it from UTF-8 to UTF-16, it certainly reduces disk space.



