
Nothing, but Java's "char" type is always going to be 16-bit.


Yep. In JavaScript (and, from memory, Java and C#) the String.length property is the length of the UTF-16 encoding, in code units. It’s essentially useless. I don’t know if I’ve ever seen a valid use for the JavaScript String.length field in a program which handles Unicode correctly.

There are 3 valid (and useful) ways to measure a string depending on context:

- Number of Unicode code points (useful in collaborative editing)

- Byte length when encoded (these days usually in UTF-8)

- and the number of rendered grapheme clusters

All of these measures are identical in ASCII text - which is an endless source of bugs.

Sadly these languages give you a deceptively useless .length property and make you go fishing when you want to make your code correct.
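
For example, in Java (which has the same UTF-16 .length semantics as JavaScript) the measures all disagree on a short string. A rough sketch, class name made up; the exact grapheme count depends on the JDK's Unicode tables:

    import java.nio.charset.StandardCharsets;
    import java.text.BreakIterator;

    public class StringLengths {
        public static void main(String[] args) {
            // "e" + COMBINING ACUTE ACCENT + U+1F642 (an emoji outside the BMP)
            String s = "e\u0301\uD83D\uDE42";

            System.out.println(s.length());                                 // 4 UTF-16 code units
            System.out.println(s.codePointCount(0, s.length()));            // 3 Unicode code points
            System.out.println(s.getBytes(StandardCharsets.UTF_8).length);  // 7 bytes in UTF-8

            // Grapheme clusters via BreakIterator: the accent folds into the "e"
            BreakIterator it = BreakIterator.getCharacterInstance();
            it.setText(s);
            int clusters = 0;
            while (it.next() != BreakIterator.DONE) clusters++;
            System.out.println(clusters);                                    // 2 grapheme clusters
        }
    }

Four different answers for one short string, and .length hands you the one you almost never want.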


> and the number of rendered grapheme clusters

This is also rarely useful unless you are working with a monospace font where all grapheme clusters have the same width, which probably describes no font at all once you support double-width characters. More likely what you are interested in is the display width with a particular font, or the column count with a monospace font.


Java's char is a strong competitor for the "most stupid char type" award.

I would give it to Java outright if not for the fact that C's char type doesn't define how big it is at all, nor whether it is signed. In practice it's probably a byte, but you aren't actually promised that, and even if it is a byte you aren't promised whether it is treated as signed or unsigned; that's implementation-dependent. Completely useless.

For years I thought char was just pointless, and even today I would still say that a high-level language like Java (or JavaScript) should not offer a "char" type: the problems you're solving with these languages are so unlikely to make effective use of such a type that it's far from essential. Just have a string type, provide methods acting on strings, and forget "char". But Rust did show me that a strongly typed systems language might actually have some use for a distinct type here (Rust's char really does only hold the 21-bit Unicode scalar values; you can't put arbitrary 32-bit values in it, nor UTF-16's surrogate code points), so I'll give it that.
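
To make the contrast concrete, here's a small Java sketch (class name made up) showing that a Java char will happily hold a lone surrogate, a value Rust's char rejects by construction (char::from_u32(0xD83D) is None):

    public class CharIsJustACodeUnit {
        public static void main(String[] args) {
            // Java's char is an unsigned 16-bit UTF-16 code unit. A lone high
            // surrogate is a perfectly legal char value even though it is not
            // a Unicode scalar value and can't appear alone in well-formed text.
            char lone = '\uD83D';
            System.out.println(Character.isHighSurrogate(lone));          // true
            System.out.println((int) lone);                                // 55357

            // Anything outside the BMP doesn't fit in one char at all; it
            // needs a surrogate pair (two chars) or an int code point.
            int rocket = 0x1F680;
            System.out.println(Character.charCount(rocket));               // 2
            System.out.println(new String(Character.toChars(rocket)));     // the rocket emoji
        }
    }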


The only guarantee that C gives you is that sizeof(char) == 1, and even that's not as useful as it looks.


It also guarantees that char is at least 8 bits.


And POSIX does guarantee that CHAR_BIT == 8, so in practice this is only a concern on embedded platforms where you're only dealing with "C-ish" anyway.


Or any other non-POSIX system. I'm not sure what Windows guarantees, but it's definitely not a POSIX system.


How many non-embedded non-POSIX systems do you know? Windows also guarantees CHAR_BIT == 8, and since most software is first written for Windows or POSIX, there is plenty of software that assumes CHAR_BIT == 8. That means anything that wants to run general software needs to also ensure CHAR_BIT == 8, not to mention all the algorithms and data formats designed around being able to efficiently access octets. The only platforms that can get away with CHAR_BIT != 8 are precisely those that have software specially written for them, i.e. embedded systems.


You’re being far too harsh. The Java char type isn’t “stupid”; it’s just unfortunate in hindsight. There are plenty of decisions that were stupid even at the time they were made, and this isn’t one of them: people genuinely thought that 2 bytes would be enough for all characters and that Han unification was going to work. In hindsight this is “obviously” futile, but it certainly wasn’t obvious then.

C’s character type, FWIW, has a use: it more or less indicates the granularity that is efficiently addressable by the host architecture. Trying to use it for more than that is generally not that fruitful, but it definitely has a purpose and it’s pretty good at that.

Finally, speaking of unfortunate decisions, Rust happens to make one that I don’t particularly like: it lets you slice strings at positions that aren’t character boundaries (and panics), which is…not great. It would be much nicer if the view just didn’t let you do this unless you specifically asked for bytes or something.


When Java was first conceived UTF-16 didn't exist, but we shouldn't rewrite history entirely here: Java 1.0 and Unicode 2.0 (which introduced UTF-16) are from the same year. It would have been wiser (albeit drastic) to pull char from Java 1.0, reserve the word char and the literal syntax, and spend a year or two deciding what you actually wanted here in light of the fact that Unicode was not going to be a 16-bit encoding.

And again, I don't think Java really needed 'char' at all; it's the sort of low-level implementation detail Java has been trying to escape from, so this is a needless self-inflicted wound. I think Java has a char for the same reason it has both increment operators: C does it, and Java wanted to look like C so as not to scare the C programmers.

C's unsigned char could just be named "byte", and if signed char must exist, call that "signed byte". The old C standard actually pretends these are characters, which of course today they clearly aren't, which is why this is a thread about UTF-8. I don't have any objection to a byte type, especially in a low-level language.

Presumably your Rust annoyance is related to things like String::insert? But I don't understand how this problem arises: if you are inserting characters at random positions in a String, that's just going to produce nonsense. I can't conceive of a situation where I want to insert characters (or sub-strings) unless I know exactly where they're supposed to go relative to what is in the string already, in which case it won't panic.


I don’t get your argument at all. People want characters from their strings, and Java decided on UTF-16 because at the time it seemed like the “right” way to do Unicode. What would you suggest they should have adopted back then? Similarly, C’s char type is named “char” because people dealt with ASCII back then and a character used to be a byte. It turns out that sucks, but being able to do byte arithmetic is cool, so it’s still around for that purpose (and C++ has actually added std::byte for exactly this; perhaps C will get it as well at some point). As for Rust, this is just a case of holding it wrong: the operation is generally not relevant, so why even expose it? It doesn’t make sense to allow random indexing if you’re just going to crash on misalignment. It would be better to have an API that doesn’t allow misalignment at all: see Swift’s implementation, for example.


People should stop wanting "characters from their strings", especially in the sort of high-level software you'd attempt in Java, and Java was in a good position to do that the way we've successfully done it for similar things: by not providing the misleading API shape. Reserve char but don't implement it is what I'm saying, like goto.

Compare, for example, decryption, where we learned not to provide decrypt(someBytes) and checkIntegrity(someBytes) even though that's what people often want; it's a bad idea. Instead we provide decrypt(wholeBlock), and you can't call it until you've got a whole block we can do integrity checks on; it fails without releasing bogus plaintext if the block was tampered with. An entire class of stupid bugs becomes impossible.
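
In Java terms, that's roughly the difference between hand-rolled decrypt-then-check and an AEAD mode like AES-GCM, where doFinal authenticates the whole block and throws before handing you any plaintext. A sketch, not production code (class name made up, key/nonce handling simplified):

    import javax.crypto.*;
    import javax.crypto.spec.GCMParameterSpec;
    import java.nio.charset.StandardCharsets;
    import java.security.SecureRandom;

    public class AeadSketch {
        public static void main(String[] args) throws Exception {
            SecretKey key = KeyGenerator.getInstance("AES").generateKey();
            byte[] iv = new byte[12];
            new SecureRandom().nextBytes(iv);

            Cipher enc = Cipher.getInstance("AES/GCM/NoPadding");
            enc.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
            byte[] block = enc.doFinal("attack at dawn".getBytes(StandardCharsets.UTF_8));

            block[5] ^= 1; // someone tampers with the ciphertext in transit

            Cipher dec = Cipher.getInstance("AES/GCM/NoPadding");
            dec.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
            try {
                dec.doFinal(block); // integrity check fails: no plaintext is ever released
            } catch (AEADBadTagException e) {
                System.out.println("tampered block rejected, nothing decrypted");
            }
        }
    }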

Java should have provided APIs that work on Strings, and said that if you think you care about the things Strings are made up of, either you need a suitable third-party API (e.g. text rendering, spelling) or you want bytes, because that's how Strings are encoded for transmission over the network or storage on disk. You don't want to treat the string as a series of "characters", because they aren't.

The idea that a String is just a vector of characters is wrong; that's not what it is at all. A very low-level language like C, C++ or Rust can be excused for exposing something like that, because it's necessary to the low-level machinery, but almost nobody should be programming at that layer.
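
You can see this in Java itself: index a String as if it were a vector of "characters" and it will cheerfully hand you half of one. A quick sketch (class name made up):

    public class NotAVectorOfCharacters {
        public static void main(String[] args) {
            String s = "\uD83D\uDE42!"; // U+1F642 followed by "!"

            // Indexing by char silently yields half of the emoji: a lone
            // surrogate, which is not a character and not valid on its own.
            char first = s.charAt(0);
            System.out.println(Character.isHighSurrogate(first)); // true

            // substring happily produces a malformed string, no error raised
            String broken = s.substring(0, 1);
            System.out.println(broken.codePointCount(0, broken.length())); // 1, but it's garbage
        }
    }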

Imagine if Java insisted on acting as though your Java references were numbers and it could make sense to add them together. Sure, under the hood they are pointers, and a pointer is an integral type, so you could mechanically add them together, but that's nonsense; you would never write code that needs to do this in Java.

K&R C claimed that char isn't just for representing ASCII (which wasn't, at that time, set in stone as the encoding you'd be using) but for representing the characters of whatever system you're programming, regardless of whether they're ASCII. 'A' wasn't defined as 65 but as whatever the code happens to be for A on your computer. Presumably the current ISO C doesn't make the same foolish claim.


I think you're being too harsh on the C char. It is guaranteed that sizeof(char) == 1, and it is guaranteed to be at least 8 bits long, i.e. long enough for any ASCII character.

These requirements are perfectly good for the needs of a CHARacter type. If you need to control signedness because you want to use the char as a small integer, you can spell out signed char or unsigned char yourself.

In reality, where chars are used to store ASCII, the signedness of the datatype is meaningless because the highest bit is never set.




