I really don't see any advantage to UTF-32 when not-officially-standardised UTF-24 has the same constant-size 3-byte codepoints (and multiplying by 3 is not hard - it's n + 2n); in UTF-32, the highest byte will never be anything other than 0, so it's essentially permanent waste.
Also, according to https://en.wikipedia.org/wiki/List_of_Unicode_characters there are currently fewer than 150k codepoints defined, so even 21 bits is several times larger than necessary: 18 bits will hold all the currently assigned codepoints, and will be sufficient until 256k codepoints are reached.
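For concreteness, a minimal sketch of what a flat UTF-24 array could look like in C (the utf24_put/utf24_get names and the little-endian byte order are my own choices for illustration, not part of any standard):

    #include <stdint.h>
    #include <stddef.h>

    /* Store code point i at byte offset i*3, computed as n + 2n. */
    static void utf24_put(uint8_t *buf, size_t i, uint32_t cp) {
        size_t off = i + 2 * i;                      /* i * 3 */
        buf[off]     = (uint8_t)(cp & 0xFF);
        buf[off + 1] = (uint8_t)((cp >> 8) & 0xFF);
        buf[off + 2] = (uint8_t)((cp >> 16) & 0xFF); /* 21 bits fit in 3 bytes */
    }

    static uint32_t utf24_get(const uint8_t *buf, size_t i) {
        size_t off = i + 2 * i;
        return (uint32_t)buf[off]
             | ((uint32_t)buf[off + 1] << 8)
             | ((uint32_t)buf[off + 2] << 16);
    }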
UTF-32 is easy to understand for educational purposes, but it's probably a mistake to use it as a real string representation, and almost nobody does. Code units and code points are the same thing in UTF-32 but different in UTF-16 and UTF-8, so you can teach someone UTF-32 before they understand the distinction. Obviously, UTF-24 isn't used because it isn't a standard encoding, and if you really wanted to save memory, you'd use UTF-8 instead, which is more compact still.
As for UTF-16: today, the only reason people choose it for new projects is that it's the native internal encoding of the ICU library. If you're not using ICU, it's pretty hard to defend anything but UTF-8.
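To make the code unit / code point distinction concrete: a code point outside the BMP, say U+1F600, is one code point in all three encodings, but a different number of code units in each. A rough sketch:

    #include <stdint.h>

    /* U+1F600 in each encoding: the same single code point,
       but a different number of code units. */
    uint32_t u32[] = { 0x1F600 };                /* 1 UTF-32 code unit */
    uint16_t u16[] = { 0xD83D, 0xDE00 };         /* 2 UTF-16 code units (surrogate pair) */
    uint8_t  u8[]  = { 0xF0, 0x9F, 0x98, 0x80 }; /* 4 UTF-8 code units */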
> I really don't see any advantage to UTF-32 when not-officially-standardised UTF-24 has the same constant-size 3-byte codepoints (and multiplying by 3 is not hard - it's n + 2n); in UTF-32, the highest byte will never be anything other than 0, so it's essentially permanent waste.
Wouldn’t the difference in alignment (4 bytes versus 3) make UTF-32 faster than UTF-24 in certain cases, on certain CPU architectures? If so, the always-zero byte would be trading space for greater performance.
People use UTF-32 sometimes because computers have 32-bit ints but not 24-bit ints. If you want a single primitive type to represent a code point, that's gonna be a 32-bit int. If you make an array of those, that's UTF-32.
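In other words (just a sketch of the point, not a full string type): UTF-32 is what you get when "one code point" and "one array element" coincide, and indexing is a single aligned 4-byte load rather than reassembling three bytes.

    #include <stdint.h>
    #include <stddef.h>

    typedef uint32_t codepoint;   /* one primitive type per code point */

    /* A UTF-32 "string" is just an array of those. */
    static codepoint nth(const codepoint *s, size_t i) {
        return s[i];              /* one aligned load: base + (i << 2) */
    }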
Fitting all currently assigned code points within 18 bits would be easy: you would only have to move one range.
Above 32FFF, the only assigned code points are E0000 to E01EF, and that range fits into the unused gap between 32FFF and 3FFFF with room to spare.
Those code points are used for flag emojis and for selecting uncommon CJK variants. If you don't support those, you could just strip out anything that doesn't fit in 18 bits to begin with.
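A sketch of that remapping, assuming (as above) that E0000 to E01EF is the only assigned range above 32FFF; the 0x33000 base is an arbitrary choice of slot inside the unused gap:

    #include <stdint.h>

    #define REMAP_BASE 0x33000u   /* anywhere from 0x33000 up to 0x3FE10 would work */

    static uint32_t to18(uint32_t cp) {
        if (cp >= 0xE0000u && cp <= 0xE01EFu)
            return REMAP_BASE + (cp - 0xE0000u);  /* lands in 0x33000..0x331EF */
        return cp;                                /* everything else is already below 0x33000 */
    }

    static uint32_t from18(uint32_t v) {
        if (v >= REMAP_BASE && v <= REMAP_BASE + 0x1EFu)
            return 0xE0000u + (v - REMAP_BASE);
        return v;
    }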
> Also, according to https://en.wikipedia.org/wiki/List_of_Unicode_characters there are currently fewer than 150k codepoints defined, so even 21 bits is several times larger than necessary: 18 bits will hold all the currently assigned codepoints, and will be sufficient until 256k codepoints are reached.
Incidentally, 18-bit architectures were once common: https://en.wikipedia.org/wiki/18-bit_computing