I really don't see any advantage to UTF-32 when not-officially-standardised UTF-24 has the same constant-size 3-byte codepoints (and multiplying by 3 is not hard - it's n + 2n); in UTF-32, the highest byte will never be anything other than 0, so it's essentially permanent waste.
Also, according to https://en.wikipedia.org/wiki/List_of_Unicode_characters there are currently fewer than 150k codepoints defined, so even 21 bits is several times larger than necessary: 18 bits will hold all the currently assigned codepoints, and will be sufficient until 256k codepoints are reached.
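For concreteness, a minimal sketch of what a flat UTF-24 array could look like in C (the utf24_put/utf24_get names and the little-endian byte order are my own choices for illustration, not part of any standard):

    #include <stdint.h>
    #include <stddef.h>

    /* Store code point i at byte offset i*3, computed as n + 2n. */
    static void utf24_put(uint8_t *buf, size_t i, uint32_t cp) {
        size_t off = i + 2 * i;                      /* i * 3 */
        buf[off]     = (uint8_t)(cp & 0xFF);
        buf[off + 1] = (uint8_t)((cp >> 8) & 0xFF);
        buf[off + 2] = (uint8_t)((cp >> 16) & 0xFF); /* 21 bits fit in 3 bytes */
    }

    static uint32_t utf24_get(const uint8_t *buf, size_t i) {
        size_t off = i + 2 * i;
        return (uint32_t)buf[off]
             | ((uint32_t)buf[off + 1] << 8)
             | ((uint32_t)buf[off + 2] << 16);
    }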
UTF-32 is easy to understand for educational purposes, but it's probably a mistake to use it as a real string representation, and almost nobody does. Code units and code points are the same thing in UTF-32 but different in UTF-16 and UTF-8, so you can teach someone UTF-32 before they understand the distinction. Obviously, UTF-24 isn't used because it isn't a standard encoding, and if you really wanted to save memory, you'd use UTF-8 instead, which is more compact still.
As for UTF-16: today, the only reason people choose it for new projects is that it's the native internal encoding of the ICU library. If you're not using ICU, it's pretty hard to defend anything but UTF-8.
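To make the code unit / code point distinction concrete: a code point outside the BMP, say U+1F600, is one code point in all three encodings, but a different number of code units in each. A rough sketch:

    #include <stdint.h>

    /* U+1F600 in each encoding: the same single code point,
       but a different number of code units. */
    uint32_t u32[] = { 0x1F600 };                /* 1 UTF-32 code unit */
    uint16_t u16[] = { 0xD83D, 0xDE00 };         /* 2 UTF-16 code units (surrogate pair) */
    uint8_t  u8[]  = { 0xF0, 0x9F, 0x98, 0x80 }; /* 4 UTF-8 code units */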
> I really don't see any advantage to UTF-32 when not-officially-standardised UTF-24 has the same constant-size 3-byte codepoints (and multiplying by 3 is not hard - it's n + 2n); in UTF-32, the highest byte will never be anything other than 0, so it's essentially permanent waste.
Wouldn’t the difference in alignment (4 bytes versus 3) make UTF-32 faster than UTF-24 in certain cases, on certain CPU architectures? If so, the always-zero byte would be trading space for greater performance.
People use UTF-32 sometimes because computers have 32-bit ints but not 24-bit ints. If you want a single primitive type to represent a code point, that's gonna be a 32-bit int. If you make an array of those, that's UTF-32.
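In other words (just a sketch of the point, not a full string type): UTF-32 is what you get when "one code point" and "one array element" coincide, and indexing is a single aligned 4-byte load rather than reassembling three bytes.

    #include <stdint.h>
    #include <stddef.h>

    typedef uint32_t codepoint;   /* one primitive type per code point */

    /* A UTF-32 "string" is just an array of those. */
    static codepoint nth(const codepoint *s, size_t i) {
        return s[i];              /* one aligned load: base + (i << 2) */
    }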
Fitting all currently assigned code points within 18 bits would be easy: you would only have to move one range.
Above 32FFF, the only assigned code points are E0000 to E01EF, and that range fits into the unused gap between 32FFF and 3FFFF with room to spare.
Those code points are used for flag emojis and for selecting uncommon CJK variants. If you don't support those, you could just strip out anything that doesn't fit in 18 bits to begin with.
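A sketch of that remapping, assuming (as above) that E0000 to E01EF is the only assigned range above 32FFF; the 0x33000 base is an arbitrary choice of slot inside the unused gap:

    #include <stdint.h>

    #define REMAP_BASE 0x33000u   /* anywhere from 0x33000 up to 0x3FE10 would work */

    static uint32_t to18(uint32_t cp) {
        if (cp >= 0xE0000u && cp <= 0xE01EFu)
            return REMAP_BASE + (cp - 0xE0000u);  /* lands in 0x33000..0x331EF */
        return cp;                                /* everything else is already below 0x33000 */
    }

    static uint32_t from18(uint32_t v) {
        if (v >= REMAP_BASE && v <= REMAP_BASE + 0x1EFu)
            return 0xE0000u + (v - REMAP_BASE);
        return v;
    }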
> Also, according to https://en.wikipedia.org/wiki/List_of_Unicode_characters there are currently fewer than 150k codepoints defined, so even 21 bits is several times larger than necessary: 18 bits will hold all the currently assigned codepoints, and will be sufficient until 256k codepoints are reached.
Incidentally, 18-bit architectures were once common: https://en.wikipedia.org/wiki/18-bit_computing