wchar_t is a massive mistake that came from an era (the '90s) when people wanted to delude themselves that they could keep ASCII-like code ergonomics (iteration, random access, ...) just by changing the character type.
It was a stupid mistake we are still paying for to this day, as OSes and libraries designed in the '90s are forced to constantly convert between the UCS-2 they use internally and UTF-8 (just look at Qt).
Isn't this similar to what modern languages did? They abstracted away the underlying encoding so that the programmer deals with characters instead of bytes. Two examples are Python and JavaScript strings; they kept ASCII-like code ergonomics (iteration and random access).
That of course required separating strings from byte-like objects, much as wchar_t differs from uint8_t.
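That separation is visible directly in Python, where `str` and `bytes` are distinct types with different element semantics (a minimal sketch; the string `"Crêpe"` is just an illustrative example):

```python
# str is a sequence of Unicode codepoints; bytes is raw memory.
s = "Crêpe"
b = s.encode("utf-8")

print(type(s).__name__, type(b).__name__)  # str bytes
print(s[0], b[0])      # 'C' as a 1-char str vs 67, the raw byte value of 'C'
print(len(s), len(b))  # 5 codepoints vs 6 bytes ('ê' takes two bytes in UTF-8)
```

Indexing a `str` yields another `str`, while indexing `bytes` yields an `int`, which is exactly the character/byte split being discussed.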
Isn't signed char actually the culprit in modern C, and therefore useless? Its main use is ASCII, and that's obsolete.
> so that the programmer deals with characters instead of bytes
This is a broken approach and a broken mindset. Modern languages like Rust (Python is not modern anymore; at 33 years old it is ATM older than C was in 1990) went back to "a string is an array of uint8", because that's the only sane way to operate on them. Naive iteration and random access are broken unless they are performed on the underlying bytes, because iterating over Unicode "characters" is a *broken concept*.
Python strings are also arguably somewhat broken, because they still allow random access into a string at "character" (not byte) indexes, which causes all sorts of issues when slicing. This means that code that works perfectly with English text will malfunction when handling other languages.
The hard truth is that slicing a Unicode string is a non-trivial and (somewhat) expensive operation, while Python slicing was designed in the Python < 2.7 days under the assumption char == byte, an assumption that is now broken in every encoding except 8-bit, single-codepage ones.
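The cost shows up as soon as you slice the underlying bytes naively (a sketch; here `s` is the precomposed form of the string, where 'ê' is a single codepoint occupying two UTF-8 bytes):

```python
s = "Crêpe"              # precomposed: 'ê' is one codepoint, two UTF-8 bytes
b = s.encode("utf-8")

# Cutting at byte 3 lands in the middle of the two-byte sequence for 'ê':
try:
    b[:3].decode("utf-8")
except UnicodeDecodeError as e:
    print("truncated UTF-8:", e.reason)
```

A byte-oriented slice therefore has to check sequence boundaries before cutting, which is exactly the work that char == byte slicing was designed to avoid.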
For instance (`unicodedata` is from the standard library):
>>> import unicodedata
>>> s = unicodedata.normalize('NFD', 'Crêpe')
>>> s
'Crêpe'
>>> [unicodedata.name(c) for c in s]
['LATIN CAPITAL LETTER C', 'LATIN SMALL LETTER R', 'LATIN SMALL LETTER E', 'COMBINING CIRCUMFLEX ACCENT', 'LATIN SMALL LETTER P', 'LATIN SMALL LETTER E']
>>> len(s)
6
In this case, the string "Crêpe" is in NFD form (all decomposable characters are decomposed); in particular, `ê` is not U+00EA 'LATIN SMALL LETTER E WITH CIRCUMFLEX' but '\u0065\u0302', which is U+0065 'LATIN SMALL LETTER E' plus U+0302 'COMBINING CIRCUMFLEX ACCENT' ( ◌̂ ).
Rust, which is more modern and enforces UTF-8 (not some broken version of UCS-2 or worse), ALWAYS slices on bytes, because it doesn't make sense to slice on "codepoints". Asking for a slice whose byte range falls in the middle of a UTF-8 multi-byte sequence will cause a panic; and `s[2]` doesn't even compile, because a `str` can't be indexed by a single integer (you would write `s.as_bytes()[2]` to get the third _byte_). If you want a particular character, you are forced to go through a Unicode library, as you always should.
Python will instead happily comply, returning exactly the codepoints you asked for, because it sees a string composed of 6 Unicode codepoints, despite the fact that the user sees only 5 rendered on screen:
>>> s[0:3]
'Cre'
>>> [unicodedata.name(c) for c in s[0:3]]
['LATIN CAPITAL LETTER C', 'LATIN SMALL LETTER R', 'LATIN SMALL LETTER E']
This makes string slicing basically useless, because even if you normalise all strings with `unicodedata.normalize('NFC', s)` before slicing them, there are still several printable characters that cannot be represented by a single Unicode codepoint.
For instance,
>>> eu = '🇪🇺'
>>> len(eu)
2
>>> eu[0:1]
'🇪'
because all flag emojis are represented with two Unicode codepoints, each one representing a letter of the country code.
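You can verify the two codepoints with `unicodedata`; they are regional indicator symbols (the EU flag is used here as an assumed example):

```python
import unicodedata

# The EU flag is the pair U+1F1EA U+1F1FA: regional indicators for E and U.
eu = "\U0001F1EA\U0001F1FA"
print([unicodedata.name(c) for c in eu])
```

Slicing off either half leaves a lone regional indicator, which renders as a boxed letter rather than a flag.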
TL;DR: do not use wchar_t, UTF-16, UTF-32, ...; use UTF-8 wherever possible, and under all circumstances treat strings as black boxes specialised for text, which you can only access byte by byte. If you need to do text operations, use a library like ICU or whatever your language/ecosystem provides.
Thanks for the examples, your point makes sense to me now. Combining characters, modifier characters, and all the other weird aspects of Unicode really call for a separate library for handling it, since even relying on 1 Unicode codepoint == 1 text unit doesn't buy you much, because 1 visible character isn't always 1 codepoint.
The cherry on top is that `wchar_t` is spectacularly broken in C, alongside multibyte encodings and everything else that isn't ASCII (https://thephd.dev/cuneicode-and-the-future-of-text-in-c).