As a beginner programmer back in the day I'd have agreed that unicode is weird and "length in characters" is the right metric for the database entry.
However, as a senior dev who has read about and messed around enough with binary, UTF-X, compression/GZIP, etc., I'd say that "character length" for database field size is a weird concept and that "size in bytes" would make more sense, since that maps better to what you have on the HDD/SSD/network.
Everyone replying here seems to be quoting "business reasons" which seem very English-centric. Try to talk to a Japanese or Korean person about their "business reasons".
I am suffering the opposite problem. My name has 36+ letters and I live in Japan, where the average full name has "3-4 characters" (Kanji) so they tend to be safe and allow for 10-15 characters, if I'm lucky maybe even 10 for given and 10 for family names (20 in total), where I don't have to totally murder my name and just need to amputate it.
Typically there should be a maximum size in bytes for performance reasons (e.g. how many rows fit in a page) and/or a minimum size in "characters" (very ill defined but usually approximated as code points, with some slack to allow a reasonable amount of combining marks in addition to the desired number of letters), with no guarantee that the two sizes are compatible.
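To make the "two sizes" concrete, here is a minimal Python sketch that measures the same string several ways. The helper name `measure` is made up for illustration, and the "non_combining" count is only a rough stand-in for letters: it skips combining marks but is not real grapheme cluster segmentation.

```python
import unicodedata

def measure(text: str) -> dict:
    """Return several competing notions of the 'length' of text."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        # Approximation only: code points that are not combining marks.
        "non_combining": sum(1 for ch in text if unicodedata.combining(ch) == 0),
    }

kanji_name = "山田太郎"
accented = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(measure(kanji_name))  # 4 code points, 12 UTF-8 bytes
print(measure(accented))    # 2 code points, 3 UTF-8 bytes, 1 non-combining
```

A four-kanji name fits easily in a "10 characters" limit yet already needs 12 bytes in UTF-8, which is exactly where a byte limit and a character limit can disagree.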
In practice I've often found short UTF-8 columns with a length limit in bytes that "accidentally" truncated text, with catastrophic effects, after the application had checked value lengths in characters.
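A rough Python sketch of that failure mode, assuming a hypothetical 10-byte column and an application that only checks for "10 characters". The function `truncate_utf8` is an illustrative helper, not something from a particular database driver.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8 without splitting a code point."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # The input was valid UTF-8, so only the tail of the slice can be an
    # incomplete sequence; errors="ignore" drops those trailing bytes.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")

name = "山田太郎"        # 4 code points, 12 bytes in UTF-8
assert len(name) <= 10   # the application's "10 characters" check passes

raw_cut = name.encode("utf-8")[:10]  # what a blind 10-byte cut would keep
try:
    raw_cut.decode("utf-8")
    naive_cut_is_valid = True
except UnicodeDecodeError:
    naive_cut_is_valid = False

print(naive_cut_is_valid)        # False: the cut ends inside a 3-byte sequence
print(truncate_utf8(name, 10))   # 山田太
```

Even the careful version only respects code point boundaries. It can still cut between a base letter and its combining marks, so rejecting the value up front is usually safer than truncating it.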
If you want a constraint for such a column, then due to the nature of our complex writing systems, it has to be separate from the actual field type and physical storage allocation - it's the equivalent of an integer field having a constraint that it must be between 1 and 100.
Is there ever a genuine, non-arbitrary business requirement to limit any string to a certain number of unicode characters? And by characters here we probably mean extended grapheme clusters.
If the data is going to get printed in a monospace font, on a passport or credit card or something, then I can understand a limit on the number of characters. But then it's not full Unicode either - you want to constrain the length in some specific character set.
Otherwise, I think limits are always arbitrary. I suspect there is a strong cultural holdover from the days of punched cards here.
>> If the data is going to get printed in a monospace font, on a passport or credit card
In the case of limited space, you absolutely can count characters and limit them in the UI. For data storage though you would provision "enough bytes". Think approximately 4 bytes per character, plus a bit extra just in case.
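The "4 bytes per character" figure comes from UTF-8 needing at most 4 bytes per code point. A tiny Python sketch of the arithmetic, where `bytes_to_provision` and the slack value are assumptions for illustration and "character" is approximated as a code point:

```python
MAX_UTF8_BYTES_PER_CODE_POINT = 4  # upper bound for any code point in UTF-8

def bytes_to_provision(max_code_points: int, slack_code_points: int = 0) -> int:
    """Worst-case UTF-8 size for a UI limit plus some slack (e.g. combining marks)."""
    return (max_code_points + slack_code_points) * MAX_UTF8_BYTES_PER_CODE_POINT

ui_limit = 20                                 # limit enforced in the UI
budget = bytes_to_provision(ui_limit, slack_code_points=5)
worst_case = len(("\U0001F600" * ui_limit).encode("utf-8"))  # 20 four-byte emoji

print(budget)       # 100
print(worst_case)   # 80
```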
But I agree in most cases the length limits are completely arbitrary and simply made up by the programmer at the time the database is designed. (Hint: They are _always_ too short, especially if the programmer lacks experience.)
Business constraints should be at the server layer (e.g. API constraints), not the DB layer.
You want the name to be max 200 characters, the birth year to be minimum 1900, and the email address to not contain the domain name hotmail.com? Don’t do all this at the DB layer! What if you want some users to have different constraints, e.g. selectively disable some of them?
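As a sketch of what that could look like at the API layer, here is a small Python validator where each rule can be switched off per caller. `SignupRules` and `validate_signup` are hypothetical names, and "characters" is approximated as code points.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional

@dataclass(frozen=True)
class SignupRules:
    """Business rules kept out of the schema; None disables a rule."""
    max_name_code_points: Optional[int] = 200
    min_birth_year: Optional[int] = 1900
    blocked_email_domains: FrozenSet[str] = frozenset({"hotmail.com"})

def validate_signup(name: str, birth_year: int, email: str,
                    rules: SignupRules = SignupRules()) -> List[str]:
    """Return a list of violated rules (empty means the input is accepted)."""
    errors = []
    if rules.max_name_code_points is not None and len(name) > rules.max_name_code_points:
        errors.append("name too long")
    if rules.min_birth_year is not None and birth_year < rules.min_birth_year:
        errors.append("birth year too early")
    domain = email.rpartition("@")[2].lower()
    if domain in rules.blocked_email_domains:
        errors.append("email domain not allowed")
    return errors

# Default rules reject this input; a relaxed rule set for some users accepts it.
relaxed = SignupRules(max_name_code_points=None, min_birth_year=None,
                      blocked_email_domains=frozenset())
print(validate_signup("A" * 201, 1850, "someone@hotmail.com"))
print(validate_signup("A" * 201, 1850, "someone@hotmail.com", relaxed))  # []
```

The storage column can then be provisioned generously in bytes, while the rules themselves stay easy to change without a schema migration.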
Nobody cares about them, I am afraid. Most people _really_ can't accept that you can't work on Unicode like you did with ASCII, and do not want to let go of their old C habits (like iterating char by char and doing something).
It makes it less intuitive for the user if you have input fields limited to n bytes instead of to n characters (complications caused by combining characters notwithstanding).
Twitter used(?) to have this problem: emojis, not being in the BMP, would count as two "characters" because they were encoded as UTF-16 surrogate pairs.
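The surrogate pair effect is easy to reproduce. A minimal Python sketch, with `utf16_units` as an illustrative helper (Python's own `len` counts code points, so the UTF-16 count has to be computed from the encoded form):

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units needed to encode text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

emoji = "\U0001F600"   # outside the BMP, needs a surrogate pair in UTF-16
e_acute = "\u00e9"     # inside the BMP, a single code unit

print(len(emoji), utf16_units(emoji))      # 1 2
print(len(e_acute), utf16_units(e_acute))  # 1 1
```

Any system that measures length in UTF-16 code units will charge two "characters" for that single emoji.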