As a result of work I've been doing for illumos, I've recently gotten re-engaged with internationalization, and the support for this in libc and localedef (I am the original author for our localedef.)
I've decided that human languages suck. Some suck worse than others though, so I thought I'd write up a guide. You can take this as "your language sucks if...", or perhaps a better view might be "your program sucks if you make assumptions this breaks..."
(Full disclosure: I'm spoiled. I am a native speaker of English. English is pretty awesome for data-processing, at least at the written level. I'm not going to concern myself with deeper issues like grammar, natural language recognition, speech synthesis or recognition, automatic translation, etc. Instead this is focused strictly on the most basic display and simple operations like collation (sorting), case conversion, and character classification.)
1. Too many code points.
Some languages (from East Asia) have way, way too many code points. There are so many that these languages can't actually fit into 16 bits all by themselves. Yes, I'm saying that there are languages with over 65,000 characters in them! This explosion means that generating data for these languages results in intermediate lookup tables that are megabytes in size. For Unicode, this impacts all languages. The intermediate sources for the Unicode supported in illumos blow up to over 2GB when support for the additional code planes is included.
2. Your language requires me to write custom code for symbol names.
Hangul Jamo, I'm looking at you. Of all the languages in Unicode, only this one is so bizarre that it requires multiple lookup tables to determine the names of the characters, because the characters are made up of smaller phonetic pieces (vowels and consonants). It even has its own section in the basic conformance document for Unicode (section 3.12). I don't speak Korean, but I had to learn about Jamo.
3. Your language's character set is continuing to evolve.
Yes, that's Asia again (mostly China I think). The rate at which new Asian characters are added rivals that of updates to the timezone database. The approach your language uses is wrong!
4. Characters in your language are of multiple different cell widths.
Again, this is mostly, but not exclusively, Asian languages. Asian languages require 2 cells to display many of their characters. But, to make matters far, far worse, sometimes the number of code points used to represent a character is more than one, which means that the width of a single code point when displayed may be 0, 1, or 2 cells. Worse, some languages have both half- and full-width forms for many common symbols. Argh.
5. The width of the character depends on the context.
Some widths depend on the encoding because of historical practice (Asia again!), but then you have composite characters as well. For example, a Jamo vowel sound could in theory be displayed on its own. But if it follows a leading consonant, then it changes the consonant character and they become a new character (at least to the human viewer).
6. Your language has unstable case conversions.
There are some evil ones here, and thankfully they are rare. But some languages have case conversions which are not reversible! Case itself is kind of silly, but this is just insane! Armenian has a letter with this property, I believe.
7. Your language's collation order is context-dependent.
(French, I'm looking at you!) Some languages have sorting orders that depend not just on the character itself, but on the characters that precede or follow it. Some of the rules are really hard. The collation code required to deal with this is generally really, really scary looking.
8. Your language has equivalent alternates (ligatures).
German, your ß character, which stands in for "ss", is a poster child here. This is a single code point, but for sorting it is equivalent to "ss". This is just historical decoration, because it's "fancy". Stop making my programming life hard.
9. Your language can't decide on a script.
Some languages can be written in more than one script. For example, Mongolian can be written using Mongolian script or Cyrillic. But the winner (loser?) here is Serbian, which in some places uses both Latin and Cyrillic characters interchangeably! Pick a script already! I think the people who live like this are just schizophrenic. (Given all the political nonsense surrounding language in these places, that's no real surprise.)
10. Your language has Titlecase.
POSIX doesn't do Titlecase. This happens because your language also uses ligatures instead of just allocating a separate cell and code point for each character. Most people talk about titlecase used in a phrase or string of words. But yes, titlecase can apply to a SINGLE CHARACTER. For example, Dž is just such a character.
11. Your language doesn't use the same display / ordering we expect.
So some languages use right to left, which is backwards, but whatever. Others, crazy ones (but maybe crazy smart, if you think about it), use back-and-forth bidirectional. And still others use vertical ordering. But the worst of them are those languages (Asia again, dammit!) where the orientation of text can change. Worse, some cases even rotate individual characters, depending upon context (e.g. titles are rotated 90 degrees and placed on the right edge). How did you ever figure out how to use a computer with this crazy stuff?
12. Your encoding collides with control codes.
We use the first 32 or so character codes to mean special things for terminal control, etc. If we can't use these, your language is going to suck over certain kinds of communication lines.
13. Your encoding uses conflicting values at ASCII code points.
ASCII is universal. Why did you fight it? But that's probably just me being mostly Anglo-centric / bigoted.
14. Your language encoding uses shift characters.
(Code page, etc.) Some East Asian languages used this hack in the old days. Stateful encodings are JUST HORRIBLY BROKEN. The meaning of a given sequence of bytes should not depend on some state value that was sent a long time earlier.
15. Your language encoding uses zero values in the middle of valid characters.
Thankfully this doesn't happen with modern encodings in common use anymore. (Or maybe I just have decided that I won't support any encoding system this busted. Such an encoding is so broken that I just flat out refuse to work with it.)
Non-Broken Languages
So, there are some good examples of languages that are famously not broken.
a. English. Written English has simple sorting rules, and a very simple character set. Diphthongs are never ligatures. This is so useful for data processing that I think it has had a great deal to do with why English is the common language for computer scientists around the world. US-ASCII, an English character set, is the "base" character set for Unicode, and pretty much all other encodings use ASCII encodings in the lower 7 bits.
b. Russian. (And likely others that use Cyrillic, but not all of them!) Russian has a very simple alphabet, strictly phonetic. The number of characters is small, there are no composite characters, and no special sorting rules. Hmm... I seem to recall that Russia (Soviet era) had a pretty robust computing industry. And these days Russians mostly own the Internet, right? Coincidence? Or maybe they just don't have to waste a lot of time fighting with the language just to get stuff done?
I think there are probably others. At a glance, Georgian looks pretty straightforward. I suspect that there are languages using both Cyrillic and Latin character sets that are sane. Ethiopic actually looks pretty simple and sane too. (Again, just from a text processing standpoint.)
But sadly, the vast majority of natural languages have written forms & rules that completely and utterly suck for text processing.
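To make a few of the items above concrete, here is a minimal Python sketch that pokes at items 4, 5, 6, 8, and 10 using only the standard library (`unicodedata` plus the built-in `str` methods). This is not how illumos libc or localedef does any of it. The helper names `round_trips`, `cell_width`, and `string_width` are made up for illustration, and `cell_width` is a crude approximation under a stated assumption; real `wcwidth()` tables handle many more cases.

```python
import unicodedata


def round_trips(s: str) -> bool:
    """True if upper-casing and then lower-casing gives back the original."""
    return s.upper().lower() == s


def cell_width(ch: str) -> int:
    """Crude guess at the terminal cells used by one code point.

    Assumption (illustrative only): combining marks take 0 cells, East
    Asian Wide and Fullwidth characters take 2, everything else takes 1.
    """
    if unicodedata.combining(ch):
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def string_width(s: str) -> int:
    """Sum of the per-code-point guesses from cell_width()."""
    return sum(cell_width(ch) for ch in s)


if __name__ == "__main__":
    # Items 6 and 8: case conversions that are not reversible.
    # U+00DF is the German sharp s; U+0587 is an Armenian ligature.
    for ch in ("a", "\u00df", "\u0587"):
        up = ch.upper()
        print(ch, "->", up, "->", up.lower(), "round trips:", round_trips(ch))

    # Item 10: titlecase applied to a single character.
    # U+01C6 is the lowercase dz-with-caron digraph; its titlecase form
    # is a third, distinct code point (U+01C5, general category "Lt").
    dz = "\u01c6"
    print(dz, dz.upper(), dz.title(), unicodedata.category(dz.title()))

    # Item 4: code points that occupy 0, 1, or 2 cells.
    for s in ("e\u0301", "\u4e2d\u6587", "\uff21", "\uff71"):
        print(s, [cell_width(ch) for ch in s], string_width(s))

    # Item 5: three conjoining Jamo become one syllable after NFC.
    jamo = "\u1112\u1161\u11ab"
    syllable = unicodedata.normalize("NFC", jamo)
    print(len(jamo), "code points ->", len(syllable), unicodedata.name(syllable))
```

The point of the sketch is that every one of these "simple" operations changes the length, the identity, or the display width of the string, so code that assumes one code point equals one character equals one cell breaks on all of them.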