Friday, October 17, 2014

Your language sucks...

As a result of work I've been doing for illumos, I've recently gotten re-engaged with internationalization, and the support for this in libc and localedef (I am the original author for our localedef.)

I've decided that human languages suck.  Some suck worse than others though, so I thought I'd write up a guide.  You can take this as "your language sucks if...", or perhaps a better view might be "your program sucks if you make assumptions this breaks..."

(Full disclosure, I'm spoiled.  I am a native speaker of English.  English is pretty awesome for data-processing, at least at the written level.  I'm not going to concern myself with questions about deeper issues like grammar, natural language recognition, speech synthesis or recognition, automatic translation, etc.  Instead this is focused strictly on the most basic display and simple operations like collation (sorting), case conversion, and character classification.)

1. Too many code points. 

Some languages (from Eastern Asia) have way way too many code points.  There are so many that these languages can't actually fit into 16 bits all by themselves.  Yes, I'm saying that there are languages with over 65,000 characters in them!  This explosion means that generating data for languages results in intermediate lookup tables that are megabytes in size.  For Unicode, this impacts all languages.  The intermediate sources for the Unicode supported in illumos blow up to over 2GB when support for the additional code planes is included.

2. Your language requires me to write custom code for symbol names. 

Hangul Jamo, I'm looking at you.  Of all the languages in Unicode, only this one is so bizarre that it requires multiple lookup tables to determine the names of the characters, because the characters are made up of smaller bits of phonetic portions (vowels and consonants.)  It even has its own section in the basic conformance document for Unicode (section 3.12).  I don't speak Korean, but I had to learn about Jamo.
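
To make the "multiple lookup tables" concrete, here is a minimal Python sketch of the name derivation for precomposed Hangul syllables. The constants and short-name tables follow the Hangul section of the Unicode conformance chapter mentioned above; the function name `hangul_syllable_name` is made up for this illustration, and this is not the localedef code.

```python
import unicodedata

# Constants for the precomposed Hangul syllable block.
S_BASE = 0xAC00
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT      # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT      # 11172 syllables in total

# Short names of the leading consonants, vowels, and trailing consonants.
JAMO_L = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
          "SS", "", "J", "JJ", "C", "K", "T", "P", "H"]
JAMO_V = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA",
          "WAE", "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"]
JAMO_T = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG",
          "LM", "LB", "LS", "LT", "LP", "LH", "M", "B", "BS", "S",
          "SS", "NG", "J", "C", "K", "T", "P", "H"]


def hangul_syllable_name(ch):
    """Derive the character name of a precomposed Hangul syllable."""
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    l_index = s_index // N_COUNT
    v_index = (s_index % N_COUNT) // T_COUNT
    t_index = s_index % T_COUNT
    return "HANGUL SYLLABLE " + JAMO_L[l_index] + JAMO_V[v_index] + JAMO_T[t_index]


# Compare the derived name with the one Python's own Unicode database reports.
print(hangul_syllable_name("\ud55c"), "/", unicodedata.name("\ud55c"))
```

No other block of characters needs three tables and integer arithmetic just to answer "what is this character called?".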

3. Your language's character set is continuing to evolve. 

Yes, that's Asia again (mostly China I think).   The rate at which new Asian characters are added rivals that of updates to the timezone database.  The approach your language uses is wrong!

4. Characters in your language are of multiple different cell widths. 

Again, this is mostly, but not exclusively, Asian languages.  Asian languages require 2 cells to display many of their characters.  But, to make matters far far worse, sometimes the number of code points used to represent a character is more than one, which means that the width of a code point when displayed may be 0, 1, or 2 cells.   Worse, some languages have both half- and full-width forms for many common symbols.  Argh.
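
A rough Python sketch of the 0/1/2 cell problem, using the standard `unicodedata` module. The helper names `cell_width` and `display_width` are invented for this example, and the rule is a deliberate simplification: it ignores control characters, ambiguous-width characters, and zero-width characters that are not combining marks, all of which a real `wcwidth()` has to care about.

```python
import unicodedata


def cell_width(ch):
    """Approximate number of terminal cells one code point occupies."""
    if unicodedata.combining(ch):
        return 0                      # combining mark: drawn over the previous cell
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2                      # wide or fullwidth form
    return 1                          # everything else, in this simplified model


def display_width(text):
    """Approximate cell width of a whole string."""
    return sum(cell_width(ch) for ch in text)


for sample in ("a", "\u4e2d", "\uff21", "\uff71", "e\u0301"):
    print(repr(sample), len(sample), display_width(sample))
```

The last sample is two code points ("e" plus a combining acute accent) that occupy a single cell, while the fullwidth "A" (U+FF21) is one code point that occupies two.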

5. The width of the character depends on the context. 

Some widths depend on the encoding because of historical practice (Asia again!), but then you have composite characters as well.  For example, a Jamo vowel sound could in theory be displayed on its own.  But if it follows a leading consonant, then it changes the consonant character and they become a new character (at least to the human viewer).
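
The Jamo case can be seen with Unicode normalization in Python. This is only an illustration of the composition behavior using the standard `unicodedata` module; the variable names are arbitrary.

```python
import unicodedata

# Three conjoining Jamo: leading HIEUH, vowel A, trailing NIEUN.
jamo = "\u1112\u1161\u11ab"

# NFC composes them into one precomposed syllable (U+D55C).
syllable = unicodedata.normalize("NFC", jamo)

print(len(jamo), len(syllable))

# NFD takes the syllable apart again.
assert unicodedata.normalize("NFD", syllable) == jamo
```

Both strings are the same text to a reader, but one is three code points and the other is one, so any width calculation has to look at neighbors rather than at each code point alone.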

6. Your language has unstable case conversions.

There are some evil ones here, and thankfully they are rare.  But some languages have case conversions which are not reversible!  Case itself is kind of silly, but this is just insane!  Armenian has a letter with this property, I believe.
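
A small Python check shows what a non-reversible conversion looks like. It relies on Python's built-in full case mappings; the helper name `survives_case_round_trip` is made up here. The Armenian example is the ligature U+0587, which I take to be the letter in question, and the German ß behaves the same way.

```python
def survives_case_round_trip(s):
    """True if lowercasing the uppercased string gives back the original."""
    return s.upper().lower() == s


for sample in ("hello", "stra\u00dfe", "\u0587"):
    upper = sample.upper()
    print(repr(sample), "->", repr(upper), "->", repr(upper.lower()),
          survives_case_round_trip(sample))
```

Both problem characters expand to two letters when uppercased, and lowercasing those two letters has no way of knowing they used to be one.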

7. Your language's collation order is context-dependent. 

(French, I'm looking at you!)  Some languages have sorting orders that depend not just on the character itself, but on the characters that precede or follow it.  Some of the rules are really hard.  The collation code required to deal with this generally is really really scary looking.

8. Your language has equivalent alternates (ligatures). 

German, your ß character, which stands in for "ss", is a poster child here.  This is a single code point, but for sorting it is equivalent to "ss".  This is just historical decoration, because it's "fancy".  Stop making my programming life hard.

9. Your language can't decide on a script. 

Some languages can be written in more than one script.  For example, Mongolian can be written using Mongolian script or Cyrillic.  But the winner (loser?) here is Serbian, which in some places uses both Latin and Cyrillic characters interchangeably! Pick a script already! I think the people who live like this are just schizophrenic.  (Given all the political nonsense surrounding language in these places, that's no real surprise.)

10. Your language has Titlecase. 

POSIX doesn't do Titlecase.  This happens because your language also uses ligatures instead of just allocating a separate cell and code point for each character.  Most people talk about titlecase used in a phrase or string of words.  But yes, titlecase can apply to a SINGLE CHARACTER.  For example, Dž is just such a character.
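
For the curious, the three forms of that digraph can be poked at from Python, whose string methods know about titlecase even though POSIX does not. The variable names are just labels for this sketch.

```python
import unicodedata

upper = "\u01c4"   # LATIN CAPITAL LETTER DZ WITH CARON
title = "\u01c5"   # LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
lower = "\u01c6"   # LATIN SMALL LETTER DZ WITH CARON

# The titlecase form has its own general category, "Lt".
print(unicodedata.category(upper), unicodedata.category(title), unicodedata.category(lower))

# Upper and lower map to each other, and title() picks the third form.
print(lower.upper() == upper, upper.lower() == lower, lower.title() == title)
```

A two-way toupper/tolower model simply has no slot for the middle form.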

11. Your language doesn't use the same display / ordering we expect.

So some languages use right to left, which is backwards, but whatever.   Others, crazy ones (but maybe crazy smart, if you think about it) use back and forth bidirectional.  And still others use vertical ordering.  But the worst of them are those languages (Asia again, dammit!) where the orientation of text can change.  Worse, some cases even rotate individual characters, depending upon context (e.g. titles are rotated 90 degrees and placed on the right edge).  How did you ever figure out how to use a computer with this crazy stuff?

12. Your encoding collides control codes.

We use the first 32 or so character codes to mean special things for terminal control, etc.  If we can't use these, your language is going to suck over certain kinds of communication lines.

13. Your encoding uses conflicting values at ASCII code points.

ASCII is universal.  Why did you fight it?  But that's probably just me being mostly Anglo-centric / bigoted.

14. Your language encoding uses shift characters. 

(Code page, etc.)  Some East Asian languages used this hack in the old days.  Stateful encodings are JUST HORRIBLY BROKEN.   A given sequence of characters should not depend on some state value that was sent a long time earlier.
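
As one concrete example of such a stateful encoding (my choice of illustration, not necessarily the one you will meet), Python ships a codec for ISO-2022-JP. The sketch below shows the same two bytes meaning different things depending on the escape sequence that came before them.

```python
# Encode a single hiragana character with the stateful ISO-2022-JP codec.
encoded = "\u3042".encode("iso2022_jp")

# The stream is: escape sequence (shift in), two payload bytes, escape sequence (shift out).
payload = encoded[3:5]

print(encoded)
print(payload.decode("ascii"))        # the payload on its own is plain ASCII punctuation
print(encoded.decode("iso2022_jp"))   # with the shift state, it is one Japanese character
```

Cut the stream in the wrong place, or lose the escape sequence on a noisy line, and everything after it decodes as garbage that still looks like valid ASCII.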

15. Your language encoding uses zero values in the middle of valid characters. 

Thankfully this doesn't happen with modern encodings in common use anymore.  (Or maybe I just have decided that I won't support any encoding system this busted.  Such an encoding is so broken that I just flat out refuse to work with it.)
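
To show why embedded zeros are so painful for C-style string handling, here is a short Python sketch. UTF-16 is used as the example of an encoding with zero bytes inside ordinary characters; that choice, and the helper name `c_string_view`, are assumptions of this illustration.

```python
def c_string_view(data):
    """What a NUL-terminated C string routine would see of these bytes."""
    return data.split(b"\x00", 1)[0]


text = "AB"
utf16 = text.encode("utf-16-le")   # every ASCII letter is followed by a zero byte
utf8 = text.encode("utf-8")        # no zero bytes unless the text contains U+0000

print(c_string_view(utf16))
print(c_string_view(utf8))
```

With the UTF-16 bytes, anything built on strlen() or strcpy() stops after the first letter, which is exactly the kind of breakage that makes such encodings unusable in libc-style interfaces.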

Non-Broken Languages


So, there are some good examples of languages that are famously not broken.

a. English.  Written English has simple sorting rules, and a very simple character set.  Diphthongs are never ligatures.  This is so useful for data processing that I think it has had a great deal to do with why English is the common language for computer scientists around the world.  US-ASCII, an English character set, is the "base" character set for Unicode, and pretty much all other encodings use ASCII encodings in the lower 7 bits.

b. Russian.  (And likely others that use Cyrillic, but not all of them!)  Russian has a very simple alphabet, strictly phonetic.  The number of characters is small, there are no composite characters, and no special sorting rules.  Hmm... I seem to recall that Russia (Soviet era) had a pretty robust computing industry.  And these days Russians mostly own the Internet, right?  Coincidence?  Or maybe they just don't have to waste a lot of time fighting with the language just to get stuff done?

I think there are probably others.  (At a glance, Georgian looks pretty straightforward.)  I suspect that there are languages using both Cyrillic and Latin character sets that are sane.  Ethiopic actually looks pretty simple and sane too.  (Again, just from a text processing standpoint.)

But sadly, the vast majority of natural languages have written forms & rules that completely and utterly suck for text processing.

20 comments:

András said...

You think collation in French is hard? Try Hungarian! :)

The Hungarian alphabet contains several "double letters" (cs, dz, gy, ly, ny, sz, ty, zs) and a "triple letter" (dzs), but these don't have their own code points -- they're just represented by their constituent letters in writing. However, they have an impact on collation and, by inference regular expression matching (think [a-z], which doesn't include "zs" because "zs" is sorted after z!).

The official collation order is:

A, Á, B, C, Cs, D, Dz, Dzs, E, É, F, G, Gy, H, I, Í, J, K, L, Ly, M, N, Ny, O, Ó, Ö, Ő, P, Q, R, S, Sz, T, Ty, U, Ú, Ü, Ű, V, W, X, Y, Z, Zs

(Luckily, the language doesn't insist on title case -- normally all constituent characters of all collation symbols are capitalized.)

This means that words starting with "cs" are sorted after words starting with e.g. "cu", and that /^[a-c]/ shouldn't match e.g. "csók" (kiss) because it starts with a cs, not a c.

But this is where it gets _really_ interesting, because these special letter combinations can also occur (especially in compound words) without being a "single letter (collation symbol) represented by two characters". For example, "mézsör" contains a z, followed by an s, as does "őzsuta"; "pácsó" contains a c followed by an s (so that "pácsó" would be sorted between "pácra" and "páctól", not between pácsr* and pácst*, where c and s represent the collating symbol "cs").

Another pathological example is "egészség" (health -- literally "wholeness"), which contains an "sz" followed by a "z", not an "s" followed by a "zs" -- and the only way to know that is from a dictionary. "Rézszínű" ("copper coloured") is a compound word that may not even appear in dictionaries, so how do you guess whether it has zs-s or z-sz? (In case you're wondering, it's the latter.)

I'm fairly certain there are even words that can be read two ways and you have to infer which one it is based on context, which makes regex matching fun, to say the least. And if someone makes a sick pun that depends on _both_ readings...
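
A naive Python sketch of the rules described in this comment, for illustration only. It assumes the alphabet order exactly as listed above, treats every listed letter as fully distinct, and greedily takes the longest letter at each position; the names `ALPHABET`, `RANK`, and `collation_key` are invented for the example. It is not how a real locale implements Hungarian collation.

```python
# Hungarian alphabet in the order given in the comment above.
ALPHABET = ["a", "á", "b", "c", "cs", "d", "dz", "dzs", "e", "é", "f", "g",
            "gy", "h", "i", "í", "j", "k", "l", "ly", "m", "n", "ny", "o",
            "ó", "ö", "ő", "p", "q", "r", "s", "sz", "t", "ty", "u", "ú",
            "ü", "ű", "v", "w", "x", "y", "z", "zs"]
RANK = {letter: index for index, letter in enumerate(ALPHABET)}


def collation_key(word):
    """Split a word into letters (longest match first) and return their ranks."""
    word = word.lower()
    key = []
    i = 0
    while i < len(word):
        for size in (3, 2, 1):
            chunk = word[i:i + size]
            if chunk in RANK:
                key.append(RANK[chunk])
                i += len(chunk)
                break
        else:
            # Unknown character: sort it after the whole alphabet.
            key.append(len(ALPHABET) + ord(word[i]))
            i += 1
    return key


words = ["cukor", "csók", "cél"]
print(sorted(words))                      # plain code point order
print(sorted(words, key=collation_key))   # digraph-aware order

# The greedy split reads "pácsó" as p, á, cs, ó: four letters, not five.
print(len(collation_key("pácsó")))
```

The last line is the failure described above: without a dictionary, the greedy split cannot tell a real "cs" from a "c" that happens to be followed by an "s" in a compound word.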

Mithaldu said...

Your point about German is incorrect. It is acceptable to replace ß with ss in computing environments. (Because we Germans know the English computocracy made too many things that can't handle it.)

They are however not the same letter or sound. Example:

Fass (barrel) has a short a

Fuß (foot) has a long u

That is because double consonants in German shorten the preceding vowel, and while ß is in pronunciation a combination of s and z (not s and s), it is a single consonant.

题叶 said...

No, Chinese characters are limited, though there are plenty of them. But we never create new characters now. And our IME solution does not allow us to do that, while English people just combine words to create new ones.

Michael Richter said...

"ASCII is universal. Why did you fight it? But that's probably just me being mostly Anglo-centric / bigoted."

The AMERICAN Standard Code for Information Interchange is "universal". Huh. I see.

So, I'm left with a few options here:
1.You are incredibly stupid.
2.You are incredibly ignorant.
3.You are incredibly delusional.

Which one do you think is the kindest interpretation?

Peter Jeschke said...

Point 8) No, ß and ss are not equivalent. They are used in different cases and influence the pronunciation. (An example - Correct: Spaß (Fun) Incorrect: Spass. There's no such word)

Many people write ss if they can't write ß for some reason (eg they don't have the key on their keyboard). But that's still wrong, just accepted.

ivanhoe said...

Actually Russian is not strictly phonetic, they have 2 letters called soft and hard sign, which are used only to affect how the letter before them is pronounced. On the other hand Serbian Cyrillic alphabet is indeed 100% phonetic, and unlike what you claim it's still pretty easy for automatic processing, even if many Serbs use the Latin alphabet to write, because detection and transliteration between the two alphabets is quite easy, especially from Cyrillic to Latin (you can directly replace letters one by one). Also nothing schizophrenic about the two alphabets, honestly, it's simply a remnant from the days of the Yugoslav Federation, where Serbs had to communicate on a daily basis with people from other parts of Yugoslavia who were using only the Latin alphabet... being on the same (code) page simplifies things :)

Garrett D'Amore said...

Thanks for pointing out the difference between ß and "ss". I seem to recall they were equivalent in the collation order, but now I'm having trouble finding it. (The new collation input files are huge, and painful to work with.)

Garrett D'Amore said...

With respect to ASCII. ASCII is universal now. It is also known as ISO646, which is an ISO (International) standard.

ASCII sucks for people who need letters that fall outside of it. But having colliding code points makes things hard, given that these days ~the world relies on ASCII.

(And yes -- pretty much all the Internet standards started with ASCII, though some now accept UTF-8. But originally HTML, domain names, RFC 822, etc. all could only be encoded in ASCII.) For many countries, ISO-8859 standards added extensions to the character set for different languages; even Turkish has an ISO 8859 standard that doesn't collide with ASCII (8859-9).

In fact, even Russian, using KOI8-R, offers an 8-bit character set that leaves ASCII in the lower order bits.

Unicode does this too. UTF-8 is a strict superset of ASCII. So does EUC (extended Asian encodings).

So yes, ASCII *is* universal.

We can argue whether this is a result of bigotry, accident, or other causes. As I said earlier, if your encoding system conflicts with ASCII, it's broken. And the reason is all those other universal things -- like oh, e-mail.

So, maybe you hate America, but hating on ASCII is just plain stupid.

Garrett D'Amore said...

Oh by the way, while my use of ß was unfortunate as an example, there are definitely others. Most of them are compounds with accents or ligatures like ṻ or Æ. (And not just in Latin either: Greek has them too, an example: ώ.) In some languages these characters are unique and distinct from the separation, but often they aren't, and the single code point can be decomposed into multiple components. (This actually leads to a whole standard around "normalization forms"...)
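
For anyone who wants to see the decomposition, Python's standard `unicodedata` module exposes the normalization forms. A short sketch using the Greek example above (variable names are arbitrary):

```python
import unicodedata

composed = "\u03ce"   # GREEK SMALL LETTER OMEGA WITH TONOS, one code point

# NFD splits it into the base letter plus a combining accent.
decomposed = unicodedata.normalize("NFD", composed)
print([hex(ord(ch)) for ch in decomposed])

# The two spellings compare unequal until they are normalized to the same form.
print(composed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)

# Not every compound-looking letter decomposes: Æ stays a single code point.
print(unicodedata.normalize("NFD", "\u00c6") == "\u00c6")
```

Any comparison or sort that skips the normalization step will treat the two spellings of the same visible character as different strings.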

Peter said...

Canadian English has special rules for collation -- the phone books (at least in the days when they were printed on dead trees) have a separate section for names starting with "Mac" and "Mc", and these both sort together (e.g., "Macdonald" and "McDonald" are both considered the same)

Garrett D'Amore said...

Russian hard and soft signs (ъ and ь respectively) probably do violate strict phonetical rules -- since they affect pronunciation of an earlier letter. But, they generally have no impact on text processing (sorting).

If a given language uses a script form consistently it's no problem. But when users mix and match from two scripts interchangeably (so that Г and G can be used interchangeably for example), it does horrible things for sorting, etc. You have two equivalent forms (for collation) with different code points. Presumably when you convert upper to lower case, it just works, although there can be confusion. Is "m" a lower case Latin "M", or is it a lower case Cyrillic "T"? (Fortunately in the code points the two forms have separate identities, although visually they can be impossible to distinguish from one another.)

For POSIX, we don't support mixing both Cyrillic and Latin -- a locale must choose one or the other as a primary (though you can use both, they don't sort identically, and message catalogs will exclusively use one script or the other, depending on which is chosen.)

Garrett D'Amore said...

Btw, in case you're wondering, if I were to design a language for text processing, it would probably look a lot like English. (Actually, in this era, with ASCII *everywhere*, it would *really* look a lot like English.)

What I'd change in English is to get rid of case. Case sucks. Many languages dispense with case entirely and are better off for it.

Probably also I'd nuke articles. While I have no problem with them, Russian does fine without them. And they create problems because we insist on sorting certain things with special handling for articles -- many algorithms ignore articles for sorting purposes.

Otherwise English works really really well. Russian is slightly better off without those articles, too!

One thing that I would definitely make certain of is to ensure that accent characters were not used at all. If you want to use a different letter -- just use a different letter, don't use some accent character to modify it.

(Hmm... I wonder, does the International Phonetic Alphabet -- IPA -- meet these criteria already? I should check it out.)

Russ Williams said...

This rant is very tautological. ASCII is dominant because English speakers were dominant when computer character sets were first codified. If e.g. Polish speakers had been dominant when computer character sets were first codified, then by the same argument you'd need to say that English sucks since a hypothetical Polish-created "ASCII" might not contain Q and V and X, for example, which English needs.

Considering diacritical marks to be somehow inherently problematic or harder is completely English-centric; if ASCII had been written by Poles, it would have ą ę ł ć ś ź ż ó as fundamental distinct characters along with a e l c s z o. That ć and c look similar (one visually contains the other) has as much significance as the fact that b and l or o look similar in English (b visually contains l and o). Whatever language was used to define the core (single-byte) character set would be considered normal & convenient. There's nothing inherently more difficult about ą as a letter than a as a letter. They are simply 2 distinct letters.

You also seem a bit blind to quirks of English (or biased and more willing to forgive them) if you regard English as "pretty awesome for data-processing". E.g. special case coding needed for various irregular nouns and verbs in text output. (Or even needing to use a different form for plurals: "1 result" vs "2 results" requires more coding.) E.g. various spelling differences between US & British English.

Which is not to say that there aren't plenty of worse annoyances in some other languages. But the idea that English is "awesome" for data processing seems dubious, and very dependent on the historical accident that ASCII was created by English speakers, so by no inherent virtue of English itself it of course is most conveniently represented in ASCII.

Richard Cobbe said...

I sympathize with your difficulties in processing non-ASCII text, although you've got a much harder problem than I ever had -- I was just trying to deal with classical Greek.

There are a couple of factual points that you might find interesting, although they don't help with the underlying problem very much.

First, I'm not sure it's really straightforward to say that languages shouldn't use diacriticals but should instead use different letters. There are some cases, as in French, where e and é are pronounced differently, so different letters (or digraphs) might make sense -- though there are many cases in French where the accent doesn't indicate different pronunciation: "ou" and "où" mean "or" and "where," respectively, but they're pronounced identically.

In (modern) Greek, however, the accent on ώ indicates that this is the syllable that gets the stress, rather than a modification of the vowel sound. Maybe I'm just used to the traditional orthographies for these things, but it feels really strange to me to use different letters to indicate a supra-segmental property like stress.

Second, while IPA may look promising, it introduces lots of complications of its own. Do you go for a phonemic spelling, or a phonetic spelling? Or, to take an example from English, how do you spell "nuclear"? [nuklir] or [nukjəlɚ]? (Or is that a vocalic [l] in the second example?) I use the first pronunciation exclusively, but other people use the second -- and that's entirely within the constraints of American English. Supporting multiple dialects and accents only adds to the complexity.

kaishakunin said...

You are wrong about Russian, we do have special context-dependent collation rules. Namely, Russian letter "ё" is sorted after "е" if it is the last letter or if all of the following letters are the same; otherwise, it is treated equivalent to "е".

Here's an extract from my test cases for Unicode collation:

1) "ё", "е" -> sorting -> "е", "ё"
Here "ё" is sorted after "е" because it is the last letter.

2) "ёлка", "еда", "ель" -> sorting -> "еда", "ёлка", "ель"
Here "ё" is sorted equivalent to "е" because it is not the last letter and the subsequent letters are different.

Garrett D'Amore said...

Huh. So I didn't know that about Russian collation. So scratch it from the list of non-broken languages.

I don't think we collate English in the US the same way, but I could be mistaken. (Reading the rules for collation from the CLDR is non-trivial. At least after they have been turned into localedef grammar.)

Basically, it's starting to look like *all* natural languages suck, at least in some form. It's just that some suck worse than others. (Again, this is in the context of text processing.)

I wonder about Esperanto. It was an invented language; but I bet its inventors gave no thought to usefulness/ease for computing applications. (It being driven by political considerations rather than pragmatic considerations.)

Garrett D'Amore said...

Also, for the record, yes, I avoided grammar, which includes rules for pluralization, verbs, etc

I think pretty much *all* languages become terrible if you start to consider grammar; especially since even the most strongly rule based languages (German?) still have *some* exceptions. (English is better because it has fewer and simpler rules than many European languages; but it's worse because it has far more exceptions than most.)

I have no idea about non-European language grammars; I suspect that I'm just blissfully ignorant. Gosh, given the other challenges some of those languages have (tones, and character sets with thousands -- tens of thousands -- of glyphs), I would hope that they would have much much simpler grammars.

Russ Williams said...

Esperanto of course was indeed created before electronic computers. But it is by design much more regular than English and other ethnic/national languages. It has literally no irregular nouns or verbs, for example. And its orthography is much more rational (especially than that of English), with basically one letter = one phoneme. (Of course you will complain that some of its letters ĉĝĥĵŝŭ are not in ASCII, by accident of history.)

So it is arguably much better suited for computer processing (ignoring the ASCII issue, which by default makes English about the only suitable language if ASCII is your top priority.)

Indeed it has been used as a bridge/hub language in some translation projects, perhaps the most successful having been DLT by Toon Witkam.

I have some direct personal experience with minor text processing in Esperanto, as I wrote a very simple program to verify the syllables and accents in a very long epic poem to help out someone translating it from Vietnamese. Such a program would have been far more complex for English, requiring a dictionary database of the pronunciation of words, whereas in Esperanto the pronunciation is completely reliably deducible from the spelling.

Garrett D'Amore said...

So ASCII is a historical accident. I have no qualm with languages using different glyphs; although I object when there are thousands of them (non-phonetic languages).

In today's world, most languages will fit inside BMP of Unicode, and that's good enough for me. :-)

I do object when people use accents to modify characters but don't use a full separate code point for the new character. If we treat these as unique characters rather than composed forms, then all is well. :-)

It sounds like Esperanto is far better than many others, probably because no thinking human would intentionally create irregular forms.