Fun with terminals, character sets, Unicode, and Go

As part of my recent work on Tcell, I've added some pretty cool functionality for folks who want their applications to work reasonably in many different locales.

For example, if your terminal is running in UTF-8, you can access a huge repertoire of glyphs / characters.

But if you're running in a non-UTF-8 terminal, such as an older ISO 8859-1 (Latin 1) or KOI8-R (Russian) locale, you might have problems.  Your terminal won't be able to display UTF-8, and your keystrokes will probably be reported to the application using some 8-bit encoding that is incompatible with UTF-8.  (For ASCII characters, everything will work, but if you want to enter a different character, like Я (Russian "ya"), you're going to have difficulties.)

If you work on the console of your operating system, you probably have somewhere around 220 characters to play with.  You're going to miss some of those glyphs.

Go of course works with UTF-8 natively.  Which is just awesome.

Until you have to work in one of these legacy environments.  And some of the environments are not precisely "legacy".  (GB18030 covers the same repertoire as Unicode, but uses a different encoding scheme than UTF-8, and is legally mandatory within China.)

If you use Tcell for your application's user interface, this is now "fixed".

Tcell will attempt to convert output to characters that the user's terminal understands, provided the user's environment variables are set properly ($LANG, $LC_ALL, $LC_CTYPE, per POSIX).  It will also convert the user's keystrokes from the encoding of their native locale to UTF-8.  This means that YOU, the application developer, can just worry about UTF-8, and skip the rest.  (Unless you want to add new Encodings, which is entirely possible.)
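To make the environment variable part concrete, here is a minimal sketch of how a program might work out the terminal's character set from those variables.  This is not Tcell's actual code; the names localeCharset and charsetFromEnv are made up for illustration, and the only thing taken from POSIX is the precedence (LC_ALL, then LC_CTYPE, then LANG) and the language_TERRITORY.codeset@modifier shape of a locale name.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// localeCharset returns the codeset part of a POSIX locale name such as
// "ru_RU.KOI8-R" or "en_US.UTF-8@euro".  It returns "" when the name has
// no codeset (for example "C").  Hypothetical helper, not a Tcell API.
func localeCharset(locale string) string {
	// Drop an optional "@modifier" suffix first.
	if i := strings.IndexByte(locale, '@'); i >= 0 {
		locale = locale[:i]
	}
	i := strings.IndexByte(locale, '.')
	if i < 0 {
		return ""
	}
	return strings.ToUpper(locale[i+1:])
}

// charsetFromEnv applies the POSIX precedence: LC_ALL wins over LC_CTYPE,
// which wins over LANG.  The getenv function is passed in so the lookup
// can be exercised without touching the real environment.
func charsetFromEnv(getenv func(string) string) string {
	for _, name := range []string{"LC_ALL", "LC_CTYPE", "LANG"} {
		if v := getenv(name); v != "" {
			return localeCharset(v)
		}
	}
	return ""
}

func main() {
	fmt.Printf("terminal charset: %q\n", charsetFromEnv(os.Getenv))
}
```

An empty result means the environment did not name a codeset, in which case falling back to plain ASCII is probably the safest assumption.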

Tcell even goes further.

It will use the alternate character set (ACS) to convert Unicode drawing characters to the characters supported by the terminal, if they exist -- or to reasonable ASCII fallbacks if they don't.  (Just like ncurses!)
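The ASCII fallback half of that is easy to picture.  The following is a rough sketch under my own assumptions, not a copy of what Tcell does: the asciiFallbacks table and the canDisplay callback are hypothetical, and the ACS step (which depends on the terminal's terminfo entry) is left out entirely.

```go
package main

import "fmt"

// asciiFallbacks is a small, hypothetical table mapping Unicode
// box-drawing runes to ASCII stand-ins.  A real table would be larger.
var asciiFallbacks = map[rune]rune{
	'─': '-',
	'│': '|',
	'┌': '+',
	'┐': '+',
	'└': '+',
	'┘': '+',
	'┼': '+',
}

// fallbackRune returns r if the terminal can show it, otherwise an ASCII
// stand-in from the table, otherwise '?'.
func fallbackRune(r rune, canDisplay func(rune) bool) rune {
	if canDisplay(r) {
		return r
	}
	if f, ok := asciiFallbacks[r]; ok {
		return f
	}
	return '?'
}

// render applies fallbackRune to every rune in s.
func render(s string, canDisplay func(rune) bool) string {
	out := make([]rune, 0, len(s))
	for _, r := range s {
		out = append(out, fallbackRune(r, canDisplay))
	}
	return string(out)
}

func main() {
	// Pretend the terminal can only show 7-bit ASCII.
	asciiOnly := func(r rune) bool { return r < 0x80 }
	fmt.Println(render("┌──┐", asciiOnly))
	fmt.Println(render("│ok│", asciiOnly))
	fmt.Println(render("└──┘", asciiOnly))
}
```

On a terminal that can show the box-drawing runes, canDisplay returns true and the text passes through untouched.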

It also copes with East Asian full-width (or ambiguous width) characters, and even handles combining characters properly.  (If your terminal supports them, they are rendered properly on the terminal.  If it doesn't, Tcell makes a best attempt at rendering -- preserving layout and presenting the primary character even if the combining character cannot be rendered.)
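To give a feel for the bookkeeping involved, here is a simplified sketch that groups a string into screen cells: each base character picks up any combining marks that follow it, and each cell gets a column width.  Everything here is an assumption for illustration.  The cell type and function names are hypothetical, and treating "any Han character" as width 2 and everything else as width 1 is a crude stand-in for the real Unicode East Asian Width data.

```go
package main

import (
	"fmt"
	"unicode"
)

// cell is a hypothetical screen cell: one base rune, any combining marks
// attached to it, and the number of terminal columns it occupies.
type cell struct {
	base      rune
	combining []rune
	width     int
}

// isCombining reports whether r is a nonspacing (Mn) or enclosing (Me) mark.
func isCombining(r rune) bool {
	return unicode.Is(unicode.Mn, r) || unicode.Is(unicode.Me, r)
}

// cellWidth is a crude approximation: Han characters take two columns,
// everything else takes one.
func cellWidth(r rune) int {
	if unicode.Is(unicode.Han, r) {
		return 2
	}
	return 1
}

// layout splits s into cells, attaching combining marks to the preceding
// base rune so they never consume a column of their own.
func layout(s string) []cell {
	var cells []cell
	for _, r := range s {
		if isCombining(r) && len(cells) > 0 {
			last := &cells[len(cells)-1]
			last.combining = append(last.combining, r)
			continue
		}
		cells = append(cells, cell{base: r, width: cellWidth(r)})
	}
	return cells
}

func main() {
	total := 0
	for _, c := range layout("e\u0301 中文") {
		total += c.width
		fmt.Printf("%q +%d combining, width %d\n", c.base, len(c.combining), c.width)
	}
	fmt.Println("total columns:", total)
}
```

The point of keeping the combining marks with their base rune is that a terminal which cannot draw them can still be sent the base rune alone, and the column count does not change.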

The Unicode (and non-Unicode translation) handling capabilities in Tcell far exceed any other terminal handling package I'm aware of.

Here are some interesting screen caps, taken on a Mac using the provided unicode.go test program.

First the UTF-8.  Note the Arabic, the correct spacing of the Chinese glyphs, and the correct rendering of combining characters.  (Note also that emoji are reported as width one, instead of two, and so take up more space than they should.  This is a font bug on my system -- Unicode says these are Narrow characters.)

Then we run in ISO 8859-1 (Latin 1).  Here you can see the accented character available in the Icelandic word, and some terminal-specific replacements have been made for the drawing glyphs.  ISO 8859-1 lacks most of the unusual or Asian glyphs, and so those are rendered as "?".  This is done by Tcell -- the terminal never sees the raw Unicode/UTF-8.  That's important, since sending the raw UTF-8 could cause my terminal to do bad things.

Note also that the widths are properly handled, so that even though we cannot display the combining characters, nor the full-width Chinese characters, the widths are correct -- 1 cell is taken for the combining character combinations, and 2 cells are taken by the full width Chinese characters.
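For the curious, the "?" substitution for Latin 1 can be sketched in a few lines, because the first 256 Unicode code points line up one-for-one with ISO 8859-1.  This is only an illustration with a hypothetical toLatin1 function, not Tcell's implementation, and it deliberately ignores the width handling described above (it emits a single "?" per unsupported character).

```go
package main

import "fmt"

// toLatin1 converts a UTF-8 string to ISO 8859-1 bytes.  Code points up to
// U+00FF map directly to the byte with the same value; anything else is
// replaced with '?' so raw UTF-8 never reaches the terminal.
func toLatin1(s string) []byte {
	out := make([]byte, 0, len(s))
	for _, r := range s {
		if r <= 0xFF {
			out = append(out, byte(r))
		} else {
			out = append(out, '?')
		}
	}
	return out
}

func main() {
	// Print the encoded bytes in hex so the result is visible on any terminal.
	fmt.Printf("% x\n", toLatin1("Þór 中"))
}
```

Other 8-bit encodings such as KOI8-R need a real lookup table instead of the direct cast, since their upper 128 bytes do not match Unicode's ordering.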

Then we show off legacy Russian (KOI8-R).  Here you can see Cyrillic is rendered properly, as well as the basic ASCII and the alternate (ACS) drawing characters (mostly), while the rest are filled with placeholder "?" characters.

And, for those of you in mainland China, here's GB18030.  It's somewhat amusing that the system font seems unable to cope with the combining enclosure here.  Again, this is a font deficiency in the system.

As you can see, we have a lot of rendering options.  Input is filtered and converted too.  Unfortunately, the mouse test program I use to verify this doesn't really show this (since you can't see what I typed), but the Right Thing happens on input too.

Btw, those of you looking for mouse usage in your terminal should be very happy with Tcell.  As far as I can tell, Tcell offers better mouse handling on stock XTerm than every other terminal package.  This includes live mouse reporting, click-drag reporting, etc.  Here's what the test program looks like on my system, after I've click-dragged to create a few boxes:

I'm super tempted to put all this together to write an old DOS-style game.  I think Tcell has everything necessary here to be used as the basis for some really cool terminal hacks.

Give it a whirl if you like, and let me know what you think.


Garrett D'Amore said…
I just noticed that the GB18030 didn't include the glyphs for the Chinese full-width characters. That is fixed now.. see
