On Go, Portability, and System Interfaces

September 22, 2015

I've been noticing more and more lately that we have a plethora of libraries and programs written for Go which don't work on one platform or another. The root cause is often that they invoke system calls such as ioctl() directly. On some platforms (illumos/Solaris!) there is no such system call.

The Problems

But this underscores a far, far worse problem, one that has become common (mal)practice in the Go community: coding system calls directly into high level libraries and even application programs. For example, it isn't uncommon to see something like this (taken from termbox-go):

func tcsetattr(fd uintptr, termios *syscall_Termios) error {
	r, _, e := syscall.Syscall(syscall.SYS_IOCTL,
		fd, uintptr(syscall_TCSETS), uintptr(unsafe.Pointer(termios)))
	if r != 0 {
		return os.NewSyscallError("SYS_IOCTL", e)
	}
	return nil
}

This has quite a few problems with it.

It's not platform portable. This function depends on a specific implementation of tcsetattr() in terms of specific ioctl()s. For example, TCSETS is used on Linux, but the BSDs and macOS use TIOCSETA instead.
It's not Go portable, since SYS_IOCTL isn't implemented on platforms like illumos, even though as a POSIX system we do have a working tcsetattr().
The code is actually pretty unreadable, and somewhat challenging to write correctly the first time.
The code uses unsafe.Pointer(), which is clearly something we ought to avoid.
On some platforms, the details of the ioctls are subject to change, so the code above is actually fragile. (In illumos & Solaris, system call interfaces are "undocumented", and one must use the C library to access system services. This is our "stable API boundary". This is somewhat different from Linux practice; the reasons for this difference are both historical and related to the fact that Linux delivers only a kernel, while illumos delivers a system that includes both the kernel and core libraries.)

How did we wind up in this ugly situation?

The problem, I believe, stems from some misconceptions and some historical precedents in the Go community. First, the Go community has long touted static linking as one of its significant advantages. However, I believe this has been taken too far.

Why is static linking beneficial? The obvious (to me at any rate) reason is to avoid the dependency nightmares and breakage that occurs with other systems where many dynamic libraries are brought together. For example, if A depends directly on both B and C, and B depends on C, but some future version of B depends on a newer version of C that is incompatible with the version of C that A was using, then we cannot update A to use the new B. And when the system components are shared across the entire system, the web of dependencies gets to be so challenging that managing these dependencies in real environments can become a full time job, consuming an entire engineering salary.
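The A/B/C scenario above can be made concrete with a small Go sketch. It assumes a semver-style rule in which a change of major version signals an incompatible interface; the function names and version strings are hypothetical, chosen only to mirror the example, and are not taken from any real package manager.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// major extracts MAJOR from a version string of the form "vMAJOR.MINOR.PATCH".
func major(v string) (int, error) {
	head, _, _ := strings.Cut(strings.TrimPrefix(v, "v"), ".")
	return strconv.Atoi(head)
}

// conflicts reports whether two requirements on the same shared library C
// cannot be met by a single installed copy of C. Under the assumed
// semver-style rule, different major versions are incompatible.
func conflicts(want1, want2 string) (bool, error) {
	m1, err := major(want1)
	if err != nil {
		return false, err
	}
	m2, err := major(want2)
	if err != nil {
		return false, err
	}
	return m1 != m2, nil
}

func main() {
	// A uses C v1.4.0 directly; the old B was built against C v1.2.3.
	c1, _ := conflicts("v1.4.0", "v1.2.3")
	// The new B requires C v2.0.0, which is incompatible with A's use of v1.
	c2, _ := conflicts("v1.4.0", "v2.0.0")
	fmt.Println("A + old B conflict:", c1) // false
	fmt.Println("A + new B conflict:", c2) // true
}
```

A and the old B can share one copy of C; A and the new B cannot, which is exactly why A cannot be moved to the new B.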

You can get surprising results, where upgrading one library causes unexpected failures in some other application. The desire to avoid this kind of breakage leads to building everything together into a single stand-alone executable, so that we need never fear whether our application will work in the future. As I will show, we've not really achieved this with 100% statically linked executables in Go, though I'll grant that we have greatly reduced the risk.

This is truly necessary because much of the open source ecosystem has no idea about interface stability or versioning interfaces. This is gradually changing, such that we now have ideas like semver coming around as if they were somehow new and great ideas. The reality is that commercial operating system vendors have understood the importance of stable API boundaries for a very, very long time. Some, like Sun, even made legally binding promises around the stability of their interfaces. However, in each of these cases, the boundary has to a greater or lesser extent been at the discretion of the vendor.

That changes when we consider standards such as POSIX 1003.1. Some mistakenly believe that POSIX defines system calls. It does not. It defines a C function call interface. The expectation is that many of these interfaces map 1:1 onto system calls, but the details of those system calls are completely unspecified by POSIX.

Basically, the Go folks want to minimize external dependencies and the web of failure that can lead to. Fixing that is a goal I heartily agree with. However, we cannot eliminate our dependency on the platform. And using system calls directly is actually worse, because it moves our dependency from something that is stable and defined by standards bodies, to an interface that is undocumented, not portable, and may change at any time.

If you're not willing to have a dynamic link dependency on the C library, why would you be willing to have a dependency on the operating system kernel? In fact, the former is far safer than the latter! (And on Solaris, you don't have a choice -- the Go compiler always links against the system libraries.)

Harmful results that occur with static linking

If the application depends on a library that has a critical security update, it becomes necessary to recompile the application. If you have a low level library such as a TLS or HTTP client, and a security fix for a TLS bug is necessary (and we've never ever ever had any bugs in TLS or SSL implementation, right?), this could mean recompiling a very large body of software to be sure you've closed the gaps.
With statically linked programs, even knowing which applications need to be updated can be difficult or impossible. They defy the simplest kinds of inspection, such as using tools like ldd or otool to see what they are built on top of.

What is also tragic is that static executables wind up encoding the details of the kernel system call interface into the binary. On some systems this isn't a big deal because they have a stable system call interface. (Linux mostly has this -- although glibc still has to cope with quite a few differences here by handling ENOSYS, and don't even get me started on systemd related changes.) But on systems like Solaris and illumos, we've historically considered those details a private implementation detail between libc and kernel. And to prevent applications from abusing this, we don't even deliver a static libc. This gives us the freedom to change the kernel/userland interface fairly freely, without affecting applications.

When you consider standards specifications like POSIX or X/OPEN, this approach makes a lot of sense. They standardize the C function call interface, and leave the kernel implementation up to the implementor.

But statically linked Go programs break this, badly. If that kernel interface changes, we can wind up breaking all of the Go programs that use it, although "correct" programs that only use libc will continue to work fine.

The elephant in the room (licensing)

The other problem with static linking is that it can create a license condition that is very undesirable. For example, glibc is LGPL. That means that per the terms of the LGPL it must be possible to relink against a different glibc, if you link statically.

Go programs avoid this by not including any of the C library statically. Even when cgo is used, the system libraries are linked dynamically. (This is usually the C library, but can include things like a pthreads library or other base system libraries.)

In terms of the ecosystem, the prevailing practice among Go programmers has been to use licenses like MIT, BSD, or Apache, which are permissive enough that static linking of 3rd party Go libraries is usually not a problem. I suppose this is a great benefit, in that it will help prevent GPL and LGPL code from infecting the bulk of the corpus of Go software.

The Solutions

The solution here is rather straightforward.

First, we should not eschew use of the C library, or other libraries that are part of the standard system image. I'm talking about things like libm, libc, and for those that have them, libpthread, libnsl, libsocket. Basically the standard libraries that every non-trivial program has to include. On most platforms this is just libc. If recoded to use the system's tcsetattr (which is defined to exist by POSIX), the above function looks like this:

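	package termios

	// #include <termios.h>
	import "C"

	import "os"

	// tcsetattr sets terminal attributes via the C library's POSIX tcsetattr().
	func tcsetattr(f *os.File, termios *C.struct_termios) error {
		// TCSANOW applies the change immediately; on failure cgo reports errno.
		r, err := C.tcsetattr(C.int(f.Fd()), C.TCSANOW, termios)
		if r != 0 {
			return os.NewSyscallError("tcsetattr", err)
		}
		return nil
	}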

The above implementation will cause your library or program to dynamically link against and use the standard C library on the platform. It works on all POSIX systems and, because it uses a stable, documented, standard API, it is pretty much immune to breakage from changes elsewhere in the system. (At least, any change that broke this implementation would also break so many other things that the platform would be unusable. Generally we can trust the people who make the operating system kernel and C library not to screw things up too badly.)

What would be even better, and cleaner, would be to abstract that interface above behind some Go code, converting between a Go struct and the C struct as needed, just as is done in much of the rest of the Go runtime. The logical place to do this would be in the standard Go system libraries. I'd argue rather strongly that core services like termio handling ought to be made available to Go developers in the standard system libraries that are part of Go, or perhaps more appropriately, in the golang.org/x/sys/unix repository.

In any event, if you're a Go programmer, please consider NOT calling syscall interfaces directly, but instead using higher level interfaces; and when those aren't already provided in Go, don't be afraid to use cgo to access standard functions in the C library. It's far, far better for everyone that you do this than that you code to low level system calls.

Comments

Maor said…

This is wrong on so many levels.
1. Calling a C function is 10x slower, so one should not use it where performance matters.
2. Static or dynamic linking makes no difference from the licensing side.
If the problem is the unstable kernel interface, the real solution is to build libc in as a Go library and recommend dynamically linking with it.

September 23, 2015 at 11:02 AM

Garrett D'Amore said…

Your reply is wrong on so many more levels.

1. Yes, it turns out that calling a C function from Go is a lot more expensive than calling a native Go function from Go. (This was very surprising to me; the reason it's taken so long to get back to you on this is that I wanted to understand that result, since intuitively it made no sense at all. So I spent some effort measuring and researching it. I discovered that the issue is that Go makes a context switch when calling into C code. This costs about 100-200 ns. Ordinary function call overheads are tiny. I had trouble measuring the cost to add a stack frame because of optimization.)

That said, this is *nothing* compared to the cost of a *SYSTEM CALL*. System calls are measured in usec, usually 10s of them, depending on the complexity of the system call you're calling. (In some cases, such as disk reads, it can even go to milliseconds.) I'd guess on average the cost of the transition is less than 1% of the system call you're making. If you're optimizing for that 1%, you're doing it wrong.

Second, we are talking about system calls. If these are in a hot code path (and I'm not talking about ordinary read/write/send/recv here), then you're also doing it wrong. For example, how often do you need to call an ioctl to set up terminal I/O? In my case it's *once* at the start of the program, and once at the end. Any system call you need more often than that is almost certainly going to have native support in Go.

2. You obviously have *zero* clue about software licensing. Static vs. dynamic linking is at the *heart* of the problem, particularly where GPL & LGPL type licenses are concerned. These issues are explicitly dealt with in those licenses. While I'm not a lawyer, I do have some non-trivial amount of experience and knowledge here. Please research this before you speak up; adding further confusion to the licensing considerations is most definitely *not* helpful.

September 28, 2015 at 7:46 AM

Garrett D'Amore said…

Another follow-up I want to make, based on comments made on Twitter.

First off, it has been rightly pointed out to me that using CGO screws with cross-compiling. Specifically, it makes cross-compiling much much harder because at that point you need a cross-capable gcc. Bootstrapping gcc in this manner is something akin to black magic -- embedded systems developers often set this up, but most ordinary developers consider this to be beyond reach, or at least far more effort than they want to invest in a tool chain. (I can't really blame them.)

Second, none of what I've said really applies to Windows. This is true. The Windows "system call" mechanism uses DLL symbol lookups (usually of well-known, documented symbols & DLLs) and calls through them. This *is* the documented interface to use on Windows, and generally involves neither magic numbers nor knowledge of kernel internals. Go uses this interface itself, internally, and even offers a nice "Proc" type for doing this, which exports a Call() method. About the only pain here is the lack of type safety, since these function calls all use some number of pointers & uintptr values. Not too bad though. The fact that developers ever need to be exposed to these details is an unfortunate aspect of Windows, but it's the nature of that platform. Admittedly I'd like to see Go do a better & broader job of hiding these. For example, Go *could* offer a tcsetattr() and tcgetattr() implementation that under the hood called the system's SetConsoleMode and GetConsoleMode. (I suppose the reverse is true too -- they could offer a SetConsoleMode & GetConsoleMode workalike for POSIX platforms. I'm not sure which is more familiar, but at least tcsetattr and tcgetattr are covered by formal standards.)

Of course, I believe offering more direct bindings to standard functions, avoiding the need for either cgo or direct system call binding in application programs, is superior. Under the hood, if Go wants to use undocumented syscalls, I guess I'm kind of OK with that -- it's a platform risk. Actually, on Solaris they don't do that, but instead use the safe C library calls, and Solaris is the platform where I think this is the biggest concern.

But, when the Go standard library falls short, I'd far far rather have to create a burden on the developer to either have to build on the platform or install a cross-tool chain, than to accept the situation where we can only operate on a subset of known operating systems, or worse, where the application can break unexpectedly with some future kernel upgrade.

Put another way, I'd rather accept some minor tool chain headaches for the developer and have broader, portable, and safe programs, than have a super easy environment for the developer that can only create non-portable & fragile applications.

Not everyone agrees with me.

And yes, for Linux, if you want to use system calls, you *can* use the build variants to achieve some kind of compromise. I just don't like the situation where instead of build variants that offer the compromise, developers *only* build the Linux solution, and just let everyone else break.

September 28, 2015 at 8:05 AM

Rob said…

"I just don't like the situation where instead of build variants that offer the compromise, developers *only* build the Linux solution, and just let everyone else break."

We absolutely agree here. I used to help support the Solaris KDE builds, and remember fixing so many Linux-specific commits to keep it working. I finally surrendered. In the Linux world, "portable" often means "runs on both Ubuntu and Red Hat."

I think we disagree on which approach (syscalls or cgo) is more fragile, but you make your case well. The Go core team has generally removed cgo requirements over time, and I think they're going in the right direction. I think history backs up the assertion that it's dynamic linking that leads to "fragile applications" and that static linking was a move towards robustness, not developer convenience. I'd rather write the low level code correctly for each platform I support than to hope that my portable solution will in fact work on all platforms. I've yet to find a case where large-scale systems "just work" merely by writing to documented specs.

But I'm not arguing against your experiences if they've been different. I think you've made an interesting case, and the points shouldn't be discounted out of hand.

September 28, 2015 at 8:26 AM

Garrett D'Amore said…

I think the key point is that dynamic linking against third party libraries is definitely fragile.

Conversely, in my experience, static linking against system libraries (which ultimately results in a kind of "dynamic link" against the undocumented kernel/userland system call boundary) is also fragile.

So, what I'm proposing is that we continue static linking for most libraries, but consider using cgo for those few system libraries which are platform necessary/intrinsic. (I'd limit this to libc and the tiny subset of additional libraries which are always present and export only standardized/documented APIs, e.g. libsocket/libnsl on Solaris.) In fact, in the *vast* majority of cases where I can see using cgo this way, the only library involved would be libc.

Of course, there are cases where you need functionality which just isn't possible (at least not easily) except via C foreign function calls. For example, if you needed to access some proprietary rendering library. Fortunately, I don't have to do that. The one case where it came up was libnanomsg, and instead of doing that I chose to implement a wire compatible implementation in pure Go. I guess that's not always a practical solution though. In retrospect, I'm glad that I didn't have Cgo support for illumos two years ago, otherwise I'd never have built mangos, and I think mangos is actually quite superior to the FFI based alternative, for a number of reasons. Plus I learned a lot more about Go as a result of building mangos.

September 28, 2015 at 9:25 AM

/dev/dump