Thursday, February 4, 2010

Scalability FUD

Yesterday I saw yet another round of the Linux vs. Solaris scalability debate. The Linux fans were loudly proclaiming that the claim of Solaris' superior scalability is FUD, given evidence like the Cray XT class of systems, which run Linux across thousands of processors in a single system.

The problem with comparing (or even considering!) the systems in the Top500 supercomputers when talking about "scalability" is simply that those systems are irrelevant for the typical "scalability" debate -- at least as it pertains to operating system kernels.

Irrelevant?! Yes. Irrelevant. Let me explain.

First, one must consider the typical environment and problems that are dealt with in the HPC arena. In HPC (High Performance Computing), the scientific problems under consideration are usually fully compute bound. That is to say, they spend the vast majority of their time in "user" and only a minuscule amount of time in "sys". I'd expect to find very, very few calls to inter-thread synchronization (like mutex locking) in such applications.
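The user/sys split described above can be observed directly. What follows is only a minimal sketch in Python, assuming a Unix-like system; the helper names (`cpu_split`, `compute_bound`, `syscall_heavy`) are made up for illustration. It uses `os.times()` to report how much CPU time a piece of work spent in user mode versus in the kernel.

```python
import os


def cpu_split(work):
    """Run work() and return (user_seconds, sys_seconds) it cost this process."""
    before = os.times()
    work()
    after = os.times()
    return after.user - before.user, after.system - before.system


def compute_bound(n=2_000_000):
    # Pure arithmetic, HPC-style: almost no system calls are made.
    total = 0
    for i in range(n):
        total += i * i
    return total


def syscall_heavy(n=200_000):
    # Many tiny writes: every os.write() is a trip into the kernel.
    fd = os.open(os.devnull, os.O_WRONLY)
    try:
        for _ in range(n):
            os.write(fd, b"x")
    finally:
        os.close(fd)


if __name__ == "__main__":
    for work in (compute_bound, syscall_heavy):
        user, system = cpu_split(work)
        print(f"{work.__name__}: user={user:.2f}s sys={system:.2f}s")
```

On most machines the compute-bound run should be dominated by user time, while the write loop should show a noticeably larger sys share. The exact numbers depend on the hardware, the kernel, and the timer resolution, so treat them as indicative only.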

Second, these systems are used by people who are willing to, expect to, and often need to write custom software to deal with highly parallel architectures. The software deployed into these environments is tuned for use in situations where the synchronization cost between processors is expected to be "relatively" high. Granted, the architectures still attempt to minimize such costs, using very highly optimized message passing buses and the like.

Third, many of these systems (most? all?) are based on designs that don't actually run a single system image. There is not a single, universally addressable memory space visible to all processors -- at least not without high NUMA costs requiring special programming to deliver good performance, and frequently not at all. In many ways, these systems can be considered "clusters" of compute nodes around a highly optimized network. Certainly, programming systems like the XT5 is likely to be similar in many respects to programming software for clusters using more traditional network interconnects. An extreme example of this kind of software is SETI@home, where the interconnect (the global Internet) can be extremely slow compared to the available compute power.

So why does any of this matter?

It matters because most traditional software is designed without NUMA-specific optimizations, or even cluster-specific optimizations. More traditional software used in commercial applications like databases, web servers, business logic systems, or even servers for MMORPGs spends a much larger percentage of its time in the kernel, either performing some fashion of I/O or inter-thread communication (including synchronization like mutex locks and such).

Consider a massive non-clustered database. (Note that these days many databases are designed for clustered operation.) In this situation, there will be some kind of central coordinator for locking, table access, and the like, plus a vast number of I/O operations to storage and a vast number of hits against common memory. These kinds of systems spend a lot more time doing work in the operating system kernel. This situation is going to exercise the kernel a lot more fully, and give a much truer picture of "kernel scalability" -- at least as the arguments are made by the folks arguing for or against Solaris or Linux superiority.
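To make the shape of that workload concrete, here is a rough sketch in Python of many worker threads funneling through one central lock. The `Coordinator` class and `run` function are hypothetical names, not taken from any real database. Note also that CPython's global interpreter lock serializes the threads anyway, so this illustrates the pattern of contention on a shared coordinator; it is not a kernel scalability benchmark.

```python
import threading


class Coordinator:
    """Hypothetical stand-in for a database's central lock manager."""

    def __init__(self):
        self._lock = threading.Lock()
        self.granted = 0    # requests served
        self.contended = 0  # requests that found the lock already held

    def request(self):
        # Try without blocking first, so contention can be counted.
        contended = not self._lock.acquire(blocking=False)
        if contended:
            self._lock.acquire()  # now wait for the lock like everyone else
        try:
            # Both counters are only ever touched while holding the lock.
            self.granted += 1
            if contended:
                self.contended += 1
        finally:
            self._lock.release()


def run(workers=8, requests_per_worker=10_000):
    coord = Coordinator()

    def worker():
        for _ in range(requests_per_worker):
            coord.request()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return coord


if __name__ == "__main__":
    c = run()
    print(f"granted={c.granted} contended={c.contended}")
```

Every request from every thread passes through the same lock, which is the property that makes this kind of workload sensitive to how well the underlying system handles synchronization. The contended count varies from run to run.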

Solaris aficionados claim it is more scalable in handling workloads of this nature -- that a single SMP system image supporting traditional programming approaches (for example, a single monolithic process made up of many threads) will experience better scalability on a Solaris system than on a Linux system.

I've not measured it, so I can't say for sure. But having been in both kernels (and many others), I can say that the visual evidence from reading the code is that Solaris seems like it ought to scale better in this respect than any of the other commonly available free operating systems. If you don't believe me, measure it -- and post your results online. It would be wonderful to have some quantitative data here.

Linux supporters, please, please stop pointing at the Top500 as evidence for Linux claims of superior scalability, though. If there are some more traditional commercial kinds of single-system deployments that can support your claim, then let's hear about them!


rubycodez said...

Garrett, I'm afraid your article about Linux advocate FUD also contains FUD. Saying you read some kernel source and opined that a certain result would likely happen is neither proof nor grounds for implying Solaris is more scalable. And you haven't done tests, as you said.

Now you do have some great points about how real-world benchmarks outside the realm of HPC would be more useful to the majority of the business world in comparing operating systems on the same hardware. And actually there are database benchmarks out there for non-clustered systems which might be useful to your argument, though of course benchmarks have their limitations. for example has some results which I'll mention, but maybe you'd want more cores than the 64 to 128 they mention, or have issues with what a TPC-C is doing compared to real-world use.

Even so, for raw performance we find neither Solaris nor Linux at the top of the heap at the moment, with HP-UX and AIX kicking their metaphorical keisters. RedHat Linux is in the top ten, though, while Solaris is not.

And there are other kinds of scalability besides going to a huge number of cores. GNU/Linux (the Linux kernel plus the GNU utilities and libraries that make a full working OS to run business applications, such as RedHat or SuSE or Debian or Mandrake) scales down to embedded devices, cell phones, laptops and PCs, and all the way up to SMP and NUMA systems with hundreds or (in some cases) thousands of processors for certain architectures.

For that, Solaris has problems: it supports less than 5% of the hardware devices Linux does, and only runs on UltraSPARC, 32-bit x86, or 64-bit x86 hardware. GNU/Linux can run on over a dozen types of processor besides those two or three categories, including IBM PowerPC, Intel Itanium2, and IBM System/390. Solaris isn't an option for any of that, and doesn't have the degree of hardware and device vendor lock-in.

As it happens, I architect and migrate systems for city and state governments, manufacturing plants, and financial institutions. Since the business applications most places want to run will run on Linux, I can tell you I have done some migrations to Linux from Solaris for clients with budgets in the billions of dollars; it turned out to be more cost effective. None of our clients in the past ten years has migrated the other way, nor from any other Unix to Solaris. Oracle might be able to force its customers to use Solaris, but that won't be by merit in the majority of cases in the realm of 1 to 64 core systems.

Garrett D'Amore said...

Hmmm.. I see that there are non-clustered vs clustered results for TPC. I'm not familiar with those benchmarks, but it's a bit surprising (to me at least) that Solaris is not listed there. But I'm not seeing any obvious trend in the results.

Looking at other benchmarks (like the SPEC benchmarks), it seems that the total number of cores reported, even for the largest configurations, is usually small. The exceptions are the CMT systems from Sun, like the T5440.

As for hardware support, I think you need to look at OpenSolaris, which has far broader hardware support than the legacy S10 product. It has even been ported to ARM and System/390. However, admittedly the focus for OpenSolaris is on the mainstream x86 and SPARC products. Solaris has no desire to become the operating system for your mobile phone or your set-top box.

Hopefully Oracle will soon address the lack of hardware support in S10 by creating a follow-on "Enterprise Ready" release based on OpenSolaris.

Garrett D'Amore said...

Here are some real world numbers, btw: