SP protocols improved again!

April 06, 2014

Introduction

As a result of some investigations performed in response to my first performance tests for my SP implementation, I've made a bunch of changes to my code.

First off, I discovered that my code was rather racy. When I started bumping up GOMAXPROCS and passed the -race flag to go test, I found lots of issues.

Second, there were failure scenarios where the performance fell off a cliff, as the code dropped messages, needed to retry, etc.

I've made a lot of changes to fix the errors. But I've also made a major set of changes which enable a vastly better level of performance, particularly for throughput-sensitive workloads. Note that to get these numbers, the application should "recycle" the Messages it uses: there is a new Free() API to return a Message when done with it, and a NewMessage() API to allocate one from the cache. Used buffers are cached and recycled this way, which greatly reduces the garbage collector workload.
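To make the recycling idea concrete, here is a minimal sketch of a message cache built on a buffered channel. It is not the actual implementation: the Message fields, the cache depth of 128, and the exact signatures of NewMessage() and Free() are assumptions made for illustration only.

```go
package main

import "fmt"

// Message is a hypothetical stand-in for the library's message type.
type Message struct {
	Header []byte
	Body   []byte
}

// cache holds recycled messages. A buffered channel acts as a bounded
// free list that is safe for concurrent use. The depth is arbitrary.
var cache = make(chan *Message, 128)

// NewMessage returns a message whose body can hold at least size bytes,
// reusing a cached message when one with enough capacity is available.
func NewMessage(size int) *Message {
	select {
	case m := <-cache:
		if cap(m.Body) >= size {
			// Reset lengths but keep the underlying buffers.
			m.Header = m.Header[:0]
			m.Body = m.Body[:0]
			return m
		}
		// Too small: drop it and allocate a fresh one below.
	default:
		// Cache is empty.
	}
	return &Message{
		Header: make([]byte, 0, 32),
		Body:   make([]byte, 0, size),
	}
}

// Free hands the message back to the cache. If the cache is full, the
// message is simply left for the garbage collector.
func (m *Message) Free() {
	select {
	case cache <- m:
	default:
	}
}

func main() {
	m := NewMessage(4096)
	m.Body = append(m.Body, "hello"...)
	m.Free()

	// The second allocation reuses the buffer that was just freed.
	m2 := NewMessage(1024)
	fmt.Println("recycled:", m == m2, "len:", len(m2.Body))
}
```

The point of the pattern is that a steady-state send/receive loop stops allocating new buffers, so the collector has far less garbage to trace.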

Throughput

So, here are the new numbers for throughput, compared against my previous runs on the same hardware, including tests against the nanomsg reference itself.

Throughput Comparison
(Mb/s)
transport	nanomsg 0.3beta	old gdamore/sp	new (1 thread)	new (2 threads)	new (4 threads)	new (8 threads)
inproc 4k	4322	5551	6629	7751	8654	8841
ipc 4k	9470	2379	6176	6615	5025	5040
tcp 4k	9744	2515	3785	4279	4411	4420
inproc 64k	83904	21615	45618	35044^b	44312	47077
ipc 64k	38929	7831^a	48400	65190	64471	63506
tcp 64k	30979	12598	34994	49608	53064	53432

^a I think this poor result is from retries or resubmits inside the old implementation.
^b I cannot explain this dip; I think maybe unrelated activity or GC activity may be to blame

The biggest gains are with large frames (64K), although there are gains for the 4K size as well. nanomsg still outperforms my code at the 4K size, but with 64K my message caching changes pay dividends and my code actually beats nanomsg rather handily for the TCP and IPC cases.

I think for 4K, we're hurting due to inefficiencies in the Go TCP handling below my code. My guess is that there is a higher per-packet cost here, and that is what is killing us. This may be true for the IPC case as well. Still, these are very respectable numbers, and for some very real and useful workloads my implementation compares well with, and even beats, the reference.

The new code really shows some nice gains for concurrency, and makes good use of multiple CPU cores.

There are a few mysteries though. Notes "a" and "b" point to two of them. The third is that the IPC performance takes a dip when moving from 2 threads to 4. It still significantly outperforms the TCP side though, and is still performing more than twice as fast as my first implementation, so I guess I shouldn't complain too much.

Latency

The latency has shown some marked improvements as well. Here are new latency numbers.

Latency Comparison
(usec/op)
transport	nanomsg 0.3beta	old gdamore/sp	new (1 thread)	new (2 threads)	new (4 threads)	new (8 threads)
inproc	6.23	8.47	6.56	9.93	11.0	11.2
ipc	15.7	22.6	27.7	29.1	31.3	31.0
tcp	24.8	50.5	41.0	42.7	42.9	42.9

All in all, the round trip times are reasonably respectable. I am especially proud of how close I've come to the best inproc time: a mere 330 nsec (6.56 vs. 6.23 usec) separates the Go implementation from the nanomsg native C version. When you factor in the heavy use of goroutines, this is truly impressive. To be honest, I suspect that most of those 330 nsec are actually lost in the extra data copy that my inproc implementation has to perform to simulate the "streaming" nature of real transports (i.e. data and headers are not separate on message ingress).

There's a sad side to the story as well. TCP handling seems to be less than ideal in Go. I'm guessing that some effort is made to use larger TCP windows, and Nagle may be at play here as well (I've not checked). Even so, I've made a roughly 20% improvement in TCP latency over my first pass (50.5 down to 41.0 usec).
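One way to take Nagle out of the picture when measuring is to set the option explicitly on each connection. The sketch below uses net.TCPConn.SetNoDelay from the standard library; the helper names dialNoDelay and loopbackNoDelay are hypothetical. As far as I know, Go's net package documents no-delay as the default for TCP connections, so this call may well be a no-op, but it makes the assumption visible in the test code.

```go
package main

import (
	"fmt"
	"net"
)

// dialNoDelay connects to addr and explicitly disables Nagle's algorithm
// (sets TCP_NODELAY) on the resulting connection.
func dialNoDelay(addr string) (*net.TCPConn, error) {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return nil, err
	}
	tc, ok := conn.(*net.TCPConn)
	if !ok {
		conn.Close()
		return nil, fmt.Errorf("not a TCP connection: %T", conn)
	}
	if err := tc.SetNoDelay(true); err != nil {
		tc.Close()
		return nil, err
	}
	return tc, nil
}

// loopbackNoDelay exercises dialNoDelay against a throwaway listener
// on the loopback interface.
func loopbackNoDelay() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	defer ln.Close()

	// Accept a single connection and close it straight away.
	go func() {
		if c, err := ln.Accept(); err == nil {
			c.Close()
		}
	}()

	tc, err := dialNoDelay(ln.Addr().String())
	if err != nil {
		return err
	}
	return tc.Close()
}

func main() {
	if err := loopbackNoDelay(); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("connected with TCP_NODELAY set")
}
```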

The other really nice thing is near linear scalability when threads (via bumping GOMAXPROCS) are added. There is very, very little contention in my implementation. (I presume some underlying contention for the channels exists, but this seems to be on the order of only a usec or so.) Programs that utilize multiple goroutines are likely to benefit well from this.
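For readers who want to try the same kind of experiment, here is a rough sketch of how one might vary GOMAXPROCS and time channel round trips between goroutines. It is not my benchmark harness: the roundTrips helper, the number of pairs, and the round count are all made up for illustration, and it measures bare channels rather than any SP transport.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// roundTrips runs `pairs` independent ping-pong pairs, each performing
// `rounds` round trips over unbuffered channels, and returns the total
// number of round trips completed.
func roundTrips(pairs, rounds int) int {
	var wg sync.WaitGroup
	counts := make([]int, pairs)

	for i := 0; i < pairs; i++ {
		ping := make(chan struct{})
		pong := make(chan struct{})

		// Echo side: answer every ping until the channel is closed.
		go func() {
			for range ping {
				pong <- struct{}{}
			}
		}()

		// Driver side: send a ping and wait for the matching pong.
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			for n := 0; n < rounds; n++ {
				ping <- struct{}{}
				<-pong
				counts[i]++ // each driver writes only its own slot
			}
			close(ping)
		}(i)
	}

	wg.Wait()
	total := 0
	for _, c := range counts {
		total += c
	}
	return total
}

func main() {
	for _, procs := range []int{1, 2, 4, 8} {
		runtime.GOMAXPROCS(procs)
		start := time.Now()
		n := roundTrips(procs, 50000)
		elapsed := time.Since(start)
		fmt.Printf("GOMAXPROCS=%d: %d round trips in %v\n", procs, n, elapsed)
	}
}
```

Running it under the race detector (the -race flag) is a cheap way to confirm that a harness like this is not itself racy before trusting its numbers.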

Conclusion

Simplifying the code to avoid certain indirection (extra passes through additional channels and goroutines), and adding a message pooling layer, have yielded enormous performance gains. Go performs quite respectably in this messaging application, comparing favorably with a native C implementation. It also benefits from additional concurrency.

One thing I really found was that it took some extra time to get my layering model correct. I traded complexity in the core for some extra complexity in the Protocol implementations. But this avoided a whole other round of context switches, and enormous complexity. My use of linked lists, and the ugliest bits of mutex and channel synchronization around list-based queues, were removed. While this means more work for protocol implementors, the reduction in overall complexity leads to marked performance and reliability gains.
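As an illustration of the general idea (and only that; this is not the code from my implementation), a buffered channel can stand in for a mutex-protected linked-list queue. The queue type and its tryPush/tryPop methods below are hypothetical names chosen for this sketch.

```go
package main

import "fmt"

// queue is a bounded FIFO backed by a buffered channel. The channel
// supplies both the storage and the synchronization, so there is no
// separate mutex or list to maintain.
type queue struct {
	ch chan []byte
}

func newQueue(depth int) *queue {
	return &queue{ch: make(chan []byte, depth)}
}

// tryPush enqueues b without blocking. It reports false when the queue
// is full, leaving the caller to decide whether to drop or retry.
func (q *queue) tryPush(b []byte) bool {
	select {
	case q.ch <- b:
		return true
	default:
		return false
	}
}

// tryPop dequeues the oldest item without blocking. It reports false
// when the queue is empty.
func (q *queue) tryPop() ([]byte, bool) {
	select {
	case b := <-q.ch:
		return b, true
	default:
		return nil, false
	}
}

func main() {
	q := newQueue(2)
	fmt.Println(q.tryPush([]byte("one")))   // fits
	fmt.Println(q.tryPush([]byte("two")))   // fits
	fmt.Println(q.tryPush([]byte("three"))) // queue is full

	for {
		b, ok := q.tryPop()
		if !ok {
			break
		}
		fmt.Println(string(b))
	}
}
```

A consumer that wants to block rather than poll can simply receive from the channel directly, or combine it with other channels in a select.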

I'm now looking forward to putting this code into production use.
