SP protocols improved again!
Introduction
First off, I discovered that my code was rather racy. When I started bumping up GOMAXPROCS and ran go test with the -race flag, I found lots of issues.
Second, there were failure scenarios where the performance fell off a cliff, as the code dropped messages, needed to retry, etc.
I've made a lot of changes to fix the errors. But I've also made a major set of changes that enable a vastly better level of performance, particularly for throughput-sensitive workloads. Note that to get these numbers, the application should "recycle" the Messages it uses (via a new Free() API; there is also a NewMessage() API to allocate from the cache), which caches and reuses message buffers, greatly reducing the garbage collector's workload.
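The cache itself lives inside the library; the sketch below is only a rough illustration of the recycling pattern using Go's sync.Pool. The Message fields, buffer capacities, and function bodies here are assumptions for illustration, not the library's actual layout.

```go
// Illustrative sketch only: a buffer-recycling message cache built on
// sync.Pool, showing the general pattern behind a NewMessage()/Free()
// style API. The real library's types and internals may differ.
package msgcache

import "sync"

// Message is a hypothetical stand-in for the library's message type.
type Message struct {
	Header []byte
	Body   []byte
}

var pool = sync.Pool{
	New: func() interface{} {
		return &Message{
			Header: make([]byte, 0, 64),
			Body:   make([]byte, 0, 4096),
		}
	},
}

// NewMessage returns a message from the cache, or a freshly allocated one.
func NewMessage() *Message {
	return pool.Get().(*Message)
}

// Free resets the message and returns it to the cache so its buffers
// can be reused instead of becoming garbage.
func (m *Message) Free() {
	m.Header = m.Header[:0]
	m.Body = m.Body[:0]
	pool.Put(m)
}
```

The essential point is that Free() hands the buffers back for reuse, so a busy send/receive loop allocates almost nothing per message and the garbage collector has far less to do.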
So, here are the new numbers for throughput, compared against my previous runs on the same hardware, including tests against the nanomsg reference itself.
Throughput
transport | nanomsg 0.3beta | old gdamore/sp | new (1 thread) | new (2 threads) | new (4 threads) | new (8 threads) |
---|---|---|---|---|---|---|
inproc 4k | 4322 | 5551 | 6629 | 7751 | 8654 | 8841 |
ipc 4k | 9470 | 2379 | 6176 | 6615 | 5025 | 5040 |
tcp 4k | 9744 | 2515 | 3785 | 4279 | 4411 | 4420 |
inproc 64k | 83904 | 21615 | 45618 | 35044 (b) | 44312 | 47077 |
ipc 64k | 38929 | 7831 (a) | 48400 | 65190 | 64471 | 63506 |
tcp 64k | 30979 | 12598 | 34994 | 49608 | 53064 | 53432 |
(a) I think this poor result is from retries or resubmits inside the old implementation.
(b) I cannot explain this dip; perhaps unrelated system activity or GC activity is to blame.
The biggest gains are with large frames (64K), although there are gains at the 4K size as well. nanomsg still outperforms at 4K, but at 64K my message-caching changes pay dividends and my code beats nanomsg rather handily for the TCP and IPC cases.
I think for 4K we're hurting due to inefficiencies in the Go TCP handling below my code. My guess is that there is a higher per-packet cost here, and that is what is killing us. This may be true for the IPC case as well. Still, these are very respectable numbers, and for some very real and useful workloads my implementation compares favorably with, and even beats, the reference.
The new code really shows some nice gains for concurrency, and makes good use of multiple CPU cores.
There are a few mysteries though. Notes (a) and (b) point to two of them. The third is that the IPC performance takes a dip when moving from 2 threads to 4. It still significantly outperforms the TCP side though, and is still performing more than twice as fast as my first implementation, so I guess I shouldn't complain too much.
The latency has shown some marked improvements as well. Here are the new latency numbers, in microseconds.
Latency
transport | nanomsg 0.3beta | old gdamore/sp | new (1 thread) | new (2 threads) | new (4 threads) | new (8 threads) |
---|---|---|---|---|---|---|
inproc | 6.23 | 8.47 | 6.56 | 9.93 | 11.0 | 11.2 |
ipc | 15.7 | 22.6 | 27.7 | 29.1 | 31.3 | 31.0 |
tcp | 24.8 | 50.5 | 41.0 | 42.7 | 42.9 | 42.9 |
All in all, the round trip times are reasonably respectable. I am especially proud of how close I've come to the best inproc time -- a mere 330 nsec separates the Go implementation from the nanomsg native C version. When you factor in the heavy use of goroutines, this is truly impressive. To be honest, I suspect that most of those 330 nsec are actually lost in the extra data copy that my inproc implementation has to perform to simulate the "streaming" nature of real transports (i.e. data and headers are not separate on message ingress).
There's a sad side to the story as well. TCP handling seems to be less than ideal in Go. I'm guessing that some effort is made to use larger TCP windows, and Nagle may be at play here as well (I've not checked). Even so, I've made a 20% improvement in TCP latencies since my first pass.
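If Nagle does turn out to be a factor, Go's standard net package exposes a knob for it. The sketch below only illustrates that knob; it is not something the library necessarily does, and dialNoDelay is a hypothetical helper name.

```go
// Illustrative only: making TCP_NODELAY explicit on a Go TCP connection.
// Whether Nagle actually matters for these transports is the open
// question raised in the text.
package tcptune

import "net"

// dialNoDelay dials addr and disables Nagle explicitly. Go already
// calls SetNoDelay(true) by default on TCP connections, so this mostly
// documents intent for latency-sensitive traffic.
func dialNoDelay(addr string) (*net.TCPConn, error) {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return nil, err
	}
	tc := conn.(*net.TCPConn)
	if err := tc.SetNoDelay(true); err != nil {
		tc.Close()
		return nil, err
	}
	return tc, nil
}
```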
The other really nice thing is near-linear scalability as threads are added (by bumping GOMAXPROCS). There is very, very little contention in my implementation. (I presume some underlying contention for the channels exists, but this seems to be on the order of only a microsecond or so.) Programs that use multiple goroutines are likely to benefit nicely from this.
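For the curious, here is a minimal, self-contained sketch of how this kind of scaling can be exercised by varying GOMAXPROCS around a fixed pool of goroutines. doWork and the iteration counts are placeholders, not the actual benchmark code.

```go
// Illustrative sketch: timing a fixed goroutine workload at several
// GOMAXPROCS settings. doWork stands in for a real send/receive loop.
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

func doWork(n int) {
	// Placeholder: burn a little CPU to stand in for message processing.
	s := 0
	for i := 0; i < n; i++ {
		s += i
	}
	_ = s
}

func main() {
	for _, procs := range []int{1, 2, 4, 8} {
		runtime.GOMAXPROCS(procs)

		start := time.Now()
		var wg sync.WaitGroup
		for g := 0; g < 8; g++ { // fixed number of goroutines
			wg.Add(1)
			go func() {
				defer wg.Done()
				doWork(5000000)
			}()
		}
		wg.Wait()
		fmt.Printf("GOMAXPROCS=%d elapsed=%v\n", procs, time.Since(start))
	}
}
```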
Conclusion
One thing I really found was that it took some extra time to get my layering model correct. I traded complexity in the core for some extra complexity in the Protocol implementations, but this avoided a whole extra round of context switches and a great deal of machinery: the linked lists, and the ugliest bits of mutex and channel synchronization around list-based queues, are gone. While this means more work for protocol implementors, the reduction in overall complexity leads to marked performance and reliability gains.
I'm now looking forward to putting this code into production use.