Early performance numbers
I've added a benchmark tool to my Go implementation of nanomsg's SP protocols, along with the inproc transport, and I'll be pushing those changes rather shortly.
In the meantime, here are some interesting results. First up is latency, in microseconds:
transport | nanomsg 0.3beta | gdamore/sp |
---|---|---|
inproc | 6.23 | 8.47 |
ipc | 15.7 | 22.6 |
tcp | 24.8 | 50.5 |
The numbers aren't all that surprising. In Go I'm working through non-native interfaces, and my use of several goroutines to manage concurrency probably adds context switches to each exchange. I suspect my stuff might do a little better with lots and lots of servers hitting it, where I can make better use of multiple CPUs (though one could write a C program that used threads to achieve the same effect).
The story for throughput is a little less heartening though:
transport | message size | nanomsg 0.3beta | gdamore/sp |
---|---|---|---|
inproc | 4k | 4322 | 5551 |
ipc | 4k | 9470 | 2379 |
tcp | 4k | 9744 | 2515 |
inproc | 64k | 83904 | 21615 |
ipc | 64k | 38929 | 7831 (?!?) |
tcp | 64k | 30979 | 12598 |
I didn't try larger sizes yet; this is just a quick sample test, not an exhaustive performance analysis. What is interesting is that the ipc case for my code is consistently low. It uses the same underlying Go transport code as TCP, but I guess maybe we are losing some TCP-specific optimizations. (Note that the TCP tests were performed over loopback; I don't really have 40GbE on my desktop Mac. :-)
I think my results may be worse than they would otherwise be because I use the equivalent of NN_MSG to dynamically allocate each message as it arrives, whereas the nanomsg benchmarks use a preallocated buffer. Right now I'm not exposing an API for preallocated buffers (but I have considered it! It does feel unnatural, though, and more of a "benchmark special").
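To make the allocation difference concrete, here's a rough sketch of the two receive styles. This is not the gdamore/sp API or nanomsg's wire format; the four-byte length prefix and the function names are just assumptions for illustration.

```go
// Sketch only: contrast a receive that allocates a fresh buffer per message
// (NN_MSG-style "the library allocates for you") with one that reads into a
// caller-supplied buffer, as the nanomsg benchmarks do.
package bench

import (
	"encoding/binary"
	"io"
	"net"
)

// recvAlloc allocates a new slice for every message, so each receive
// produces garbage that the collector eventually has to reclaim.
func recvAlloc(c net.Conn) ([]byte, error) {
	var hdr [4]byte
	if _, err := io.ReadFull(c, hdr[:]); err != nil {
		return nil, err
	}
	body := make([]byte, binary.BigEndian.Uint32(hdr[:]))
	_, err := io.ReadFull(c, body)
	return body, err
}

// recvInto reuses a caller-supplied buffer, avoiding the per-message
// allocation entirely (as long as the buffer is large enough).
func recvInto(c net.Conn, buf []byte) (int, error) {
	var hdr [4]byte
	if _, err := io.ReadFull(c, hdr[:]); err != nil {
		return 0, err
	}
	n := int(binary.BigEndian.Uint32(hdr[:]))
	if n > len(buf) {
		return 0, io.ErrShortBuffer
	}
	_, err := io.ReadFull(c, buf[:n])
	return n, err
}
```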
That said, I'm not unhappy with these numbers. Indeed, my code seems to perform reasonably well given all the cards stacked against it: extra allocations due to the API, extra context switches from the additional concurrency of channels and goroutines in Go, and so on.
A little more detail about the tests:
All tests were performed using nanomsg 0.3beta and my current Go 1.2 tree, on my Mac running Mac OS X 10.9.2 with a 3.2 GHz Core i5. The latency tests measured full round-trip time using the REQ/REP topology and a 111-byte message size. The throughput tests used PAIR. (Good news: I've now validated that PAIR works. :-)
The IPC tests were pointed at a file path in /tmp, and the TCP tests used ports on 127.0.0.1.
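To be clear about what's being measured, here's the general shape of the latency loop. The Socket interface, the echoing REP peer, and the function names are hypothetical stand-ins, not the actual benchmark tool:

```go
// Sketch only: time N request/reply round trips and report the average.
// Socket is a hypothetical stand-in for whatever the library provides.
package bench

import (
	"fmt"
	"time"
)

type Socket interface {
	Send([]byte) error
	Recv() ([]byte, error)
}

// reqLatency drives the REQ side against a REP peer that echoes each
// message back; the result is the average time per full round trip.
func reqLatency(s Socket, msgSize, roundTrips int) (time.Duration, error) {
	msg := make([]byte, msgSize) // 111 bytes in the tests above
	start := time.Now()
	for i := 0; i < roundTrips; i++ {
		if err := s.Send(msg); err != nil {
			return 0, err
		}
		if _, err := s.Recv(); err != nil {
			return 0, err
		}
	}
	perRT := time.Since(start) / time.Duration(roundTrips)
	fmt.Printf("%d round trips, %v per round trip\n", roundTrips, perRT)
	return perRT, nil
}
```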
Note that my inproc transport tries hard to avoid copying, but it still copies once because of a mismatch between where the header and the body live. I'll probably fix that in a future update (it's an optimization, and also kind of a benchmark special, since I don't think inproc gets a lot of performance-critical use; in Go it would be more natural to use channels for that).
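As an aside, this is what I mean by channels being the more natural in-process mechanism in Go: handing a []byte over a channel moves only the slice header, so there's no copy and no transport machinery at all. A trivial, purely illustrative example:

```go
// Illustrative only: in-process message passing over a channel. Only the
// slice header crosses the channel, so the payload is never copied.
package main

import "fmt"

func main() {
	msgs := make(chan []byte, 64) // buffered pipe between sender and receiver

	go func() {
		for i := 0; i < 3; i++ {
			msgs <- []byte(fmt.Sprintf("message %d", i))
		}
		close(msgs)
	}()

	for m := range msgs {
		fmt.Printf("received %q\n", m)
	}
}
```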
Comments
Great job with your implementation, BTW. I would suggest "go-nanomsg" or "go-nano" as a name for it.
Best regards.
I might play with it. For the most part, the *application* consists only of two threads -- the sender and the receiver. The underlying library makes much more use of goroutines though.