Early performance numbers

March 26, 2014

I've added a benchmark tool to my Go implementation of nanomsg's SP protocols, along with the inproc transport, and I'll be pushing those changes rather shortly.

In the meantime, here are some interesting results:

Latency Comparison
(usec/op)
transport	nanomsg 0.3beta	gdamore/sp
inproc	6.23	8.47
ipc	15.7	22.6
tcp	24.8	50.5

The numbers aren’t all that surprising. Because I’m using Go, I’m going through non-native interfaces, and my use of several goroutines to manage concurrency probably causes more context switches per exchange. I suspect my code might do a little better with lots and lots of servers hitting it, where I can make better use of multiple CPUs (though one could write a C program that used threads to achieve the same effect).

The story for throughput is a little less heartening though:

Throughput Comparison
(Mb/s)
transport	message size	nanomsg 0.3beta	gdamore/sp
inproc	4k	4322	5551
ipc	4k	9470	2379
tcp	4k	9744	2515
inproc	64k	83904	21615
ipc	64k	38929	7831 (?!?)
tcp	64k	30979	12598

I didn't try larger sizes yet; this is just a quick sample test, not an exhaustive performance analysis. What is interesting is that the ipc case for my code is consistently low. In Go it uses the same underlying transport code as TCP, so I guess we may be losing some TCP-specific optimizations. (Note that the TCP tests were performed using loopback. I don't really have 40GbE on my desktop Mac. :-)

I think my results may be worse than they would otherwise be, because I use the equivalent of NN_MSG to dynamically allocate each message as it arrives, whereas the nanomsg benchmarks use a preallocated buffer. Right now I'm not exposing an API to use preallocated buffers (but I have considered it! It does feel unnatural though, and more of a "benchmark special".)

That said, I'm not unhappy with these numbers. Indeed, it seems that my code performs reasonably well given all the cards stacked against it. (Extra allocations due to the API, extra context switches due to extra concurrency using channels and goroutines in Go, etc.)

A little more detail about the tests.

All tests were performed using nanomsg 0.3beta and my current Go 1.2 tree, on my Mac running Mac OS X 10.9.2 with a 3.2 GHz Core i5. The latency tests used full round-trip timing using the REQ/REP topology and a 111-byte message size. The throughput tests were performed using PAIR. (Good news: I've now validated that PAIR works. :-)
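For a feel of what "full round-trip timing" means, here is a minimal sketch that times a request/reply exchange over TCP loopback. It uses the plain net package rather than my SP library or nanomsg, so it carries no REQ/REP headers, and the function name measureRoundTrip is hypothetical. Treat it as an illustration of the method, not as the benchmark tool itself.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"time"
)

// measureRoundTrip sends count messages of size bytes to an echo peer on
// loopback, waiting for each reply before sending the next, and returns
// the mean time per round trip.
func measureRoundTrip(size, count int) (time.Duration, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer ln.Close()

	// Echo side: read one full message, write it straight back.
	go func() {
		c, err := ln.Accept()
		if err != nil {
			return
		}
		defer c.Close()
		buf := make([]byte, size)
		for {
			if _, err := io.ReadFull(c, buf); err != nil {
				return
			}
			if _, err := c.Write(buf); err != nil {
				return
			}
		}
	}()

	c, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return 0, err
	}
	defer c.Close()

	msg := make([]byte, size)
	start := time.Now()
	for i := 0; i < count; i++ {
		if _, err := c.Write(msg); err != nil {
			return 0, err
		}
		if _, err := io.ReadFull(c, msg); err != nil {
			return 0, err
		}
	}
	return time.Since(start) / time.Duration(count), nil
}

func main() {
	d, err := measureRoundTrip(111, 10000)
	if err != nil {
		panic(err)
	}
	fmt.Println("mean round trip:", d)
}
```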

The IPC tests were directed at a file path in /tmp, and the TCP tests used 127.0.0.1 ports.

Note that my inproc tries hard to avoid copying, but it still copies due to a mismatch about header vs. body location. I'll probably fix that in a future update (it's an optimization, and also kind of a benchmark special, since I don't think inproc gets a lot of performance-critical use). In Go, it would be more natural to use channels for that.
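As a small illustration of the channel approach, the sketch below hands a message from one goroutine to another over a channel. The function name viaChannel is hypothetical. Sending a []byte on a channel passes only the slice header, so the receiver sees the same backing array and no message bytes are copied.

```go
package main

import "fmt"

// viaChannel hands msg to a receiver over an unbuffered channel and
// returns the slice as the receiver sees it.
func viaChannel(msg []byte) []byte {
	ch := make(chan []byte)
	go func() { ch <- msg }() // sender goroutine
	return <-ch               // receiver side
}

func main() {
	msg := []byte("hello")
	got := viaChannel(msg)
	// Same backing array on both sides: the payload was not copied.
	fmt.Println(string(got), &got[0] == &msg[0])
}
```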

Comments

Aram Hăvărneanu said…

Did you set GOMAXPROCS?

March 27, 2014 at 5:22 AM

Gonzus said…

Interesting results, Garrett. I think you should definitely add a way to use a preallocated buffer (maybe passed in as a parameter) to send / recv messages. This is what made the biggest difference in bringing the performance of the Java binding for ZeroMQ to a comparable level with the native library; I implemented that using ByteBuffers.

Great job with your implementation, BTW. I would suggest "go-nanomsg" or "go-nano" as a name for it.

Best regards.

March 27, 2014 at 7:42 AM

Garrett D'Amore said…

No I didn't try GOMAXPROCS.

I might play with it. For the most part, the *application* consists only of two threads -- the sender and the receiver. The underlying library makes much more use of goroutines though.

March 27, 2014 at 8:11 AM
