hme GLDv3 versus qfe DLPI
So, the NICs group recently told me I should have started with qfe instead of hme, because qfe has some performance fixes (such as hardware checksum, which I added to hme!). To find out if this holds water, I ran some tests on my 360 MHz UltraSPARC-IIi system, using a PCI qfe card. You can make hme bind to the qfe ports by doing:
# rem_drv qfe
# update_drv -a -i '"SUNW,qfe"' hme
(This by the way is a nice hack to use GLDv3 features with your qfe cards today if you cannot wait for an official GLDv3 qfe port.)
Anyway, here's what I found out, using my hacked ttcp utility. Note that the times reported are "sys" times.
QFE/DLPI
MTU = 100, -n 2048
Tx: 18.3 Mbps, 7.0s (98%)
Rx: 5.7 Mbps, 2.4s (10%)
MTU = 1500, -n 20480
Tx (v4): 92.1 Mbps, 1.1s (8%)
Rx (v4): 92.2 Mbps, 1.6s (12%)
Tx (v6): 91.2 Mbps, 1.1s (8%)
Rx (v6): 90.9 Mbps, 2.6s (22%)
UDPv4 tx, 1500 (-n 20480) 90.5 Mbps, 1.6s (64%)
UDPv4 tx, 128 (-n 204800) 34.2 Mbps, 5.2s (99%)
UDPv4 tx, 64 (-n 204800) 17.4 Mbps, 5.1s (99%)
And here are the numbers for hme with GLDv3:
HME GLDv3
MTU = 100, -n 2048
Tx: 16.0 Mbps, 7.6s (93%)
Rx: 11.6 Mbps, 1.8s (16%)
MTU = 1500, -n 20480
Tx (v4): 92.1 Mbps, 1.2s (8%)
Rx (v4): 92.2 Mbps, 3.2s (24%)
Tx (v6): 90.8 Mbps, 0.8s (6%)
Rx (v6): 91.2 Mbps, 4.0s (29%)
UDPv4 tx, 1500 (-n 20480) 89.7 Mbps, 1.5s (60%)
UDPv4 tx, 128 (-n 204800) 29.4 Mbps, 6.0s (99%)
UDPv4 tx, 64 (-n 204800) 14.8 Mbps, 6.0s (99%)
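To make the two tables easier to compare side by side, here is a small sketch that computes the percent change in throughput of hme GLDv3 relative to qfe DLPI for each row. The throughput figures are copied from the tables above; the row labels, the function names, and the choice of qfe as the baseline are my own conventions for illustration. It ignores the sys-time columns.

```python
# Throughput in Mbps, copied from the two tables above: (qfe DLPI, hme GLDv3).
# Row labels are informal names for this sketch, not ttcp output.
RESULTS = {
    "Tx, 100": (18.3, 16.0),
    "Rx, 100": (5.7, 11.6),
    "Tx (v4), 1500": (92.1, 92.1),
    "Rx (v4), 1500": (92.2, 92.2),
    "Tx (v6), 1500": (91.2, 90.8),
    "Rx (v6), 1500": (90.9, 91.2),
    "UDPv4 tx, 1500": (90.5, 89.7),
    "UDPv4 tx, 128": (34.2, 29.4),
    "UDPv4 tx, 64": (17.4, 14.8),
}


def relative_change(qfe_mbps, hme_mbps):
    """Percent change of the hme GLDv3 figure relative to the qfe DLPI figure."""
    return (hme_mbps - qfe_mbps) / qfe_mbps * 100.0


def summarize(results):
    """Map each row label to its percent change, rounded to one decimal place."""
    return {name: round(relative_change(q, h), 1) for name, (q, h) in results.items()}


if __name__ == "__main__":
    for name, pct in summarize(RESULTS).items():
        # Negative means hme GLDv3 was slower than qfe DLPI on that row.
        print(f"{name:16s} {pct:+6.1f}%")
```

A negative value means hme GLDv3 moved less data than qfe DLPI on that test; a positive value means it moved more.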
So, given these numbers, it appears that either qfe is more efficient (which is possible, but I'm slightly skeptical) or the extra overhead of some of the GLDv3 support is hurting us. I'm more inclined to believe the latter. (For example, we have to check whether each packet is VLAN tagged... those features don't come for free... :-)
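To give a feel for the kind of per-packet work a VLAN check involves, here is a rough sketch in Python rather than driver C. It is not the GLDv3 code; it only illustrates the standard 802.1Q layout, where a tagged frame carries 0x8100 in the EtherType position followed by a 2-byte tag and then the real EtherType. The function name and return shape are made up for this example.

```python
import struct

ETHERTYPE_VLAN = 0x8100  # IEEE 802.1Q tag protocol identifier


def parse_ethertype(frame: bytes):
    """Return (ethertype, vlan_id or None, payload_offset) for an Ethernet frame.

    Hypothetical helper for illustration; a real driver does this in C
    on the received message block.
    """
    if len(frame) < 14:
        raise ValueError("frame shorter than an Ethernet header")
    # Bytes 12-13 hold the EtherType, or the 802.1Q TPID if the frame is tagged.
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETHERTYPE_VLAN:
        return ethertype, None, 14
    if len(frame) < 18:
        raise ValueError("truncated VLAN tag")
    # Tagged: 2 bytes of tag control information, then the real EtherType.
    tci, inner_type = struct.unpack_from("!HH", frame, 14)
    vlan_id = tci & 0x0FFF  # low 12 bits of the TCI are the VLAN ID
    return inner_type, vlan_id, 18
```

The untagged path costs one extra compare per packet, which is small, but on a 360 MHz machine pushing tiny packets those compares and branches add up.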
What is really interesting is that the hme GLDv3 work was about 3% better than the old DLPI hme. So clearly there has been more effort invested in qfe.
Interestingly enough, the performance for Rx tiny packets with GLDv3 is better. I am starting to wonder if there is a difference in the bcopy/dvma thresholds.
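For anyone unfamiliar with the idea, a copy-versus-map threshold works roughly like the sketch below: packets shorter than the threshold are copied into a fresh buffer (bcopy), while longer ones are handed up by mapping the DMA buffer instead. The threshold values and driver labels here are purely hypothetical, chosen only to show how two drivers could treat the same 100-byte packet differently; I have not checked what either driver actually uses.

```python
def rx_strategy(length: int, threshold: int) -> str:
    """Pick how a received packet of `length` bytes is passed up the stack.

    Below the threshold the data is copied ("bcopy"); at or above it the
    DMA buffer is mapped and loaned upstream ("dvma").
    """
    return "bcopy" if length < threshold else "dvma"


# Hypothetical thresholds in bytes, NOT the real values from either driver.
HYPOTHETICAL_THRESHOLDS = {"driver A": 64, "driver B": 256}

if __name__ == "__main__":
    for driver, threshold in HYPOTHETICAL_THRESHOLDS.items():
        for length in (100, 1500):
            print(f"{driver}: {length:4d}-byte packet -> {rx_strategy(length, threshold)}")
```

If the two drivers really do sit on opposite sides of a 100-byte packet like this, that alone could explain a large gap in the tiny-packet Rx numbers without affecting the 1500-byte results.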
So one of the questions that C-Team has to answer is: how important are these relatively minor differences in performance? On a faster machine, you'd be unlikely to notice at all. If this performance becomes a gating factor, I might find it difficult to putback the qfe GLDv3 conversion.
To be completely honest, tracking down the 1-2% difference in performance may not be worthwhile. I'd far rather work on finding 1-2% gains in the stack than worry about how a certain legacy driver performs.
What are your thoughts? Let me know!