I've recently hacked up the Realtek driver (rtls) to support GLDv3. It's part of usr/closed right now (though I hope we can open source it!), so I can only share binaries.
Anyway, if you're stuck with this driver on your x86 system (usually because it's on your motherboard) and you want to try running a GLDv3 version of the driver, let me know.
The GLDv3 brings link aggregation support, VLAN support, and virtualization (IP instances) with it.
Of course the hardware is still somewhat crummy, so I wouldn't expect to get much performance out of it. But again, if you're stuck with it (as many people probably are) this may be helpful.
Tuesday, August 14, 2007
Monday, August 6, 2007
Dropping the "C"
For those not in the know, it's now official: I'll be (re-)joining Sun as a regular full-time employee starting August 20th. That means I get to drop the "C" in front of my employee ID.
I'll be reporting to Neal Pollack, initially working on various Intel related Solaris projects.
Wednesday, August 1, 2007
hme checksum limitations
(This blog entry is as much for the benefit of other FOSS developers as it is for OpenSolaris.)
Please have a look at 6587116, which points out a hardware limitation in the hme chipset. I've found that at least NetBSD, and probably Linux as well, are affected, because they expect the chip to support hardware checksum offload. However, if the packet is shorter than 64 bytes (not including the FCS), the hardware IP checksum engine fails. This means that all packets which get padded, and even some that are otherwise legal (not needing padding), will not be checksummed properly.
For these packets, software checksum must be used.
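The driver-side logic amounts to a length check before trusting the offload engine. Here's a minimal sketch of that idea; the names and the standalone checksum routine are mine for illustration, not code from the actual hme driver:

```c
#include <stdint.h>
#include <stddef.h>

/* Frames shorter than this (excluding FCS) trip the hme erratum. */
#define HME_CSUM_MIN_LEN	64

/*
 * Classic software ones-complement checksum, as a fallback for frames
 * the hardware engine cannot handle.
 */
static uint16_t
sw_cksum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;

	while (len > 1) {
		sum += ((uint32_t)buf[0] << 8) | buf[1];
		buf += 2;
		len -= 2;
	}
	if (len)
		sum += (uint32_t)buf[0] << 8;
	/* Fold carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return ((uint16_t)~sum);
}

/* Only frames at or above the minimum length may use hardware offload. */
static int
can_offload_cksum(size_t frame_len)
{
	return (frame_len >= HME_CSUM_MIN_LEN);
}
```

Anything padded up to the 60/64-byte Ethernet minimum falls into the software path.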
partial checksum bug
As a result of investigation of a fix for 6587116 (a bug in HME, more later), we have found a gaping bug in the implementation of UDP checksums on Solaris.
Most particularly, it appears that UDP hardware checksum offload is broken for the cases where the checksum calculation will result in a 16-bit value of 0. Most protocols (TCP, ICMP, etc.) specify that the value 0 be used for the checksum in this case.
UDP, however, specifies that the value 0xffff be substituted for 0. Why? Because 0 has special meaning. In IPv4 networks, it means that the transmitter did not bother to include a checksum. In IPv6, the checksum is mandatory, and RFC 2460 says that when the receiver sees a packet with a zero checksum it should be discarded.
The problem is, the hardware commonly in use on Sun SPARC systems (hme, eri, ge, and probably also ce and nxge) does not have support for this particular semantic. Furthermore, we have no way to know, in the current spec, if this semantic should be applied (short of directly parsing the packet, which presents its own challenges and hits to performance).
We'll have to figure out how to deal with this particular problem, sometime soonish. My guess is that all Sun NICs will lose IP checksum acceleration (transmit side only) for UDP datagrams, and that those 3rd party products which can do something different will need another flag bit indicating UDP semantics.
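The UDP rule itself is tiny, which is what makes it so easy for a hardware engine to omit. A sketch of the transmit-side fixup RFC 768 requires (the function name is mine, not from the Solaris stack):

```c
#include <stdint.h>

/*
 * If the ones-complement checksum of a UDP datagram computes to 0,
 * transmit 0xffff instead: a zero checksum field means "no checksum
 * supplied" in IPv4, and is invalid in IPv6.  Hardware that applies
 * the generic IP/TCP semantic (send 0 as-is) silently breaks UDP
 * for exactly those datagrams.
 */
static uint16_t
udp_tx_cksum(uint16_t computed)
{
	return ((computed == 0) ? 0xffff : computed);
}
```

Since 0xffff and 0 are equivalent in ones-complement arithmetic, the receiver verifies the substituted value without any special casing.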
Friday, July 27, 2007
nxge and IP forwarding
You may or may not be aware of project Sitara. One of the goals of project Sitara is to fix the handling of small packets.
I have achieved a milestone... using a hacked version of the nxge driver (diffs available on request), I've been able to get UDP forwarding rates as high as 1.3M packets per second (unidirectional) across a single pair of nxge ports, using Sun's next sun4v processor. (That's the number of packets forwarded...) This is very close to line rate for a 1G link. I'm hoping that future enhancements will get us significantly beyond that... maybe as much as 2-3 Mpps per port. Taken as an aggregate, I expect this class of hardware to be able to forward up to 8 Mpps. (Some Sun internal numbers using a microkernel are much higher than that... but then you'd lose all the nice features that the Solaris TCP/IP stack has.)
By the way, it's likely that these results are directly applicable to applications like Asterisk (VoIP), where small UDP packets are heavily used. Hopefully we'll have a putback of the necessary tweaks before too long.
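As a sanity check on "very close to line rate": this is just standard Ethernet framing arithmetic, nothing driver-specific. Each frame carries an 8-byte preamble/SFD and a 12-byte inter-frame gap on the wire:

```c
/*
 * Theoretical maximum packet rate for a given wire speed and frame size.
 * Every frame occupies (frame_bytes + 8 preamble + 12 IFG) * 8 bit times.
 */
static double
line_rate_pps(double bits_per_sec, unsigned frame_bytes)
{
	return (bits_per_sec / ((frame_bytes + 8 + 12) * 8.0));
}
```

For minimum-size (64-byte) frames at 1 Gbps this works out to roughly 1.49 Mpps, so 1.3 Mpps forwarded is within about 13% of the wire's absolute ceiling.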
mpt SAS support on NetBSD
FYI, NetBSD has just gained support for the LSI SAS controllers, such as the one found on the Sun X4200. My patch adding this support was committed last night. (The work was a side project funded by TELES AG.)
Of course we'd much rather everyone ran Solaris on these machines, but if you need NetBSD for some reason, it works now.
Pullups to NetBSD 3 and 3.1 should be forthcoming.
Wednesday, July 25, 2007
hme GLDv3 versus qfe DLPI
So, the NICs group recently told me I should have started with qfe instead of hme, because qfe has some performance fixes. (Such as hardware checksum, which I added to hme!) To find out if this holds water, I ran some tests on my 360 MHz UltraSPARC-IIi system, using a PCI qfe card. (You can make hme bind to the qfe ports by doing
# rem_drv qfe
# update-drv -a -i '"SUNW,qfe"' hme
(This by the way is a nice hack to use GLDv3 features with your qfe cards today if you cannot wait for an official GLDv3 qfe port.)
Anyway, here's what I found out, using my hacked ttcp utility. Note that the times reported are "sys" times.
QFE/DLPI
MTU = 100, -n 2048
Tx: 18.3 Mbps, 7.0s (98%)
Rx: 5.7 Mbps, 2.4s (10%)
MTU = 1500, -n 20480
Tx (v4): 92.1 Mbps, 1.1s (8%)
Rx (v4): 92.2 Mbps, 1.6s (12%)
Tx (v6): 91.2 Mbps, 1.1s (8%)
Rx (v6): 90.9 Mbps, 2.6s (22%)
UDPv4 tx, 1500 (-n 20480): 90.5 Mbps, 1.6s (64%)
UDPv4 tx, 128 (-n 204800): 34.2 Mbps, 5.2s (99%)
UDPv4 tx, 64 (-n 204800): 17.4 Mbps, 5.1s (99%)
And here are the numbers for hme with GLDv3
HME GLDv3
MTU = 100, -n 2048
Tx: 16.0 Mbps, 7.6s (93%)
Rx: 11.6 Mbps, 1.8s (16%)
MTU = 1500, -n 20480
Tx (v4): 92.1 Mbps, 1.2s (8%)
Rx (v4): 92.2 Mbps, 3.2s (24%)
Tx (v6): 90.8 Mbps, 0.8s (6%)
Rx (v6): 91.2 Mbps, 4.0s (29%)
UDPv4 tx, 1500 (-n 20480): 89.7 Mbps, 1.5s (60%)
UDPv4 tx, 128 (-n 204800): 29.4 Mbps, 6.0s (99%)
UDPv4 tx, 64 (-n 204800): 14.8 Mbps, 6.0s (99%)
So, given these numbers, it appears that either qfe is genuinely more efficient (which is possible, but I'm slightly skeptical), or the extra overhead of some of the GLDv3 support is hurting us. I'm more inclined to believe the latter. (For example, we have to check whether each packet is VLAN tagged... those features don't come for free... :-)
What is really interesting is that the hme GLDv3 work was about 3% better than the old DLPI hme, so clearly more effort has been invested into qfe than into hme.
Interestingly enough, the performance for Rx tiny packets with GLDv3 is better. I am starting to wonder if there is a difference in the bcopy/dvma thresholds.
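The per-packet VLAN check mentioned above is a good example of GLDv3's small fixed costs. A minimal sketch of what such a test amounts to (illustrative only, not the actual GLDv3 code):

```c
#include <stdint.h>

#define ETHERTYPE_VLAN	0x8100	/* 802.1Q tag protocol identifier */

/*
 * A VLAN-aware path has to peek at the ethertype field, at offset 12
 * of every Ethernet header, to decide whether a 4-byte 802.1Q tag
 * follows.  It's a couple of loads and a compare, but it runs on
 * every single packet, tagged or not.
 */
static int
is_vlan_tagged(const uint8_t *pkt)
{
	uint16_t etype = (uint16_t)((pkt[12] << 8) | pkt[13]);

	return (etype == ETHERTYPE_VLAN);
}
```

Multiply that by hundreds of thousands of packets per second on a 360 MHz CPU and a measurable gap with the leaner DLPI path isn't surprising.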
So one of the questions the C-Team has to answer is: how important are these relatively minor differences in performance? On a faster machine, you'd be unlikely to notice them at all. If this performance becomes a gating factor, I might find it difficult to putback the qfe GLDv3 conversion.
To be completely honest, tracking down the 1-2% difference in performance may not be worthwhile. I'd far rather work on fixing 1-2% gains in the stack than worry about how a certain legacy driver performs.
What are your thoughts? Let me know!