As predicted, the area of biggest risk in my conversion of eri to GLDv3 was in fact the kstat handling. However, I appear to have that all worked out now, and the binary is working flawlessly on my SunBlade 100. Even suspend/resume works fine. I've not yet integrated this code properly into a workspace to generate a webrev, but I will do so soon. (Probably tomorrow... I'd like to get my two other RTIs put back first.)
One of the biggest concerns about this effort was the added risk that doing this conversion might bring to the "stable" eri driver. So, I'm asking the community for help. If you want to help out with testing, especially if you have higher end systems or want to do some benchmark comparisons, please let me know.
(I don't have specific test suites to give out at this time... it's of more value frankly to have people using their own tests right now; that way we get broader test coverage than we might with a single test suite.)
Please let me know. Thanks! (Oh yeah, if you have an eri you want to try with new GLDv3-based 802.3ad link aggregation features, I'd be game for that, too!)
(PS. An obvious consequence of this effort is that it will be easy to do the work to convert hme, gem, and qfe, which share a lot of common heritage with the eri driver. So, maybe there is yet hope for those, as well.)
Sunday, April 22, 2007
Friday, April 20, 2007
GLDv3 experiences
I've just finished (still testing!) my port of eri to GLDv3. Between that and eri, and looking at existing GLDv3 drivers (bge, rge, e1000g), I think I have gathered some operational experience that I hope we can use to improve Nemo. (So, anyone who says my time spent on converting eri was wasted is wrong... because if nothing else it gained some more operational experience with GLDv3.)
Executive summary of the takeaways I have gotten so far, that I think are worth noting:
- There is still a lot of code duplicated across even GLDv3 drivers (more below)
- Lock management is so much simplified
- GLDv3 kstats need "work"
- we really, really need Brussels... it can't come soon enough.
- some drivers can probably be changed internally to work even better with GLDv3 than a naive port
So here's the detailed stuff.
- code duplication
The duplicated code falls into three major areas. ioctls (mostly ndd(1M) and loopback handling for SunVTS), kstats, and MII. For now I want to focus on the MII bit. It turns out that pretty much every Ethernet device on the planet talks to a transceiver (whether integrated into the same chip as the MAC controller or not) using MII/GMII. We have tons of logic surrounding MII and GMII replicated across each driver, and frequently the decisions made by one driver are different than those in another.
There exists an old i386 driver called mii, which was an abortive attempt to create a common module/framework for MII and PHY handling. (Only used by the obsolete dnet driver at present.) I think this should be revived. It's been shown to work well for BSD Unix (at least NetBSD, but I'm pretty sure all of them), and it would really help simplify a lot of code. The eri driver, for example, probably has a couple thousand lines of MII-related auto-negotiation logic in it.
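To give a feel for the kind of logic every driver currently re-implements, here is a minimal sketch in C of the mode-resolution step that a shared MII module could own. The bit values are the standard technology-ability bits of MII registers 4 and 5 for 10/100 devices; the macro names, link_mode_t, and mii_resolve() are made up for this illustration and are not taken from the old mii driver or any existing Solaris code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Technology ability bits as laid out in the MII advertisement
 * register (reg 4) and link partner ability register (reg 5).
 * Macro names are hypothetical.
 */
#define	MII_ABILITY_10_HDX	0x0020
#define	MII_ABILITY_10_FDX	0x0040
#define	MII_ABILITY_100_HDX	0x0080
#define	MII_ABILITY_100_FDX	0x0100
#define	MII_ABILITY_100_T4	0x0200

typedef struct link_mode {	/* hypothetical result type */
	int	speed_mbps;
	int	full_duplex;
} link_mode_t;

/*
 * Pick the best mode both ends advertise, highest priority first.
 * Returns 1 and fills *out if a common mode exists, else 0.
 */
int
mii_resolve(uint16_t advertised, uint16_t partner, link_mode_t *out)
{
	static const struct {
		uint16_t bit;
		int speed;
		int fdx;
	} prio[] = {
		{ MII_ABILITY_100_FDX, 100, 1 },
		{ MII_ABILITY_100_T4,  100, 0 },
		{ MII_ABILITY_100_HDX, 100, 0 },
		{ MII_ABILITY_10_FDX,   10, 1 },
		{ MII_ABILITY_10_HDX,   10, 0 },
	};
	uint16_t common = advertised & partner;
	size_t i;

	assert(out != NULL);
	for (i = 0; i < sizeof (prio) / sizeof (prio[0]); i++) {
		if (common & prio[i].bit) {
			out->speed_mbps = prio[i].speed;
			out->full_duplex = prio[i].fdx;
			return (1);
		}
	}
	return (0);
}
```

A driver would read the two registers from its PHY and hand them to one routine like this, instead of carrying its own copy of the priority table and its own opinions about tie-breaking.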
And of course, each of these negotiation frameworks takes a slightly different set of tunables and configuration parameters, exports different statistics, etc.
- Lock management is so much simplified
It's really easy to write a GLDv3 driver that doesn't hold locks across GLDv3 routines. I suspect a lot of deadlocks/hangs/panics are going to be solved by moving drivers to GLDv3. (Of course, we've seen locking problems higher in the stack as a result... see recent deadlocks in dls, etc. But we only need to solve those once with GLDv3. Yay.)
- The kstat framework for GLDv3 is just plain broken.
There are several problems here.
- All kstats for a media type are included, regardless of whether or not they make sense for a specific device. For example, cap_rem_fault is not yet supported by most of the drivers, yet when the driver doesn't handle it in mac_stat(), the statistic is still included in kstat output as 0. However, pretty much any system with an 802.3u compliant MII does in fact support the rem_fault MII field. So in this case, just because the driver isn't exporting the stat, the framework is creating an outright lie. This is probably true of other stats as well. For example, if hardware isn't prepared to report runt_errors, then it doesn't make sense to claim that value as "zero".... because you might be flooding the device with bad packets, which just get dropped on the floor (perhaps getting accounted in some other, less granular "BadPackets" counter or somesuch.) Better to say nothing than to tell a lie, IMO.
- kstats are normally "snapshotted", so that you can take a snapshot of all stats at a single point in time. This is common with some hardware devices, too. Getting these stats may be expensive though. (For example, reclaiming transmit buffers so you can collect transmit status, acquiring locks, etc. With some devices you might even have to do an expensive collection effort that would normally cv_wait for an interrupt.) Having to go through this several times (once for each stat collected) for a single snapshot is ... inefficient. It would be nice to add a mac_stat_update() entry point, which is separate from the mac_stat() entry point. (Even better, also add a mac_stat_done() to release any resources acquired by the first call.) The good news, I think, is that hopefully we aren't going to have to support DLPI DL_GET_STATISTICS_REQ, so it should be safe to cv_wait in mac_stat() related calls now (unlike with older GLDv2.) We aren't supporting the DLPI statistics calls, are we? Please say we aren't....
- If the driver wants to export any additional driver-specific statistics, it has to do the whole kstat dance itself, in addition to the Nemo mac_stat() entry point. Let's try to find a way for drivers to export/register additional driver-specific kstats within the existing Nemo framework, please?
- Duplication. E.g. for bge, there is a "bge0" kstat, created by dls, as well as a "mac" kstat created by the mac module. Both of these will have some common counters, like ipackets64, brdcstxmt, etc. What's worse, one stat in particular, "unknowns" is counted by the dls framework in the "bge0" stat, but is not counted by the "mac" stat. This can lead to confusion. The duplication also makes worse the snapshot problem already mentioned, since it appears that most of the stats are generated just by calling the mac_stat() a second time for the same values already recorded in the "mac" kstat.
- Inadequate list of kstats in the default set. I found several kstats that were missing. Several of them are getting fixed as a result of PSARC 2007/220, but I've since found a few others. E.g. Ethernet devices commonly can detect "jabber timeouts". These should be reported somehow. Also, stats about network related interrupts are really important, and aren't included by default. I consider this a significant shortcoming. I guess devices should register a KSTAT_TYPE_INTR kstat, but approximately none of them do today.
- Stat cleanups in drivers. This is mostly a driver-specific problem, but look at the kstat output on bge and e1000g, and see what I'm talking about. There is a total lack of consistency here.
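Two of the complaints above (reporting "0" for stats the device can't measure, and repeating the expensive collection once per stat) can be illustrated together. What follows is only a sketch under my own assumptions: the update/get split loosely mirrors the mac_stat_update() idea, but every name here (stat_snapshot_t, fake_hw_t, stat_update(), stat_get(), the STAT_* ids) is hypothetical and none of it is the actual Nemo interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stat ids; a real driver would use the framework's. */
enum { STAT_IPACKETS, STAT_OPACKETS, STAT_RUNT_ERRORS, STAT_COUNT };

/* Stand-in for device counters that are costly to collect. */
typedef struct fake_hw {
	uint64_t rx_frames;
	uint64_t tx_frames;
} fake_hw_t;

typedef struct stat_snapshot {
	uint32_t supported;		/* bit i set: device reports stat i */
	uint64_t value[STAT_COUNT];
	unsigned hw_reads;		/* how many expensive passes we made */
} stat_snapshot_t;

/* The expensive part: one pass over the hardware fills the snapshot. */
void
stat_update(stat_snapshot_t *snap, const fake_hw_t *hw)
{
	assert(snap != NULL && hw != NULL);
	snap->value[STAT_IPACKETS] = hw->rx_frames;
	snap->value[STAT_OPACKETS] = hw->tx_frames;
	snap->hw_reads++;
}

/*
 * The cheap part: read one value out of the snapshot.
 * Returns 0 on success, or -1 if the device cannot report this
 * stat, so the caller can omit it instead of publishing a zero.
 */
int
stat_get(const stat_snapshot_t *snap, int id, uint64_t *out)
{
	assert(snap != NULL && out != NULL);
	if (id < 0 || id >= STAT_COUNT ||
	    !(snap->supported & (1u << id)))
		return (-1);
	*out = snap->value[id];
	return (0);
}
```

With this shape, a framework could call stat_update() once per kstat snapshot, call stat_get() for every id, and simply leave out any stat for which the driver says "not supported".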
- We really need Brussels.
From the above, you see the problems with kstats. There are similar problems with NDD. The amount of code scattered around different drivers trying to figure out NIC tuning is boggling. And most of it isn't what you'd call "sterling examples of quality". The eri driver was full of some really, really fragile code in this area. (Deleting one tunable ... the instance ndd parameter... required updating no fewer than 4 different locations in the driver. And they weren't conveniently co-located.)
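For contrast, here is a rough sketch of what "conveniently co-located" could look like: every tunable lives in one table row, so removing a parameter means deleting one line. This is purely my illustration and says nothing about how Brussels will actually work; the type, the functions, and the parameter names in the table are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct param_def {	/* hypothetical tunable descriptor */
	const char *name;
	int min;
	int max;
	int value;
	int writable;
} param_def_t;

/* Illustrative entries only; not the real eri parameter list. */
static param_def_t params[] = {
	{ "adv_100fdx_cap", 0,   1, 1, 1 },
	{ "adv_10hdx_cap",  0,   1, 1, 1 },
	{ "link_speed",     0, 100, 0, 0 },	/* read-only status */
};

static param_def_t *
param_find(const char *name)
{
	size_t i;

	assert(name != NULL);
	for (i = 0; i < sizeof (params) / sizeof (params[0]); i++) {
		if (strcmp(params[i].name, name) == 0)
			return (&params[i]);
	}
	return (NULL);
}

/* Returns 0 on success, -1 if unknown. */
int
param_get(const char *name, int *out)
{
	param_def_t *p = param_find(name);

	assert(out != NULL);
	if (p == NULL)
		return (-1);
	*out = p->value;
	return (0);
}

/* Returns 0 on success, -1 if unknown, read-only, or out of range. */
int
param_set(const char *name, int value)
{
	param_def_t *p = param_find(name);

	if (p == NULL || !p->writable || value < p->min || value > p->max)
		return (-1);
	p->value = value;
	return (0);
}
```

Range checking and read-only handling are done once, in param_set(), rather than being re-interpreted by each driver.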
Interpretation of values, handling, all of it is terribly replicated across so many drivers. I can't wait to eradicate this crufty, horrid code, and replace it with something nice and sane from Brussels.
- Some drivers can change internally to work even better with GLDv3.
In eri, for example, I think we can be smart on the transmit side, so that, for example, when a group of mblks comes down, we don't kick the hardware and resync the descriptor rings until all the packets are queued for transmit. This would help amortize some per-packet expenses across multiple packets.
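Roughly, the idea looks like the following sketch. It is a user-space model, not eri code: pkt_t stands in for a chain of mblks, tx_ring_t and the ring_* helpers are hypothetical, and the "kick" is just a counter where a real driver would sync the descriptors and write a doorbell register.

```c
#include <assert.h>
#include <stddef.h>

typedef struct pkt {		/* stand-in for a chained mblk */
	struct pkt *next;
	size_t len;
} pkt_t;

typedef struct tx_ring {	/* hypothetical ring state */
	unsigned free_slots;
	unsigned queued;
	unsigned kicks;		/* times the hardware was poked */
} tx_ring_t;

/* Place one packet on the ring; -1 if the ring is full. */
static int
ring_enqueue(tx_ring_t *r, const pkt_t *p)
{
	(void) p;
	if (r->free_slots == 0)
		return (-1);
	r->free_slots--;
	r->queued++;
	return (0);
}

/* Stand-in for the descriptor sync plus doorbell write. */
static void
ring_kick(tx_ring_t *r)
{
	r->kicks++;
}

/*
 * Queue as much of the chain as fits, then kick the hardware once.
 * Returns the unsent remainder of the chain, or NULL if all of it
 * was queued.
 */
pkt_t *
tx_chain(tx_ring_t *r, pkt_t *chain)
{
	unsigned sent = 0;

	assert(r != NULL);
	while (chain != NULL && ring_enqueue(r, chain) == 0) {
		chain = chain->next;
		sent++;
	}
	if (sent > 0)
		ring_kick(r);
	return (chain);
}
```

The per-packet loop only touches descriptors; the expensive step happens once per chain, and not at all if nothing could be queued.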
Other drivers can benefit from multiaddress support. dmfe falls into that category.
That said, my approach so far has been the naive conversion. I'd like to revisit a few of them to enhance them to take advantage of the superior design in GLDv3, but first I want to get them put back.
Wednesday, April 18, 2007
dmfe crossbow conversion
In case you ever wondered what it takes to convert a "simple" GLDv2 driver to Nemo, have a look at the webrev I posted earlier today.
I'm hoping that this work will get integrated soon. As an upshot, dmfe with this change "just works" with dladm show-dev.
report from the battery team
I'm now a member of the "battery team". I had a very productive con-call with the folks involved, and I think we are soon going to have a better common framework for battery APIs in the kernel so that SPARC systems can also take advantage of the GNOME battery applet. Watch this space!
afe integration webrev posted
For the curious, I've posted a webrev containing the changes required to integrate afe into Nevada.
The driver includes changes from the stock AFE driver for Solaris, including some lint fixes, and changes to use the stock Solaris sys/miireg.h.
I'd love to make more changes to this driver, but at the moment I don't want to cause a test reset. Once the driver is integrated, I have a bunch more improvements coming... Nemo, multiple mac address support, VLAN support, link notification support (needed for NWAM), as well as code reduction by using some features that are now part of stock Solaris (like the common MII framework!)
Thursday, April 12, 2007
Tadpole SPARCLE support putback
Core support for SPARCLE was just putback! I'm getting ready to post an initial tadpmu for public review soon, as well. This should make you SPARCLE/Sun Ultra 3 owners out there happy.
Wednesday, April 11, 2007
Not All Broadcom GigE's are Equal
Recently, I posted a blog entry where I described that "Not All GigE Are Equal", strongly advocating the use of Broadcom GigE devices when faced with a choice.
However, after spending time in the code, I've discovered that there is quite a range of differences amongst Broadcom gigE devices.
I had considered listing a full table of them, but it seems that would be a bit onerous. Take a look at usr/src/uts/common/io/bge/bge_chip2.c if you want to find out the gory details. But in the meantime, here are my recommendations:
If you have PCI or PCI-X: Choose a bcm5704 if you can. It has pretty much full feature support, but you need to pick a recent revision (newer than A0.) Look for pci ids of pci14e4,1646, pci14e4,16a8, or pci14e4,1649. These chips all support PCI-X, full checksum offload, and multiple hardware tx and rx rings.
If you have PCIe: As far as I can tell, all of the PCIe chips that are Solaris-supported lack support for multiple hardware tx/rx rings. This is really unfortunate, as it will have a negative impact on Crossbow benefits. But apart from that, it looks like the 5714 and 5715 series are your best bet. They both support jumbo frames, and they both have full checksum offload support. Look for pci ids of pci14e4,1668, pci14e4,1669, pci14e4,1678, or pci14e4,1679.
What this really says is that if you have to choose between a PCI-X card and a PCIe card, surprisingly, choose the PCI-X card (if you can get a 5704). Save your PCIe for framebuffers or HBAs. (Or, better, 10G cards like Neptune.)