Friday, April 20, 2007

GLDv3 experiences

I've just finished (still testing!) my port of eri to GLDv3. Between that and eri, and looking at existing GLDv3 drivers (bge, rge, e1000g), I think I have gathered some operational experience that I hope we can use to improve Nemo. (So, anyone who says my time spent on converting eri was wasted is wrong... because if nothing else it gained me some more operational experience with GLDv3.)

Executive summary of the takeaways I have gotten so far, that I think are worth noting:

  • There is still a lot of code duplicated across even GLDv3 drivers (more below)
  • Lock management is so much simplified
  • GLDv3 kstats need "work"
  • we really, really need Brussels... it can't come soon enough.
  • some drivers can probably be changed internally to work even better with GLDv3 than a naive port

So here's the detailed stuff.

  1. code duplication

    The duplicated code falls into three major areas: ioctls (mostly ndd(1M) and loopback handling for SunVTS), kstats, and MII. For now I want to focus on the MII bit. It turns out that pretty much every Ethernet device on the planet talks to a transceiver (whether integrated into the same chip as the MAC controller or not) using MII/GMII. We have tons of logic surrounding MII and GMII replicated across each driver, and frequently the decisions made by one driver are different from those made by another.

    There exists an old i386 driver called mii, which was an abortive attempt to create a common module/framework for MII and PHY handling. (Only used by the obsolete dnet driver at present.) I think this should be revived. It's been shown to work well for BSD Unix (at least NetBSD, but I'm pretty sure all of them), and it would really help simplify a lot of code. The eri driver, for example, probably has a couple thousand lines of MII-related auto-negotiation logic in it.

    And of course, each of these negotiation frameworks takes a slightly different set of tunables and configuration parameters, exports different statistics, etc.
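
    To give a feel for the kind of logic a common MII module could own once, instead of every driver carrying its own copy, here is a minimal sketch of resolving the link mode from the local advertisement (ANAR) and the link partner's abilities (ANLPAR). The bit values are the standard 802.3 ability bits for 10/100 links; the mii_link_t type and the mii_resolve() name are made up for illustration and are not taken from the old mii driver or from eri.

```c
#include <stdint.h>

/*
 * Technology ability bits shared by the MII auto-negotiation advertisement
 * register (ANAR, register 4) and the link partner ability register
 * (ANLPAR, register 5).
 */
#define MII_ABILITY_10_HDX   0x0020
#define MII_ABILITY_10_FDX   0x0040
#define MII_ABILITY_100_HDX  0x0080
#define MII_ABILITY_100_FDX  0x0100
#define MII_ABILITY_100_T4   0x0200

/* Hypothetical result type; not part of any existing driver. */
typedef struct mii_link {
    int speed_mbps;    /* 0 means no common mode was found */
    int full_duplex;   /* 1 for full duplex, 0 for half */
} mii_link_t;

/*
 * Pick the best mode both ends advertise, in the usual priority order:
 * 100 full, 100BASE-T4, 100 half, 10 full, 10 half.
 */
mii_link_t
mii_resolve(uint16_t anar, uint16_t anlpar)
{
    uint16_t common = anar & anlpar;
    mii_link_t link = { 0, 0 };

    if (common & MII_ABILITY_100_FDX) {
        link.speed_mbps = 100;
        link.full_duplex = 1;
    } else if (common & (MII_ABILITY_100_T4 | MII_ABILITY_100_HDX)) {
        link.speed_mbps = 100;
    } else if (common & MII_ABILITY_10_FDX) {
        link.speed_mbps = 10;
        link.full_duplex = 1;
    } else if (common & MII_ABILITY_10_HDX) {
        link.speed_mbps = 10;
    }
    return (link);
}
```

    A driver would read the two registers from its PHY and call mii_resolve(anar, anlpar); everything else (register access, timers, gigabit modes, tunables) is left out of this sketch.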

  2. Lock management is so much simplified

    It's really easy to write a GLDv3 driver that doesn't hold locks across GLDv3 routines. I suspect a lot of deadlocks/hangs/panics are going to be solved by moving drivers to GLDv3. (Of course, we've seen locking problems higher in the stack as a result... see recent deadlocks in dls, etc. But we only need to solve those once with GLDv3. Yay.)

  3. The kstat framework for GLDv3 is just plain broken.

    There are several problems here.

    • All kstats for a media type are included, regardless of whether or not they make sense for a specific device. For example, cap_rem_fault is not yet supported by most of the drivers, yet when the driver doesn't have support for it in mac_stat(), the statistic is still included in kstat output as 0. However, pretty much any system with an 802.3u compliant MII does in fact support the rem_fault MII field. So in this case, just because the driver isn't exporting the stat, the framework is creating an outright lie. This is probably true of other stats as well. For example, if hardware isn't prepared to report runt_errors, then it doesn't make sense to claim that value as "zero"... because you might be flooding the device with bad packets, which just get dropped on the floor (perhaps getting accounted in some other, less granular "BadPackets" counter or somesuch). Better to say nothing than to tell a lie, IMO.

    • kstats are normally "snapshotted", so that you can take a snapshot of all stats in time at once. This is common with some hardware devices, too. Getting these stats may be expensive though. (For example, reclaiming transmit buffers so you can collect transmit status, acquiring locks, etc. With some devices you might even have to do an expensive collection effort that would normally cv_wait for an interrupt.) Having to go through this several times (once for each stat collected) for a single snapshot is ... inefficient. It would be nice to add a mac_stat_update() entry point, which is separate from the mac_stat() entry point. (Even better, also add a mac_stat_done() to release any resources acquired by the first call.) The good news, I think, is that hopefully we aren't going to have to support DLPI DL_GET_STATISTICS_REQ, so it should be safe to cv_wait in mac_stat() related calls now (unlike with older GLDv2). We aren't supporting the DLPI statistics calls, are we? Please say we aren't....

    • If the driver wants to export any additional driver-specific statistics, it has to do the whole kstat dance itself, in addition to the nemo mac_stat() entry point. Let's try to find a way for drivers to export/register additional driver-specific kstats within the existing nemo framework, please?

    • Duplication. E.g. for bge, there is a "bge0" kstat, created by dls, as well as a "mac" kstat created by the mac module. Both of these will have some common counters, like ipackets64, brdcstxmt, etc. What's worse, one stat in particular, "unknowns", is counted by the dls framework in the "bge0" stat, but is not counted by the "mac" stat. This can lead to confusion. The duplication also makes the snapshot problem already mentioned worse, since it appears that most of the stats are generated just by calling mac_stat() a second time for the same values already recorded in the "mac" kstat.

    • Inadequate list of kstats in the default set. I found several kstats that were missing. Several of them are getting fixed as a result of PSARC 2007/220, but I've since found a few others. E.g. Ethernet devices commonly can detect "jabber timeouts". These should be reported somehow. Also, stats about network-related interrupts are really important, and aren't included by default. I consider this a significant shortcoming. I guess devices should register a KSTAT_TYPE_INTR kstat, but approximately none of them do today.

    • Stat cleanups in drivers. This is mostly a driver-specific problem, but look at the kstat output on bge and e1000g, and you'll see what I'm talking about. There is a total lack of consistency here.
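
    To make the snapshot idea from the list above concrete, here is a rough sketch of how a driver might split the work: one expensive collection pass in an update call, cheap per-stat reads from the snapshot, and a done call to release it. It also shows a getter that reports "not supported" rather than a misleading zero. Everything here (the xdev_* names, the stat identifiers, the -1 return convention) is hypothetical; it is not the existing Nemo interface, just one shape the proposed mac_stat_update()/mac_stat_done() pair could take.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stat identifiers; the real GLDv3 set is much larger. */
enum { XSTAT_IPACKETS, XSTAT_OPACKETS, XSTAT_RUNT_ERRORS, XSTAT_COUNT };

/* Hypothetical soft state for a made-up driver. */
typedef struct xdev {
    uint64_t hw_ipackets;        /* stand-ins for hardware counters */
    uint64_t hw_opackets;
    uint64_t snap[XSTAT_COUNT];  /* values captured by the last update */
    uint32_t supported;          /* bit n set if stat n is implemented */
    int      snap_active;        /* nonzero between update and done */
    int      collections;        /* number of expensive hardware passes */
} xdev_t;

/* Sketch of a mac_stat_update(): one expensive pass per snapshot. */
void
xdev_stat_update(xdev_t *dp)
{
    assert(dp != NULL && !dp->snap_active);
    /* A real driver would take locks, reclaim buffers, etc. here. */
    dp->snap[XSTAT_IPACKETS] = dp->hw_ipackets;
    dp->snap[XSTAT_OPACKETS] = dp->hw_opackets;
    dp->collections++;
    dp->snap_active = 1;
}

/*
 * Sketch of a per-stat getter: returns 0 and fills *val from the snapshot,
 * or -1 if no snapshot is active or the device does not implement the stat.
 */
int
xdev_stat_get(const xdev_t *dp, int stat, uint64_t *val)
{
    if (!dp->snap_active || stat < 0 || stat >= XSTAT_COUNT)
        return (-1);
    if (!(dp->supported & (1u << stat)))
        return (-1);
    *val = dp->snap[stat];
    return (0);
}

/* Sketch of a mac_stat_done(): release whatever update acquired. */
void
xdev_stat_done(xdev_t *dp)
{
    dp->snap_active = 0;
}
```

    With this split, the framework could call update once, read every stat it wants, skip the ones that come back unsupported, and then call done, so the expensive collection happens once per snapshot rather than once per counter.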

  4. We really need Brussels.

    From the above, you see the problems with kstats. There are similar problems with NDD. The amount of code scattered around different drivers trying to figure out NIC tuning is boggling. And most of it isn't what you'd call "sterling examples of quality". The eri driver was full of some really, really fragile code in this area. (Deleting one tunable ... the instance ndd parameter... required updating no fewer than 4 different locations in the driver. And they weren't conveniently co-located.)

    Interpretation of values, handling, all of it is terribly replicated across so many drivers. I can't wait to eradicate this crufty, horrid code, and replace it with something nice and sane from Brussels.

  5. Some drivers can change internally to work even better with GLDv3.

    In eri, for example, I think we can be smart on the transmit side, so that, for example, when a group of mblks comes down, we don't kick the hardware and resync the descriptor rings until all the packets are queued for transmit. This would help amortize some per-packet expenses across multiple packets.
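
    A minimal sketch of that batching idea, assuming packets arrive as a chain linked through b_next and that the caller wants back whatever could not be queued. The pkt_t and txring_t types and all function names are stand-ins invented for this example; this is not eri code, and a real driver would also unlink each packet, fill descriptors, and sync DMA memory.

```c
#include <stddef.h>

/* Hypothetical stand-in for an mblk; only the b_next link matters here. */
typedef struct pkt {
    struct pkt *b_next;
    size_t      len;
} pkt_t;

#define TX_RING_SIZE 4

/* Hypothetical transmit ring for a made-up device. */
typedef struct txring {
    const pkt_t *slot[TX_RING_SIZE];
    int          used;    /* descriptors currently occupied */
    int          kicks;   /* times the hardware was poked */
} txring_t;

/* Place one packet in the ring; returns 0 if the ring is full. */
static int
tx_enqueue(txring_t *rp, const pkt_t *p)
{
    if (rp->used == TX_RING_SIZE)
        return (0);
    rp->slot[rp->used++] = p;
    return (1);
}

/* Stand-in for the expensive part: ring sync plus a register write. */
static void
tx_kick(txring_t *rp)
{
    rp->kicks++;
}

/*
 * Queue as much of the chain as fits, then kick the hardware once.
 * Returns the first packet that was not queued, or NULL if all were.
 */
pkt_t *
tx_chain(txring_t *rp, pkt_t *chain)
{
    int queued = 0;

    while (chain != NULL && tx_enqueue(rp, chain)) {
        chain = chain->b_next;
        queued++;
    }
    if (queued > 0)
        tx_kick(rp);
    return (chain);
}
```

    The point is simply that tx_kick() runs once per chain rather than once per packet, which is where the per-packet cost gets amortized.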

    Other drivers can benefit from multiaddress support. dmfe falls into that category.

    That said, my approach so far has been the naive conversion. I'd like to revisit a few of them to enhance them to take advantage of the superior design in GLDv3, but first I want to get them put back.


James said...

Any idea when it'll be safe for 3rd party nic driver developers to deliver gldv3 drivers?

Garrett D'Amore said...

Not yet. There are certainly some further changes to GLDv3 coming, in particular to enable the IP stack to "poll" the driver for received packets (check out the crossbow gate). Then there are changes for Brussels coming, as well.

Stay tuned, is all I can say right now.