Friday, May 25, 2007

GLDv2->GLDv3 conversion notes

I was recently asked to provide some notes about GLDv2 to GLDv3 conversion for NIC drivers. Here's a rough draft of them. (This is cut-n-paste from mail I sent to an intern at Sun... I'm posting them here so the knowledge isn't lost.)

It is really helpful if you don't try to implement 100% of the features of GLDv3 in the first pass. (Some of the existing GLDv3 drivers, such as rge and nge, have incorrectly provided stubs for some of these functions, so don't use those as references.) Specifically, I would not attempt to implement mac_resource allocation (MC_RESOURCES, etc.) or multiaddress support.

You really do need to support full-size VLAN frames (so that a tagged frame can still carry a full MTU) if your hardware can do it. Almost all NICs can. Sometimes the code to do it isn't in any Solaris driver. My preferred reference for alternate code sources is the NetBSD tree, which has an OpenGrok server at http://opengrok.netbsd.org/ for browsing its code. Ask me if you have any questions about VLANs. It helps if you have a switch where you can test VLAN frames.
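
To make the frame-size point concrete: an 802.1Q tag adds 4 bytes to the Ethernet header, so a chip whose receive-length limit is sized for untagged frames will typically reject full 1500-byte MTU frames once they are tagged. This sketch only does the arithmetic. The macro and function names are my own, not from any Solaris header, and whether the hardware counts the 4-byte FCS toward its limit varies by chip, so check the datasheet.

```c
#include <assert.h>
#include <stddef.h>

#define ETHER_HDR_BYTES  14  /* destination + source address + type */
#define VLAN_TAG_BYTES    4  /* 802.1Q TPID + TCI */
#define ETHER_FCS_BYTES   4  /* trailing CRC-32 */

/*
 * Largest frame the receiver must accept for a given MTU.
 * count_fcs: set if this chip's length limit includes the CRC.
 */
static size_t
max_frame_bytes(size_t mtu, int vlan_tagged, int count_fcs)
{
    assert(mtu > 0);
    return (mtu + ETHER_HDR_BYTES +
        (vlan_tagged ? VLAN_TAG_BYTES : 0) +
        (count_fcs ? ETHER_FCS_BYTES : 0));
}
```

For a standard 1500-byte MTU that works out to 1514 bytes untagged, 1518 tagged, and 1522 tagged with the FCS counted.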

With GLDv3, there is no reset entry point, so you have to figure out how to put that logic in attach (for both DDI_ATTACH and DDI_RESUME!) or in the mac_start() function.

With GLDv3 the stats are quite different. Pay attention, and look at the headers to figure it out.

Many GLDv2 drivers don't make any mac_link_update() calls (or anything equivalent). You should add them; NWAM, IPMP, and correct kstat reporting all depend on them.

You should rip out any attempt to log to the console on link up, down, or carrier errors. See PSARC 2007/298 for details.

GLDv3 wants to operate on mblk_t's that are chained by b_next. Often you can reuse the old per-packet functions, calling them from a new function that just walks down the list (transmit) or builds it up (receive).
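
On the transmit side, the GLDv3 tx entry point receives a b_next chain and returns whatever it could not send (NULL if everything went out); the framework holds the remainder until the driver calls mac_tx_update(). A userspace sketch of such a wrapper, assuming a hypothetical tx_send_one() playing the part of the old GLDv2 per-packet send routine and a cut-down mblk with only the b_next link:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for mblk_t: only the b_next packet link is modeled. */
typedef struct tx_mblk {
    struct tx_mblk *b_next;
    int id;
} tx_mblk_t;

/* Hypothetical ring state: free descriptors and packets sent so far. */
struct tx_ring {
    int free_desc;
    int sent;
};

/* Plays the old per-packet send: 0 on success, -1 if the ring is full. */
static int
tx_send_one(struct tx_ring *r, tx_mblk_t *mp)
{
    (void) mp;
    if (r->free_desc == 0)
        return (-1);  /* packet not consumed; caller keeps it */
    r->free_desc--;
    r->sent++;
    return (0);
}

/*
 * GLDv3-style wrapper: walk the b_next chain and, on failure, hand the
 * unsent remainder back to the caller. NULL means everything was sent.
 */
static tx_mblk_t *
tx_m_tx(struct tx_ring *r, tx_mblk_t *mp)
{
    while (mp != NULL) {
        tx_mblk_t *next = mp->b_next;

        mp->b_next = NULL;  /* send each packet unlinked */
        if (tx_send_one(r, mp) != 0) {
            mp->b_next = next;  /* relink and return the rest */
            return (mp);
        }
        mp = next;
    }
    return (NULL);
}
```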

Pay careful attention to locking. Try not to call GLDv3 functions with locks held. (I have been pressing to allow mac_tx_update() and mac_link_update() to be called with driver locks held. Right now it's safe, but I can't seem to get a promise from the Nemo group yet.)
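
A common way to keep those upcalls out from under the driver's locks is to decide what to do while holding the lock, record the decision in a local variable, drop the lock, and only then call up. Sketched in userspace with pthreads standing in for kmutex_t; ls_tx_update() is a hypothetical stand-in for mac_tx_update() that asserts the driver lock is free when it is called.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical driver state; a real driver would use kmutex_t. */
struct ls_softc {
    pthread_mutex_t lock;
    int free_desc;
    int want_resched;  /* set when tx had to return packets */
    int updates;       /* upcalls made to the framework */
};

/* Stand-in for mac_tx_update(); checks no driver lock is held. */
static void
ls_tx_update(struct ls_softc *sc)
{
    int rc = pthread_mutex_trylock(&sc->lock);

    assert(rc == 0);  /* EBUSY here would mean the lock is held */
    if (rc == 0)
        (void) pthread_mutex_unlock(&sc->lock);
    sc->updates++;
}

/* tx path: ring full, packets handed back, ask to be rescheduled. */
static void
ls_tx_stalled(struct ls_softc *sc)
{
    pthread_mutex_lock(&sc->lock);
    sc->want_resched = 1;
    pthread_mutex_unlock(&sc->lock);
}

/* tx-complete interrupt: reclaim under the lock, notify after dropping it. */
static void
ls_reclaim(struct ls_softc *sc, int ndone)
{
    int notify = 0;

    pthread_mutex_lock(&sc->lock);
    sc->free_desc += ndone;
    if (sc->want_resched && sc->free_desc > 0) {
        sc->want_resched = 0;
        notify = 1;
    }
    pthread_mutex_unlock(&sc->lock);

    if (notify)
        ls_tx_update(sc);  /* upcall made with no driver lock held */
}
```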

You see all those cyclics in some drivers? Try not to use 'em if you don't have to. What I usually do is cheat and use the on-chip timer if I need some kind of time-driven functionality.

GLDv3 never explicitly initializes the physical addresses on the NIC. GLDv2 always called gldm_set_mac_addr(), so some drivers expect this. You may need to do that yourself in the mac_start() routine.
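
Combining this with the reset note above, the start routine can reset the chip and then program the station address itself, and the same reset helper can be shared with the DDI_ATTACH and DDI_RESUME paths. A userspace model with a hypothetical register layout and helper names; curraddr stands for whatever address the driver read at attach time or last received through its unicast-address (mc_unicst) entry point.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical register image; real drivers use ddi_put8()/ddi_put32(). */
struct ss_regs {
    uint8_t par[6];  /* station (unicast) address registers */
    int     enabled;
};

struct ss_softc {
    struct ss_regs regs;
    uint8_t curraddr[6];  /* factory address or last mc_unicst value */
    int     running;
};

/* Hypothetical chip reset; also usable from attach and resume. */
static void
ss_reset(struct ss_softc *sc)
{
    memset(&sc->regs, 0, sizeof (sc->regs));
}

/*
 * Start entry point: nothing programs the address for us beforehand,
 * so write it back after every reset, then enable the chip.
 */
static int
ss_m_start(void *arg)
{
    struct ss_softc *sc = arg;

    assert(sc != NULL);
    ss_reset(sc);
    memcpy(sc->regs.par, sc->curraddr, sizeof (sc->regs.par));
    sc->regs.enabled = 1;
    sc->running = 1;
    return (0);
}
```
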

Anyway, maybe those notes will save someone somewhere else some effort. Or inspire someone to pick up another driver and convert it.

All these NICs...

So I need to JumpStart a new system today... no problem, I'll just stick in a NIC and boot it with my Etherboot PXE CDROM. No problem, right?

Well, let's see, first I need a NIC that Solaris supports. Inventorying the spare hardware I have today:
  • Netgear GA311, rev A1 (RealTek 8169S-32, unsupported variant of rge)
  • Netgear FA311, rev C1 (Nat-Semi DP83815D, unsupported)
  • Netgear FA310TX, rev D2 (Lite-On LC82C169, unsupported, see below)
  • 3Com 3CR990 TX-97 (unsupported)
  • D-Link 530TX rev A-1 (dmfe, no x86 support)
  • ZyXEL GigE (VIA GbE chip, support uncertain)
  • Linksys LNE100TX v4.1 (not yet supported, see below)
  • Linksys NC100 (not yet supported, see below)
  • Macronix MX98715AEC (not yet supported, see below)
  • Unbranded RTL8139B (supported, rtls, Nevada only)
  • 3Com 3C900-TX (supported, elxl, for now)
Well, at least I was able to find something. Of all these spare NICs, two have marginal support. (And this is only the wired Ethernet NICs; I have some WLAN devices as well.)

I guess I have a habit of collecting NICs.

Now, the Linksys boards will soon be supported by afe, if Alan ever gets the putback of my driver done. The Macronix board will be supported by mxfe later this week, once I get it reviewed and putback.

At one point I had a driver (pnic) sort of working for the LC82C169 (Lite-On PNIC), but I abandoned it because the PNIC was such a piece of crap that I figured anyone with one of these was better off throwing it away and replacing it with another NIC (as long as it wasn't a Realtek 8139!). Maybe I'll revive that project one day. Probably not, though, since I don't think Lite-On sold many of them. (The PNIC has some horrible hardware bugs, and its two major revisions, the 82C169 and 82C168, handle 802.3u autonegotiation quite differently.)

I also started a driver for the Nat-Semi chip (nsfe), but abandoned it. I think this chip is also found on motherboards, where it shows up as an SiS part. I think Murayama also has a driver available for it.

I'd really like to see support for the others expanded. Maybe I need to look at dmfe some more, because there really shouldn't be any reason it couldn't support x86 platforms. (D-Link sold a lot of DFE-530TX boards, IIRC.)

This also suggests that the elxl driver, which has been slated for EOF, really shouldn't be. One of the reasons I've kept that old NIC around is that it was one of the few supported by Solaris 8 and earlier, and I suspect I'm not the only one who has done this. I think the problem is that this driver is not open source. But open source variants exist... maybe someone should look at replacing elxl in Solaris Nevada with a FOSS replacement.

Murayama has already written drivers for some of these. I would dearly like to see his vel driver in Solaris Nevada, along with a conversion to GLDv3.

Tuesday, May 22, 2007

IP Instances, GLDv3, and mxfe

I recently decided that I wanted to create a zone with an exclusive IP instance, so that I could run IPsec (and specifically "punchin") in it. I have lots of NICs floating around, so I thought it would be trivial.

Turns out that all my NICs were GLDv2, and that IP instances require GLDv3.

My solution? Convert mxfe (the driver for the cards I have in my system) to GLDv3. I figured it would be easier and faster than going out and buying a new Realtek card. And it would have been, if not for one really annoying problem in mcopymsg() (see my previous post for that rant).

Anyway, mxfe is humming away nicely now as a GLDv3 NIC on my system. I even got VLANs working with full MTU frames. Yay. I filed PSARC 2007/291 today, if you're interested in it. I'll post the driver sources up somewhere shortly.

(On another note, mxfe and afe are "suboptimal" drivers... they just blindly bcopy data, do nothing to reduce tx interrupts, and basically violate all the normal rules for making performant NIC drivers. But they work pretty well, for all that.)

Why Side Effects Are Bad

This entry could just as easily have been titled "Why Bad Documentation is Worse Than No Documentation".

I noticed that some functions from strsun.h are now part of the DDI. Great, I thought, I'll update my driver to use them as part of the general GLDv3 cleanup.

One major surprise, which took me about 5-6 hours to figure out tonight: mcopymsg(9F) has a side effect that isn't documented!

Specifically, it does freemsg() on the buffer passed in.

Don't believe me? Check the source!
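
For anyone without the source handy, here is a userspace model of the behavior: walk the b_cont chain copying each block into the caller's buffer, then freemsg() the whole message. The cm_* names and the cut-down mblk are hypothetical stand-ins, not the kernel's types; cm_copy_nofree() is the variant you want when you still need the message afterwards.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Minimal stand-in for mblk_t: read/write pointers plus b_cont. */
typedef struct cm_mblk {
    struct cm_mblk *b_cont;
    unsigned char  *b_rptr;
    unsigned char  *b_wptr;
    unsigned char   data[64];
} cm_mblk_t;

#define CM_MBLKL(mp)  ((size_t)((mp)->b_wptr - (mp)->b_rptr))

/* Hypothetical allocator: one block holding a copy of len bytes. */
static cm_mblk_t *
cm_allocb_copy(const void *src, size_t len)
{
    cm_mblk_t *mp = calloc(1, sizeof (*mp));

    assert(mp != NULL && len <= sizeof (mp->data));
    memcpy(mp->data, src, len);
    mp->b_rptr = mp->data;
    mp->b_wptr = mp->data + len;
    return (mp);
}

/* Stand-in for freemsg(): frees every block on the b_cont chain. */
static void
cm_freemsg(cm_mblk_t *mp)
{
    while (mp != NULL) {
        cm_mblk_t *next = mp->b_cont;
        free(mp);
        mp = next;
    }
}

/* Copies the whole message into bufp WITHOUT freeing it. */
static size_t
cm_copy_nofree(const cm_mblk_t *mp, void *bufp)
{
    unsigned char *dest = bufp;
    size_t total = 0;

    for (; mp != NULL; mp = mp->b_cont) {
        size_t n = CM_MBLKL(mp);
        memcpy(dest + total, mp->b_rptr, n);
        total += n;
    }
    return (total);
}

/* Models mcopymsg(9F): the same copy, then the message is freed. */
static void
cm_mcopymsg(cm_mblk_t *mp, void *bufp)
{
    (void) cm_copy_nofree(mp, bufp);
    cm_freemsg(mp);  /* the caller's mp is now dangling */
}
```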

The manual page says nothing about this. And reading the name logically, you'd never expect it to do this. The side effect should never have been designed in. But if the man page mentioned this side effect, I might, just might, have caught this problem a couple of hours ago.

In my particular case, it was causing hard hangs most of the time, until I finally got a panic that pointed me to the root of the problem. (Yes, I probably should have set kmem_flags != 0. Next time.)

/me throws brick at whoever wrote and edited the man page.
/me throws pallet of bricks at whoever designed mcopymsg with this side effect in the first place

Arrgh. Well, this will probably help me figure out several problems I've run into lately.

Sunday, May 13, 2007

ZFS to the rescue!

So this weekend I had to do a system reinstall. Thankfully I had all my data on a pair of SATA drives in a ZFS raidz. But I had to totally reinstall my system with a new motherboard, new SATA controller, etc.

I had to redo a bunch of things manually... NIS, passwd, DHCP, etc.

The one thing that I didn't have to worry about: ZFS. I just plugged my SATA drives in and ran "zpool import -f data" (my pool was called "data", which I could have figured out by just running "zpool import" without options).

That was it. One command only, and my raidz mirror was back in business, mounted in the right place, and even the right ZFS filesystems were NFS exported with the right options. Thank-you ZFS!


ZFS developers, I owe you a round, or three. Let me know if you want to collect. :-)

Favorite things in OpenSolaris not in Solaris 10

A few things that I love about OpenSolaris, but that Solaris 10 lacks:

* Xorg default support for Intel GMA 950
* SATA ATAPI device support
* WPA (coming in build 64)
* NWAM (network auto-magic)
* dmfe is GLDv3

I'm not sure which of these will be coming to a Solaris 10 update in the future, but I can tell you I was immensely pleased with my upgrade from Solaris 10 update 4 (on an Intel mobo, with a Core 2 Duo cpu) to Solaris Nevada b62. As part of the deal, I switched to a SATA DVD drive; my system is now entirely SATA (no legacy PATA ribbon cables in the box!)

I'm sure there are lots of other useful features too, but at this point, I've put Nevada into "production" use at home, and I'm not looking back.

(And yes, I realize b62 isn't the latest, but I didn't have a copy of snv_63, and the machine I was "reinstalling" (thanks to an unexpected mobo replacement) was my network server.)

Learning to hate SMF

Some of you may recall my recent putback of the removal of in.tnamed.

Well, there has been some nasty fallout, thanks to SMF and the upgrade process. snv64 (which will have to be respun as a result of this nastiness) was hanging during upgrade, thanks to chicken-and-egg dependencies in the upgrade script.

I fixed the hang, but there is a warning message coming from inetd that I can't seem to locate.

Along the way, I've found references to the network/tname service in a few surprising places. The things I've had to edit, thanks to SMF:

usr/src/tools/scripts/bfu.sh
usr/src/pkgdefs/SUNWcsr/postinstall
usr/src/cmd/svc/profile/generic_net_limited.xml
usr/src/cmd/svc/prophist/prophist.SUNWcsr
usr/src/cmd/cmd-inet/usr.sbin/tname.xml (removed)

And *still* we see a warning from inetd. See 6556092 in bugster for more info.

Anyway, I'm waiting for folks to decide whether to allow the warning to stay, or to back out the change to remove in.tnamed. If the latter is taken, I will run screaming from the process and just leave in.tnamed alone.

In my opinion, removing tname.xml should have been sufficient. But thanks to SMF's binary databases, it turns into a major headache. Can someone from the SMF team please unravel this maze?

Friday, May 4, 2007

LSI MegaRaid SAS driver (Thanks dlg!)

David Gwynne (dlg on #opensolaris) has created a very nice driver for LSI MegaRaid SAS controllers. You can find it here.

I have not got any hardware, so I've not tested it, but this driver is the model of simplicity and elegance for an HBA, from what I can tell, weighing in at only 1500 lines. A great deal of that is no doubt thanks to the simple model of the hardware, but the simplicity and elegance in the driver should be credited to David as well.

I'd like to sponsor this myself for integration into Nevada, but I haven't got any hardware. If you have hardware to loan for qualification testing, give me a shout, because this looks like a prime candidate for a Nevada integration.

hme gldv3 status report

The conversion of hme to gldv3 looks like it is a success. The driver "Just Worked" from the first time I loaded it. Yay.

Still to be tested are the main areas of risk: VLANs, SUSPEND/RESUME, and DDI detach. Stay tuned for more on that front.

I'm going to have to preserve qfe as a separate driver, I think, because renaming/renumbering devices is just going to cause too much grief in the field. But what I'm going to do is make hme.c and qfe.c very small (say ~50 lines each) and have them use a common misc module to provide the entire functionality.

I have now received several qfe boards as well, so I'll be testing on x86 soon too.

Watch this space for the code review to be posted.

For the curious, some size comparisons:

gd78059@sr1-umpk-52{8}> wc pcic/usr/src/uts/sun/io/hme.c{,.orig}
6498 19423 171291 pcic/usr/src/uts/sun/io/hme.c
8889 26403 232421 pcic/usr/src/uts/sun/io/hme.c.orig
15387 45826 403712 total

Size in the kernel (as reported by modinfo): old = 63384, new = 47184

Wednesday, May 2, 2007

PSARC 2007/243 approved

Subject says it all. This is the eri conversion to Nemo. There is more testing yet to be done, but note that this means that eri will inherit VLAN and link aggregation support. Neat, huh?

In the Bay Area this week

I'm up in MPK this week. (Wed, Thur, and leaving Friday night.) If any fellow Solaris geeks up here are up for a pub outing, e-mail me.