Wednesday, July 7, 2010

ZFS disk monitoring...

So I've posted this on zfs-discuss at opensolaris dot org, but it's been suggested I mention it here too.

It turns out that the ZFS/FMA integration doesn't pick up on drive removals for most disk devices until the filesystem attempts to perform some I/O to the drive. This is rather unfortunate, because if a file system is not busy, you might suffer a loss of redundancy and not find out about it until it is too late.

It also means that you won't know about failures of hot spare devices until you need to put them into service, since by definition they are idle. (Note: as an exception, running periodic scrubs should detect this too. However, scrubs are highly intrusive to the overall I/O load on the system, so they probably should not be performed too often.)

I'm told the Oracle 7000 series appliances have a solution for this problem, but of course the source for that is not in OpenSolaris. (Apparently there are quite a few differences in the core OS between the 7000 series and vanilla OpenSolaris -- unfortunately we can't know because -- unlike with NexentaStor -- we don't have access to the kernel source tree!)

This is not good for folks who use ZFS with ordinary Solaris 10 or OpenSolaris... or with derivatives such as NexentaStor.

To address that problem, I've developed some code called "zfs-monitor" that periodically monitors the health of any physical vdev (disk) that is part of a ZFS pool (hot spare, log, or real device). This code is implemented as an FMA module. When a disk goes offline, zfs-monitor detects it and triggers an FMA event, which allows ZFS to do the right thing. This means that if a disk goes away, even if it isn't in use, whatever action is appropriate will be performed. (The fault is logged in the FMA fault logs, and if appropriate, a hot spare will be recruited to replace the failed or offline device.)
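To make the idea of periodic, active probing concrete, here is a rough user-space sketch in Python. This is not the zfs-monitor code (that is an FMA module, and it raises FMA events rather than calling a callback). The names probe_device, find_unhealthy and monitor are hypothetical, as are the 512-byte read, the 60-second interval, and the assumption that you already have a list of device paths for the pool's vdevs. It only illustrates the principle: touch every device on a timer, including idle ones like hot spares, instead of waiting for the filesystem to issue I/O.

```python
import os
import time


def probe_device(path, nbytes=512):
    """Return True if nbytes can be read from the start of path.

    Hypothetical health check: a device that cannot be opened, or that
    returns an error or a short read, is treated as unhealthy.
    """
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError:
        return False
    try:
        return len(os.read(fd, nbytes)) == nbytes
    except OSError:
        return False
    finally:
        os.close(fd)


def find_unhealthy(paths):
    """Return the subset of paths that fail the probe."""
    return [p for p in paths if not probe_device(p)]


def monitor(paths, interval=60.0, report=print, rounds=None):
    """Probe every device each interval and report the failures.

    report stands in for whatever raises the alarm (in the real module,
    an FMA event). rounds=None means loop forever.
    """
    done = 0
    while rounds is None or done < rounds:
        for path in find_unhealthy(paths):
            report(path)
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

You would call monitor() with the paths of every vdev in the pool, spares and log devices included. A probe like this is best pointed at raw (character) device nodes, since a read through a cached block device could be satisfied without ever reaching the disk.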

This code is part of NexentaStor 3.0.3. As there are some semantic differences of opinion (what constitutes device failure versus intentional removal by an administrator), the code is unlikely to be pushed into ON without further change. (At the same time, I've fixed a different problem in the ZFS FMRI parsing code, and I've submitted a request to get that fix integrated -- but I've not heard back from anyone at Oracle who is willing to sponsor the change yet.)

I'm happy to share the code for zfs-monitor with anyone who requests it. (In fact, you can examine the code in our open Mercurial repository directly!) Note that for it to work properly, you will also need the fix for the ZFS FMRI parsing bug just mentioned.

At Nexenta, we're committed to innovating and improving upon the great foundation of ZFS and OpenSolaris, and to the reasonable extent possible, we want to share those innovations with the greater OpenSolaris community. Hopefully changes like this demonstrate this commitment in a tangible fashion.


grant said...

very cool. will this also make its way into NCP?

Garrett D'Amore said...

Yes. In fact, NexentaStor is based on NCP -- the nexenta-gate I referenced in the post is used as the source for both NCP and NexentaStor. There are some NexentaStor additions (the storage management software!) and customizations, and the timing of release of bits may differ from NCP slightly, but NS is built on top of NCP.

maetu said...

Garret - this is great! :-)