Friday, June 4, 2010

O_SYNC behavior not honored

UPDATE (6/21/2010): This problem is apparently solved in b142, and probably in other builds as well. I was unable to reproduce it with real hardware on b142.

Note that VMware does not honor cache flushing, so VMware (and possibly other v12n users) will potentially still see this issue.

So, it turns out that ZFS in recent builds (apparently somewhere after build 134) has a critical bug: O_SYNC writes are not really synchronous. This can lead to data loss.
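
For readers unfamiliar with the flag, here is a minimal sketch of what a synchronous write looks like from an application. It is written in Python for brevity and is not taken from any ZFS code; the helper name `write_sync` and the demo file name are made up for illustration. With O_SYNC, `write()` is not supposed to return until the data has reached stable storage, and that is the promise that appeared to be broken here.

```python
import os
import tempfile

def write_sync(path, data):
    """Write data to path with O_SYNC and return the number of bytes written.

    Hypothetical helper for illustration only.
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o600)
    try:
        # With O_SYNC, this call is supposed to block until the data
        # (and the metadata needed to read it back) is on stable storage.
        return os.write(fd, data)
    finally:
        os.close(fd)

if __name__ == "__main__":
    demo = os.path.join(tempfile.mkdtemp(), "osync-demo.dat")
    written = write_sync(demo, b"committed record\n")
    print(written, "bytes written synchronously to", demo)
```

An application that acknowledges a transaction after this call returns is relying on the data surviving a power loss immediately afterwards.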

I've not yet figured out which change introduced the bug, but I hope to work on it next week.

In the meantime, I would strongly discourage use of post-134 binaries for anything where data integrity is important.

I've filed a P1 bug with Oracle for this issue. I'll be trying to nail it down further next week; if I'm able to fix it before Oracle can, I'll offer up my fix.

I'll post the CR number when I receive it.

I imagine that this bug, which is trivially reproducible, will be getting top priority from the ZFS engineers next week.


UPDATE: CR number is 6958848

The link to access it isn't available yet.

17 comments:

richlowe said...

Sure would be nice if 'hg bisect' took a list of paths, and only cared about revisions matching those paths. As it is, not even sure a bisect is going to be the easiest way to pin down a rev in [134, tip).

Maybe if you guys are 134+minimal backports it'd be less crappy?
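
The path-restricted bisect wished for above can be sketched in a few lines of Python. This is not how `hg bisect` works internally; it is a toy model in which the function name `first_bad`, the revision list, and the `is_bad` predicate are all hypothetical. It assumes history goes from good to bad exactly once, and that the culprit touches one of the given paths; if the second assumption is wrong, the search lands on the wrong revision.

```python
def first_bad(revisions, is_bad, paths=None):
    """Binary-search for the first bad revision.

    revisions: list of (rev_id, touched_paths) in commit order.
    is_bad:    callable taking a rev_id, True if that build shows the bug.
    paths:     optional path prefixes; when given, only revisions touching
               one of them are candidates (this is the assumed restriction).
    Returns (culprit_or_None, number_of_builds_tested).
    """
    if paths is None:
        candidates = [rev for rev, _ in revisions]
    else:
        candidates = [
            rev for rev, touched in revisions
            if any(t.startswith(p) for t in touched for p in paths)
        ]
    lo, hi = 0, len(candidates)
    tests = 0
    while lo < hi:
        mid = (lo + hi) // 2
        tests += 1
        if is_bad(candidates[mid]):
            hi = mid          # culprit is at mid or earlier
        else:
            lo = mid + 1      # culprit is after mid
    culprit = candidates[lo] if lo < len(candidates) else None
    return culprit, tests

if __name__ == "__main__":
    # Made-up history: every tenth revision touches ZFS code.
    history = [
        (i, ["usr/src/uts/common/fs/zfs/zil.c"] if i % 10 == 0
            else ["usr/src/cmd/ls/ls.c"])
        for i in range(1000)
    ]
    print(first_bad(history, lambda rev: rev >= 570))
    print(first_bad(history, lambda rev: rev >= 570,
                    paths=["usr/src/uts/common/fs/zfs/"]))
```

Since each test here means building and booting a kernel, shrinking the candidate list by a factor of ten saves a few full build cycles.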

Garrett D'Amore said...

We're somewhere between minimal backports and all of tip. Previously they had taken most of the ZFS patches, but we'll now be revisiting our patching strategy going forward. Nonetheless, the problem exists in Nevada today, so it does need to be fixed for all OpenSolaris users.

marmel.forums said...

Hi Garrett,
Thank you for your blog! It's great! :-)
As for the ZFS data loss, I am now worried about the possibility of losing my data. I'm using OpenSolaris b134... How reliable is it? Should I detach one disk from my mirror and copy everything to it using another filesystem?
Keep up the great work!
Thank you!

Garrett D'Amore said...

b134 is safe. It's going *beyond* 134 that is risky.

If you have redundancy in your pool, or you don't depend on synchronous writes, then you shouldn't have problems going even beyond 134.

I'm sure the bug will be fixed soon.

marmel.forums said...

Hi Garrett,
Thank you so much for your reply! :-)
BR,
Marcos.

nadkarni said...

Are you sure you are not getting hit because of this putback?
[PSARC/2010/108] zil synchronicity

zfs datasets now have a new 'sync' property to control synchronous behaviour.
The zil_disable tunable to turn synchronous requests into asynchronous requests (disable
the ZIL) has been removed. For systems that use that switch on upgrade you will now
see a message on booting:

sorry, variable 'zil_disable' is not defined in the 'zfs' module

Please update your system to use the new sync property.

Garrett D'Amore said...

I'm sure ... I'm using a kernel without that change in it, and it suffers the same problem.

I've also used a stock installation of b142-ish bits, and I never used the zil_disable option.

I've already explained this to the author of the zil synchronicity changes, because I originally suspected that this was somehow involved.

richlowe said...

Yeah, the comment regarding all of tip v. backports was merely regarding the number of revisions 'hg bisect' would need to search, if you did it that way.

milek said...

Neil tried to reproduce the problem with no success. However, he noticed that the test Garrett did was within VMware... so it might be that the issue has nothing to do with ZFS.

You could try to look for DKIOCFLUSHWRITECACHE inside a VM and on the host OS as well. It might be that VMware is not passing them through, or something like that.

Garrett D'Amore said...

I've seen this problem on real hardware; however the real hardware was running a custom Nexenta kernel which was derived from 134 plus a bunch of assorted ZFS patches. One of the things I'm going to do today is see if I can reproduce with 142 on the real hardware, since I realize that results obtained with the Nexenta kernel are basically useless.

Garrett D'Amore said...

Okay, I've just tested b142 on the same real hardware where I encountered the problem with the Nexenta kernel. The good news: the problem does not reproduce!

So, VMware == evil for sync. And, apparently, we have a bug in Nexenta that I'll have to track down. I wonder now if the ZIL synchronicity fix from Robert addresses this problem.

Nexenta folks, watch for this in a future release. :-)

ScottL said...

Hi Garrett,

I see that Nexenta now has a backport fix for this in 3.0.4: "ZFS O_SYNC and O_DSYNC lies (silent data loss)" (backport: 6958848).

I am trying to put 2 and 2 together here because I also noticed this post on the Nexenta forum from a gentleman who recently upgraded to 3.0.4: http://www.nexenta.com/corp/forum?func=view&catid=6&id=1390

I am a little worried as to whether I should upgrade to 3.0.4 or not. I use Nexenta strictly for our VMware environment.

Any thoughts?

-Scott

Garrett D'Amore said...

I've not seen the problem that was reported there, and this is the first I've heard about it. I can tell you that we did test VMware in our environment.

As with any new release, it makes sense to perform your own testing if possible. The community edition (which is free) makes it possible to do this without putting your critical data at risk.

If I have more detail about this corruption, then I can possibly work on getting a fix done for it, if it is confirmed to be a genuine bug.

(Note that we had a number of release candidates of 3.0.4, some of which *did* suffer from various problems.. so it is entirely possible that the poster in question was using an unofficial version of 3.0.4.)

I'll have our guys do some testing on this as well.

ScottL said...

Thanks Garrett, I appreciate the info and your time!

-Scott

Stephane said...

Garrett,

I'm confused: how could VMware not honor synchronous IO calls? As you obviously know, it is critical to make journaling work on any filesystem and to avoid corruption during a crash or outage. The same goes for ACID databases. From what I know, VMware is used in production environments all the time with all sorts of usage scenarios, so I find it hard to believe that it wouldn't virtualize proper hardware behavior.

Also, I think I remember reading that VirtualBox doesn't honor those out of the box, but needs a config change for that. I find it incredible that this is still the default. This is so important, even for a simple Windows guest: e.g. when Windows XP runs on FAT32, an improper shutdown (from a simple crash or outage) will trigger a whole file system check on reboot, but with NTFS, since it has a journal, it doesn't need to. But if the VM hypervisor doesn't honor the final (virtualized) hardware block write sync (including the write cache flush), then all is lost and you get corruption.

Please enlighten me!

Oh, and while we're at it, one last question: can ZFS still survive such a scenario without corruption? (Assuming sync is not honored but at least ordering is.) Sure, the applications requesting sync might get hosed, but just from the point of view of the whole ZFS metadata (pool, filesystem, etc.), should everything be in a sane, consistent state on reboot?

I'm asking this because of the copy-on-write semantics and the whole tree organization of ZFS. Basically, the last operation that atomically finalizes the filesystem update is the uberblock write. The only problem I can see right off the bat is that the uberblock is 1024 bytes, not 512, so (on an HD with 512-byte sectors) it could still end up with a torn uberblock, though even there, maybe ZFS has a way out too (replicated uberblocks, etc.).
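
To make the torn-uberblock question concrete, here is a toy model in Python. It is not the ZFS on-disk format: the slot layout, the function names, and the 1024-byte size split are simplifications chosen for illustration. The idea it sketches, which as far as I understand matches what ZFS does in spirit, is to keep several checksummed uberblock slots and, on import, use the one with the highest transaction group number whose checksum still verifies. A half-written slot fails its checksum and is simply skipped.

```python
import hashlib
import struct

def make_slot(txg, payload):
    """Serialize a toy uberblock: 8-byte txg, payload, SHA-256 of both."""
    body = struct.pack("<Q", txg) + payload
    return body + hashlib.sha256(body).digest()

def parse_slot(raw):
    """Return (txg, payload) if the checksum verifies, else None."""
    if len(raw) < 8 + 32:
        return None
    body, digest = raw[:-32], raw[-32:]
    if hashlib.sha256(body).digest() != digest:
        return None  # torn or corrupt slot
    (txg,) = struct.unpack("<Q", body[:8])
    return txg, body[8:]

def pick_active(slots):
    """Pick the valid slot with the highest txg, or None if none verify."""
    valid = [parsed for parsed in map(parse_slot, slots) if parsed is not None]
    return max(valid, key=lambda parsed: parsed[0]) if valid else None

if __name__ == "__main__":
    size = 1024 - 8 - 32  # payload size so each toy slot is 1024 bytes
    old = make_slot(8, b"a" * size)
    new = make_slot(12, b"d" * size)
    # Simulate a crash after only the first 512-byte sector was written.
    torn = new[:512] + old[512:]
    slots = [make_slot(10, b"b" * size), make_slot(11, b"c" * size), torn]
    print(pick_active(slots)[0])
```

In this model a torn write of txg 12 costs only the most recent transaction group: the pool comes back at txg 11. That only holds if the blocks txg 11 points to were really on disk before its uberblock was, which is exactly what cache flushes and write ordering are for.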

Of course, all of this goes out the window if, on top of not honoring sync, the hypervisor *reorders* writes too. Then, there's really nothing that can save you.

Garrett D'Amore said...

I don't know how VMware justifies it... I just know that the behavior is what I observed.

ZFS *should* be OK in these situations, although I'm not 100% certain that there isn't an edge case in there somewhere.

At the end of the day, it's *really* important that drives (real or emulated) honor cache flushes. If they don't, then all bets are off.
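
One rough way to get a hint about whether flushes are being honored is to time them. The sketch below is a heuristic, not a proof, and the function name and parameters are made up for illustration. It issues small writes, each followed by `fsync()`, and reports the achieved rate. A 7200 rpm disk completes only 120 rotations per second, so a rate far above that on a single spinning disk with no non-volatile write cache suggests the flushes are not reaching the media. SSDs and controllers with battery-backed cache can legitimately be much faster, so interpret the number with the hardware in mind.

```python
import os
import tempfile
import time

def sync_writes_per_second(directory, count=50, size=512):
    """Time `count` write+fsync pairs in `directory` and return the rate.

    Hypothetical probe for illustration; it creates and removes a
    temporary file named flush-probe.tmp.
    """
    path = os.path.join(directory, "flush-probe.tmp")
    block = b"\0" * size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, block)
            os.fsync(fd)  # ask the OS to push the data to stable storage
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.remove(path)
    return count / elapsed if elapsed > 0 else float("inf")

if __name__ == "__main__":
    rate = sync_writes_per_second(tempfile.mkdtemp())
    print("about %.0f synchronous writes per second" % rate)
```

Running the same probe inside a guest and on the host gives a quick comparison; a definitive test still means pulling power and checking what survived.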

Stephane said...

Hi Garrett,

Thanks for your reply.

I completely agree with that, and given how crucial this is, I'm always shocked when I see a critical storage sw layer not taking this as seriously as it should.

Cheers (& happy new year in 2 days!)