O_SYNC behavior not honored
UPDATE (6/21/2010): This problem is apparently solved in b142, and probably in other builds as well; I was unable to reproduce it on real hardware with b142.
Note that VMware does not honor cache flushing, so VMware (and possibly other v12n users) will potentially still see this issue.
I've not yet figured out which change introduced the bug, but I hope to work on it next week.
In the meantime, I would strongly discourage use of post-134 binaries for anything where data integrity is important.
I've filed a P1 bug with Oracle for this issue. I'll be trying to nail it down further next week; if I'm able to fix it before Oracle can, I'll offer up my fix.
I'll post the CR number when I receive the number back.
I imagine that this bug, which is trivially reproducible, will be getting top priority from the ZFS engineers next week.
UPDATE: CR number is 6958848
The link to access it isn't available yet.
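For readers who want a rough sanity check of their own setup, here is a minimal Python sketch that times O_SYNC writes. It is not the reproduction used for the bug report; the function name `timed_sync_writes` and its parameters are made up for illustration. The idea is only that a write opened with O_SYNC should not return until the data is on stable storage, so latencies far below what the underlying device can physically deliver are a hint (not proof) that something in the stack is acknowledging writes early.

```python
import os
import time


def timed_sync_writes(path, count=20, size=4096):
    """Write `count` blocks of `size` bytes with O_SYNC and time each write.

    Returns a list of per-write latencies in seconds. The helper name and
    defaults are illustrative assumptions, not part of any ZFS tooling.
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o600)
    latencies = []
    block = b"\xa5" * size
    try:
        for _ in range(count):
            start = time.perf_counter()
            written = os.write(fd, block)  # should block until data is stable
            latencies.append(time.perf_counter() - start)
            if written != size:
                raise OSError("short write: %d of %d bytes" % (written, size))
    finally:
        os.close(fd)
    return latencies
```

Call it with a path on the filesystem under test and compare the median latency with what the disk could plausibly sustain. A timing check like this cannot replace a real power-loss test; it only flags results that look too good to be true.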
Comments
Maybe if you guys are 134+minimal backports it'd be less crappy?
Thank you for your blog! It's great! :-)
As for the ZFS data loss, I am now worried about the possibility of losing my data. I'm using OpenSolaris b134... How reliable is it? Should I detach one disk from my mirror and copy everything to it using another filesystem?
Keep up the great work!
Thank you!
If you have redundancy in your pool, or you don't depend on synchronous writes, then you shouldn't have problems going even beyond 134.
I'm sure the bug will be fixed soon.
Thank you so much for your reply! :-)
BR,
Marcos.
[PSARC/2010/108] zil synchronicity
zfs datasets now have a new 'sync' property to control synchronous behaviour.
The zil_disable tunable to turn synchronous requests into asynchronous requests (disable
the ZIL) has been removed. For systems that use that switch on upgrade you will now
see a message on booting:
sorry, variable 'zil_disable' is not defined in the 'zfs' module
Please update your system to use the new sync property.
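As a rough illustration of the new property, the sketch below builds the `zfs set sync=...` command from Python. The three values (`standard`, `always`, `disabled`) are the ones the `sync` property accepts as far as I know; the helper names and the dataset name `tank/vmstore` are hypothetical.

```python
import subprocess

# Values accepted by the 'sync' dataset property (to my knowledge).
VALID_SYNC_VALUES = ("standard", "always", "disabled")


def zfs_set_sync_argv(dataset, value):
    """Return the argv for `zfs set sync=<value> <dataset>` (hypothetical helper)."""
    if value not in VALID_SYNC_VALUES:
        raise ValueError("sync must be one of %s" % ", ".join(VALID_SYNC_VALUES))
    if not dataset:
        raise ValueError("dataset name must not be empty")
    return ["zfs", "set", "sync=%s" % value, dataset]


def apply_sync(dataset, value):
    """Run the command; needs the zfs CLI and sufficient privileges."""
    subprocess.run(zfs_set_sync_argv(dataset, value), check=True)


# Example: what one would run instead of the old zil_disable tunable.
# 'tank/vmstore' is a made-up dataset name.
example_argv = zfs_set_sync_argv("tank/vmstore", "disabled")
```

Note that `sync=disabled` gives up the same guarantees that `zil_disable` did, just per dataset instead of system-wide.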
I've also used a stock installation of b142-ish bits, and I never used the zil_disable option.
I've already explained this to the author of the zil synchronicity changes, because I originally suspected that this was somehow involved.
You could try to look for DKIOCFLUSHWRITECACHE inside a VM and on the host OS as well. It might be that VMware is not passing them through, or something like that.
So, VMware == evil for sync. And, apparently, we have a bug in Nexenta that I'll have to track down. I wonder now if the ZIL synchronicity fix from Robert addresses this problem.
Nexenta folks, watch for this in a future release. :-)
I see that Nexenta now has a backported fix for this in 3.0.4: "ZFS O_SYNC and O_DSYNC lies (silent data loss) (backport: 6958848)".
I am trying to put 2 and 2 together here because I also noticed this post on the Nexenta forum from a gentleman who recently upgraded to 3.0.4: http://www.nexenta.com/corp/forum?func=view&catid=6&id=1390
I am a little worried as to whether I should upgrade to 3.0.4 or not. I use Nexenta strictly for our VMware environment.
Any thoughts?
-Scott
As with any new release, it makes sense to perform your own testing if possible. The community edition (which is free) makes it possible to do this without putting your critical data at risk.
If I have more detail about this corruption, then I can possibly work on getting a fix done for it, if it is confirmed to be a genuine bug.
(Note that we had a number of release candidates of 3.0.4, some of which *did* suffer from various problems, so it is entirely possible that the poster in question was using an unofficial version of 3.0.4.)
I'll have our guys do some testing on this as well.
-Scott
I'm confused: how could VMware not honor synchronous I/O calls? As you obviously know, honoring them is critical to making journaling work on any filesystem and to avoiding corruption during a crash or outage. The same goes for ACID databases. From what I know, VMware is used in production environments all the time in all sorts of usage scenarios, so I find it hard to believe that it wouldn't virtualize proper hardware behavior.
Also, I think I remember reading that VirtualBox doesn't honor those out of the box, but needs a config change for that. I find it incredible that this is still the default. This is so important, even for a simple Windows guest: e.g., when Windows XP runs on FAT32, an improper shutdown (from a simple crash or outage) will trigger a whole file system check on reboot, but with NTFS, since it has a journal, it doesn't need to. But if the hypervisor doesn't honor the final (virtualized) hardware block write sync (including the write cache flush), then all is lost and you get corruption.
Please enlighten me!
Oh, and while we're at it, one last question: can ZFS still survive such a scenario without corruption? (Assuming sync is not honored but at least ordering is.) Sure, the applications requesting sync might get hosed, but from the point of view of the ZFS metadata as a whole (pool, filesystem, etc.), should everything be in a sane, consistent state on reboot?
I'm asking this because of the copy-on-write semantics and the whole tree organization of ZFS. Basically, the last operation that atomically finalizes a filesystem update is the uberblock write. The only problem that I can see here right off the bat is that the uberblock is 1024 bytes, not 512, so on a hard disk with 512-byte sectors it spans two sectors and could still be torn, though even there, maybe ZFS has a way out too (replicated uberblocks, etc.).
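To make the torn-uberblock concern concrete, here is a toy Python model. It is not the ZFS on-disk format: the layout (a transaction number followed by padding and a SHA-256 checksum), the three-slot ring, and all function names are assumptions made for illustration. It only demonstrates the general technique the question hints at: if each copy carries its own checksum and older copies are kept around, a half-written copy fails verification and the newest intact copy is used instead.

```python
import hashlib
import struct

UBERBLOCK_SIZE = 1024  # size quoted in the comment above
SECTOR_SIZE = 512      # 512-byte-sector disk, so one uberblock spans 2 sectors
CHECKSUM_SIZE = 32     # SHA-256 digest kept in the last 32 bytes (toy layout)


def make_slot(txg):
    """Build a toy 1024-byte uberblock: txg, zero padding, then a checksum."""
    body = struct.pack("<Q", txg).ljust(UBERBLOCK_SIZE - CHECKSUM_SIZE, b"\0")
    return body + hashlib.sha256(body).digest()


def txg_of(slot):
    """Read the transaction group number from the first 8 bytes."""
    return struct.unpack_from("<Q", slot)[0]


def is_valid(slot):
    """A slot is valid only if its stored checksum matches its body."""
    if len(slot) != UBERBLOCK_SIZE:
        return False
    body, stored = slot[:-CHECKSUM_SIZE], slot[-CHECKSUM_SIZE:]
    return hashlib.sha256(body).digest() == stored


def torn_write(old, new):
    """Simulate a crash after only the first sector of `new` reached disk."""
    return new[:SECTOR_SIZE] + old[SECTOR_SIZE:]


def pick_active(ring):
    """Choose the valid slot with the highest txg, or None if none verify."""
    valid = [s for s in ring if is_valid(s)]
    return max(valid, key=txg_of) if valid else None


def demo():
    ring = [make_slot(10), make_slot(11), make_slot(12)]
    # txg 13 overwrites the oldest slot, but the write is torn mid-way.
    ring[0] = torn_write(ring[0], make_slot(13))
    return txg_of(pick_active(ring))  # falls back to the newest intact copy
```

In this model `demo()` returns 12: the torn copy of txg 13 fails its checksum, so the pool state from txg 12 is used. Whether real ZFS behaves this way in every edge case is exactly the question being asked.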
Of course, all of this goes out the window if, on top of not honoring sync, the hypervisor *reorders* writes too. Then, there's really nothing that can save you.
ZFS *should* be OK in these situations, although I'm not 100% certain that there isn't an edge case in there somewhere.
At the end of the day, it's *really* important that drives (real or emulated) honor cache flushes. If they don't, then all bets are off.
Thanks for your reply.
I completely agree with that, and given how crucial this is, I'm always shocked when I see a critical storage software layer not taking this as seriously as it should.
Cheers (& happy new year in 2 days!)