[clue] [tech] Filesystem corruption with VMWare iSCSI initiator and block device translation
Jim Ockers
ockers at ockers.net
Tue Nov 20 12:34:47 MST 2012
Hi CLUEbies,
We had a major filesystem corruption event and I was wondering if anyone
else had experienced something like this or if there is some
good/obvious reason why it happened.
We have a Windows Server 2003 (NTFS5) data volume (not the OS volume) on
an iSCSI target on a Linux OpenFiler server, with Windows running under
VMWare ESXi 5. There are 3 ways to give the Windows VM access to the
iSCSI target volume:
1. Boot the OS in the usual way for its VM, and use the Microsoft
iSCSI initiator to access the target. Through its own initiator, the
OS finds an NTFS5 filesystem and assigns it a drive letter as usual.
2. Configure VMWare to access the target using its iSCSI initiator,
and then configure the VM with the *raw mapped LUN* as another
disk drive. The OS finds a VMWare virtual disk with an NTFS5
filesystem on it. VMWare handles the block device translation
between the virtual disk and the iSCSI target, and the OS has no
knowledge that the actual block device is an iSCSI target.
3. Configure VMWare to access the target using its iSCSI initiator,
and mount the target as a VMWare datastore formatted with the VMFS5
filesystem. The datastore holds a VMWare VMDK virtual disk, which
the VM uses as one of its disk drives. The OS then sees a normal
VMWare virtual disk and has no knowledge of the VMFS5 datastore or
of iSCSI.
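To make the three options easier to compare side by side, here is a minimal Python sketch that simply lists each I/O stack from the guest filesystem down to the OpenFiler target, as described above. The class and field names (`AccessPath`, `layers`, `guest_sees_iscsi`, `describe`) are made up for illustration. It models the layering only and makes no claim about what ESXi actually does internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPath:
    """One way of giving the Windows VM access to the iSCSI volume (illustrative only)."""
    name: str
    initiator: str          # component that logs in to the iSCSI target
    layers: tuple           # I/O stack, from the guest filesystem down to the target
    guest_sees_iscsi: bool  # does Windows itself know the disk is an iSCSI target?


TARGET = "iSCSI target (OpenFiler)"

PATHS = (
    AccessPath(
        name="1. Microsoft iSCSI initiator in the guest",
        initiator="Microsoft iSCSI initiator",
        layers=("NTFS5", "Microsoft iSCSI initiator", TARGET),
        guest_sees_iscsi=True,
    ),
    AccessPath(
        name="2. Raw mapped LUN",
        initiator="ESXi iSCSI initiator",
        layers=("NTFS5", "virtual disk (raw mapped LUN)",
                "ESXi iSCSI initiator", TARGET),
        guest_sees_iscsi=False,
    ),
    AccessPath(
        name="3. VMDK on a VMFS5 datastore",
        initiator="ESXi iSCSI initiator",
        layers=("NTFS5", "virtual disk (VMDK)", "VMFS5 datastore",
                "ESXi iSCSI initiator", TARGET),
        guest_sees_iscsi=False,
    ),
)


def describe(path: AccessPath) -> str:
    """Render a stack as a single 'top -> ... -> bottom' line."""
    return " -> ".join(path.layers)


if __name__ == "__main__":
    # Print each option followed by its stack.
    for p in PATHS:
        print(p.name)
        print("    " + describe(p))
```

In this model, options 2 and 3 share the same ESXi initiator and differ only by the VMFS5 datastore layer sitting between the virtual disk and the initiator.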
We first tried a raw mapped LUN. Things were fine for 2 or 3 days, and
then we started getting massive NTFS data corruption, with no
indication other than NTFS errors in the Windows Event Viewer.
Because the system didn't crash, it ran like this for over a day, and
the backups were corrupted too. CHKDSK made matters worse. We wound up
having to merge two backups together because there were inconsistencies
that required manual resolution. What a pain.
We switched to using the Microsoft iSCSI initiator to access the volume,
and it's been fine for a few days now, with no NTFS errors, corruption,
or data loss that we know of.
The VMDK on a VMFS5 datastore on iSCSI is also problem-free as far as we
can tell.
I was wondering if anyone on this list had any ideas or wild speculation
about why using the VMWare iSCSI initiator and giving the iSCSI target
to the OS as a raw mapped LUN would cause filesystem corruption, whereas
the other 2 options are both trouble-free. Is there some good reason
why the raw mapped LUN approach is not recommended? Is it only bad for
iSCSI, or is it also bad for Fibre Channel etc.?
Obviously we won't be doing this again, but I wish I had some good
reasons why it was so problematic.
Thanks,
Jim
--
Jim Ockers, P.E., P.Eng. (ockers at ockers.net)
Contact info: http://www.ockers.net/