[clue] [tech] Filesystem corruption with VMWare iSCSI initiator and block device translation

Jim Ockers ockers at ockers.net
Tue Nov 20 12:34:47 MST 2012


Hi CLUEbies,

We had a major filesystem corruption event and I was wondering if anyone 
else had experienced something like this or if there is some 
good/obvious reason why it happened.

We have a Windows 2003 (NTFS5) data volume (not the OS volume) on an
iSCSI target on a Linux OpenFiler, with Windows running under VMware
ESXi 5.  There are 3 ways to give the Windows VM access to the iSCSI
target volume:

   1. Boot the OS in the usual way for its VM, and use the Microsoft
      iSCSI initiator to access the target.  The OS, via its own
      initiator, finds an NTFS5 filesystem and assigns it a drive
      letter as usual.
   2. Configure VMware to access the target using its iSCSI initiator,
      and then configure the VM with the raw mapped LUN as another
      disk drive.  The OS finds a VMware virtual disk, and finds an
      NTFS5 filesystem on the disk.  VMware handles the block device
      translation between the virtual disk and the iSCSI target, and
      the OS has no knowledge that the actual block device is an iSCSI
      target.
   3. Configure VMware to access the target using its iSCSI initiator,
      and mount the target as a VMware datastore using the VMFS5
      filesystem.  The datastore holds a VMware VMDK virtual disk, and
      the VM has this VMDK as one of its disk drives.  The OS sees a
      normal VMware virtual disk and has no knowledge of the VMFS5
      datastore or iSCSI.
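
To put the three options side by side, here is a rough Python sketch of
the storage stack in each case.  The class and field names are made up
for illustration, and the layer lists only restate the descriptions
above; they are not taken from VMware or Microsoft documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPath:
    """One way of presenting the iSCSI target to the Windows guest (illustrative model)."""
    name: str
    initiator: str          # which component logs in to the iSCSI target
    layers: tuple           # stack from the guest filesystem down to the target
    guest_sees_iscsi: bool  # does the guest OS know the device is iSCSI?


PATHS = (
    AccessPath(
        name="1. Microsoft initiator in guest",
        initiator="Windows guest",
        layers=("NTFS", "Microsoft iSCSI initiator", "iSCSI target"),
        guest_sees_iscsi=True,
    ),
    AccessPath(
        name="2. Raw mapped LUN",
        initiator="ESXi host",
        layers=("NTFS", "virtual disk (raw mapped LUN)",
                "VMware iSCSI initiator", "iSCSI target"),
        guest_sees_iscsi=False,
    ),
    AccessPath(
        name="3. VMDK on VMFS5 datastore",
        initiator="ESXi host",
        layers=("NTFS", "virtual disk (VMDK)", "VMFS5 datastore",
                "VMware iSCSI initiator", "iSCSI target"),
        guest_sees_iscsi=False,
    ),
)


def describe(path):
    """Render one access path as a single line, top of the stack first."""
    return f"{path.name}: " + " -> ".join(path.layers)


if __name__ == "__main__":
    for p in PATHS:
        print(describe(p))
```

Options 2 and 3 both use the VMware initiator on the host; in this model
they differ only in what sits between the virtual disk and that
initiator.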


We first tried a raw mapped LUN.  Things were fine for 2 or 3 days, and
then we started getting massive NTFS data corruption, with no
indication other than NTFS errors in the Windows Event Viewer.
Because the system didn't crash, it ran like this for over a day, and
the backups got corrupted too.  CHKDSK made matters worse.  We wound up
having to merge two backups together because there were inconsistencies
that required manual resolution.  What a pain.

We switched to using the Microsoft iSCSI initiator to access the volume, 
and it's been fine for a few days now with no NTFS errors or corruption 
or data loss that we know of.

The VMDK on a VMFS5 datastore on iSCSI is also problem-free as far as
we can tell.

I was wondering if anyone on this list had any ideas or wild speculation
about why using the VMware iSCSI initiator and giving the iSCSI target
to the OS as a raw mapped LUN would cause filesystem corruption, whereas
the other 2 options are both trouble-free.  Is there some good reason
why the raw mapped LUN approach is not recommended?  Is it only bad for
iSCSI, or is it also bad for Fibre Channel, etc.?

Obviously we won't be doing this again, but I wish I had some good
reasons for why it was so problematic.

Thanks,
Jim

-- 
Jim Ockers, P.E., P.Eng. (ockers at ockers.net)
Contact info: http://www.ockers.net/
