[CLUE-Tech] ext2 file corruption

Jim Ockers ockers at ockers.net
Thu Dec 25 09:02:02 MST 2003


Hi Jed,

> I know this sounds whacko. But I'm 99% certain I'm getting some
> file-corrupting behavior on an ext2 filesystem. Regrettably, I have really
> nothing much to go on here, as the last fsck on the filesystem passed fine
> -- but I'll do another shortly.

It's not the ext2 filesystem.  Tell us about your block device.  If
it's IDE, what is the motherboard chipset?  What kind of IDE cable
are you using?  What kernel version, and does it properly support
your motherboard's chipset?  What kind of hard drive?  How old is
all the hardware?

> I happened upon this by accident, as I was cleaning some things out of my
> home directory. I opened a file in vi to see what it was, and found the
> contents of two files in it. Due to the divergent content, there's just no
> way that I would have appended them, and besides, the older content is
> below the newer. Of course, this now has me concerned.
> 
> I had thought that ext2 was rock solid.

It is.  You can be sure that the filesystem corruption is being caused
by some element of the hardware acting incorrectly.

I have a 3ware Escalade IDE RAID controller that was causing filesystem
corruption on an ext3 filesystem.  I posted to the list about this
problem some time back but got no responses.  I had to turn off the
RAID controller's write cache and since that time I haven't had any
corruption of any sort.

Of course it's much, much slower than it should be, and I'm probably
going to switch to software RAID since the 3ware card isn't fit for
purpose - it's too slow.  (I was using RAID 1, mirroring, on the 3ware.)

I had files that were concatenated, files that had bits flipped or
added, and other weird stuff.  For example I couldn't compile the
linux kernel source when the 3ware card was acting up, because things
would get changed randomly in the files.  One char declaration
in a C source file got changed to #har .  That won't compile.  I did
some ASCII lookups and found that 'c' is 0x63 and '#' is 0x23.  In
binary:

c = 1100011
# = 0100011

As you can see, the most significant of the seven ASCII bits (bit 6,
value 0x40) got flipped from a 1 to a 0, corrupting the file and
causing the kernel to fail to compile.
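
The arithmetic above can be double-checked with a few lines of Python.
This is only a sketch, not something I ran at the time. Python is an
arbitrary choice and the helper name flipped_bits is made up for
illustration. XOR-ing the two character codes leaves a 1 in every
position where they differ:

```python
def flipped_bits(a, b):
    """Return the bit positions (0 = least significant) where two
    single characters differ. Hypothetical helper, for illustration."""
    diff = ord(a) ^ ord(b)  # XOR sets a 1 wherever the two codes disagree
    return [i for i in range(8) if diff & (1 << i)]


original, corrupted = "c", "#"

# Show both characters as hex and as zero-padded 8-bit binary.
for ch in (original, corrupted):
    print(f"{ch!r} = 0x{ord(ch):02x} = {ord(ch):08b}")

print("differing bit positions:", flipped_bits(original, corrupted))
```

A single differing position is what you would expect from a one-bit
error somewhere in the data path, as opposed to a file being
overwritten with unrelated content.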

This was NOT the filesystem code causing this corruption - it was the
block device or device driver.  Needless to say, fsck returned various
errors of varying severity while the block device was malfunctioning.
You can't trust the output or results of fsck if the underlying 
hardware can't be trusted.

Since I turned off the 3ware 7506 RAID controller's write cache I have
not had a single instance of file corruption, and the kernel compiles
normally.  (The 3ware driver is 3w-xxxx.o)

I had to recompile the kernel, of course, to put in the most recent
drivers from 3ware - their tech support refused to help me until I
was running the most recent driver.  Needless to say I couldn't
compile the kernel on the system since the files got more and more
corrupt as time went on.

Hope this helps,
Jim

-- 
Jim Ockers, P.Eng. (ockers at ockers.net)
Contact info: please see http://www.ockers.net/


