[clue-tech] Bad IDE hardware + compactflash, what to do?

Jim Ockers ockers at ockers.net
Sat Dec 17 11:52:21 MST 2005


Hi Dave,

Thanks for the thoughtful response.

> Sorry that I don't have any specific answers for your problem.  But...
> 
> Jim Ockers wrote:
> [...]
> > Unfortunately our motherboard has the DMA pin disconnected on the
> > compactflash socket, so DMA is not an option.  The boards are 
> > expensive and we have 2,000 of them.  The board has Via 82c596b 
> > IDE chipset.
> 
> What changed?  Surely you didn't deploy 2,000 boards that don't work in 
> your application.  Why is this a problem now and not when you were 
> prototyping?

Nobody noticed that this was a problem.  That said, it's not a super
serious problem, but when we tried to add some new functionality we
noticed that the firmware upgrade download was taking much longer
than it should.  That's when we noticed that data loss on the serial
port correlates with writing the data to disk.

Our customers' time costs thousands of dollars an hour, so it's bad
form for us to make them wait longer than necessary because of a
defect in our system when a software upgrade/fix is needed.

> > 3. Hack the kernel's IDE device driver.  This seems like it would
> > be a bit of work.  If the hdparm -u1 doesn't have any effect maybe
> > this is a hardware bug not a software bug.  Anyone know how to do
> > this or what we could look for?  Any easy tweaks we could make?
> > 
> > 4. Look at realtime linux (rtlinux) or different IRQ scheduler
> > algorithms.  I don't know anything about this, does anyone on the
> > list?  Is it likely we could work around this problem with a real-
> > time kernel?
> 
> Can you determine why the interrupt takes so long to reset?  My 
> understanding is that when an interrupt happens an ISR gets run.  As 
> soon as it's "safe" the interrupt is reset (don't know whether unmasking 
> can happen before it's "safe").  So it seems like there's a bit of code 
> that takes too long to get done.  If slow CF means it isn't safe to 
> reset for 80ms then you seem to be out of luck.

How would we determine this?  We can see the lengthy interrupts
using the oscilloscope but I'm not sure how to tell which part of
the code is taking too long to return.  I suspect it's something
in the hardware.

Not all IDE interrupts are 80ms, just some of them.  Whenever a
long IDE interrupt happens we lose data on the serial port.
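
For a sense of scale, here is a rough sketch of how many serial bytes
could go missing during one long interrupt.  It assumes the serial
receive interrupt cannot be serviced at all while the IDE interrupt is
in progress.  The 115200 baud rate, 8N1 framing, and 16-byte receive
FIFO are illustrative assumptions, not figures from our board, and the
function names are made up for this example.

```python
# Back-of-the-envelope estimate of serial data lost while the serial ISR
# is blocked. All defaults are ASSUMPTIONS for illustration:
#   - 8N1 framing, i.e. 10 bits on the wire per byte
#   - a 16-byte receive FIFO that nothing drains during the stall

def bytes_arriving(baud, blocked_ms, bits_per_byte=10):
    """Bytes that arrive on the wire during a stall of blocked_ms milliseconds."""
    return baud * blocked_ms / (bits_per_byte * 1000)


def bytes_lost(baud, blocked_ms, fifo_depth=16, bits_per_byte=10):
    """Whole bytes that overflow the FIFO if it is not drained during the stall."""
    arriving = int(bytes_arriving(baud, blocked_ms, bits_per_byte))
    return max(0, arriving - fifo_depth)


def max_safe_block_ms(baud, fifo_depth=16, bits_per_byte=10):
    """Longest stall (in ms) the FIFO can absorb before it overflows."""
    return fifo_depth * bits_per_byte * 1000 / baud
```

Under those assumptions, bytes_lost(115200, 80) comes to 905 and
max_safe_block_ms(115200) is about 1.4, so an 80ms stall is far beyond
what a small FIFO can absorb.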

> Perhaps this depends on the size of the data transfer and splitting it 
> up somehow would help?

Our workaround is to download the huge file to a ramdisk then copy
it to the flash after it's downloaded.  Obviously the ramdisk does
not generate IDE interrupts, so we never lose packets during the
download.
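
The ramdisk workaround can be sketched roughly as below.  This is a
minimal illustration, not our actual download code: the function name,
the file names, and the ".part" temporary suffix are all hypothetical,
and "chunks" stands in for whatever delivers data from the serial link.

```python
import os
import shutil


def staged_download(chunks, ram_dir, flash_path, name="upgrade.img"):
    """Receive into a RAM-backed directory, then copy to flash in one pass.

    chunks     -- iterable of bytes objects (hypothetical stand-in for the
                  serial download)
    ram_dir    -- directory on a ramdisk/tmpfs, so writes cause no IDE traffic
    flash_path -- final destination on the CompactFlash filesystem
    """
    staging_path = os.path.join(ram_dir, name)

    # Phase 1: the time-critical download touches only RAM.
    with open(staging_path, "wb") as staging:
        for chunk in chunks:
            staging.write(chunk)

    # Phase 2: the link is idle now, so slow flash writes cannot cost us data.
    tmp_path = flash_path + ".part"
    with open(staging_path, "rb") as src, open(tmp_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
        dst.flush()
        os.fsync(dst.fileno())  # push the data out to the device

    # Rename last, so a partial copy never appears under the final name.
    os.replace(tmp_path, flash_path)
    os.remove(staging_path)
    return flash_path
```

A call would look something like
staged_download(serial_chunks, "/dev/shm", "/mnt/flash/firmware.img"),
where both paths are placeholders for a RAM-backed directory and the
CompactFlash mount point.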

However, as a result of these findings, some of the senior hardware
people have said some Very Nasty Things about the Linux kernel, and
I'd like to prove them wrong, but I'm not sure how.

Things like how large parts of the kernel and OS were written by
teenagers living in their parents' basement.  I'm sure Andre Hedrick
would take exception to this characterization.  :)

> Maybe the board vendor or the kernel maintainers could help you identify 
> a fix, if you can figure out the cause of the latency.

Can you suggest what we could do to figure out the cause?

Thanks,
Jim

-- 
Jim Ockers, P.Eng. (ockers at ockers.net)
Contact info: please see http://www.ockers.net/
_______________________________________________
CLUE-tech mailing list
CLUE-tech at cluedenver.org
http://cluedenver.org/mailman/listinfo/clue-tech