[clue-tech] Bad IDE hardware + compactflash, what to do?

Jim Ockers ockers at ockers.net
Thu Dec 15 15:31:01 MST 2005


Hi Cluebies,

I'm not sure what to do about this problem.  I'm hoping someone
on the list has some ideas about how to fix it.  This e-mail is 
long but thorough.

We have a VIA motherboard with an IDE interface and a compactflash
socket that makes CF cards act like IDE disk drives (i.e., /dev/hdd).
The IDE controllers are on IRQ14 (primary) and IRQ15 (secondary).

A definition: "interrupt latency" is the length of time between
when a device (hardware) raises its IRQ line to request interrupt
service from the CPU, and when the device driver's interrupt service
routine lowers the IRQ line after servicing the interrupt.  I'm told
normal interrupt latency is 20us or so.

We have observed that the IRQ15 interrupt line stays high for periods
of up to 80ms (note ms, not us, so this is Really Bad).  During this
time the serial UART is unable to get its interrupt serviced (and its
FIFO emptied), so it throws away data and logs buffer overrun errors.
When pppd at layer 2 gets the frames, they are corrupted and the FCS
doesn't match, so those frames are discarded.  The PPP serial link
runs at 921600 bps, so quite a few bits get thrown away even in a
few ms.
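
To put rough numbers on that, here is a back-of-the-envelope sketch.
It assumes 8N1 framing (10 bits on the wire per byte) and a 16-byte
receive FIFO as on a 16550A-class UART; neither is confirmed above,
and the function names are just for illustration.  Under those
assumptions about 7,400 bytes arrive during an 80ms stall, and an
empty FIFO fills in roughly 174us.

```python
# Rough estimate of how much serial data arrives while the UART's
# interrupt cannot be serviced.
# Assumptions (not measured): 8N1 framing, i.e. 10 bits on the wire per
# byte, and a 16-byte receive FIFO as on a 16550A-class UART.

BITS_PER_BYTE_ON_WIRE = 10   # assumed: 1 start + 8 data + 1 stop bit
FIFO_DEPTH_BYTES = 16        # assumed: 16550A-class UART


def bytes_arriving(baud, seconds):
    """Bytes received on the wire during `seconds` at `baud` bits/s."""
    return baud * seconds / BITS_PER_BYTE_ON_WIRE


def fifo_fill_time_us(baud, depth=FIFO_DEPTH_BYTES):
    """Microseconds for an empty FIFO of `depth` bytes to fill."""
    return depth * BITS_PER_BYTE_ON_WIRE / baud * 1e6


if __name__ == "__main__":
    baud = 921600
    stall = 0.080  # 80 ms, the worst case observed on IRQ15
    print("bytes arriving during stall: %.0f" % bytes_arriving(baud, stall))
    print("FIFO fill time: %.1f us" % fifo_fill_time_us(baud))
```

So any stall much longer than the FIFO fill time loses data, and an
80ms stall is several hundred times longer than that.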

If we enable DMA transfers on the compactflash IDE device, then
there are no IDE IRQs, and the serial data transfer occurs normally
with no discarded frames.  (We verified this with a compactflash
to IDE adapter, connecting it to the primary IDE interface so
the CF was /dev/hda.)  Normally there's nothing on the IDE bus
except for the CF at /dev/hdd (secondary slave).  We can't change
that assignment, even in the BIOS; it's hardwired that way on the
board.

Unfortunately our motherboard has the DMA pin disconnected on the
compactflash socket, so DMA is not an option.  The boards are
expensive and we have 2,000 of them.  The board has a VIA 82c596b
IDE chipset.

 VP_IDE: VIA vt82c596b (rev 23) IDE UDMA66 controller on pci00:07.1
    ide0: BM-DMA at 0xfc00-0xfc07, BIOS settings: hda:pio, hdb:pio
    ide1: BM-DMA at 0xfc08-0xfc0f, BIOS settings: hdc:pio, hdd:DMA

So, what can I do to make it so the IDE hardware or device driver
interrupt handler (ISR == interrupt service routine) is more
cooperative with the rest of the system?  Here are some ideas
we've had:

1. hdparm -u1 /dev/hdd ( hdparm -u: get/set unmaskirq flag (0/1) )
Unfortunately this has minimal effect.  In fact there is almost no 
visible effect, even though it's supposed to fix this exact problem.
From the hdparm man page:

       -u     Get/set interrupt-unmask flag for the drive.  A setting of 1 permits the
              driver to unmask other interrupts during processing of a disk interrupt,
              which greatly improves Linux's responsiveness and eliminates "serial port
              overrun" errors.  Use this feature with caution: some drive/controller
              combinations do not tolerate the increased I/O latencies possible when
              this feature is enabled, resulting in massive filesystem corruption.  In
              particular, CMD-640B and RZ1000 (E)IDE interfaces can be unreliable (due
              to a hardware flaw) when this option is used with kernel versions earlier
              than 2.0.13.  Disabling the IDE prefetch feature of these interfaces
              (usually a BIOS/CMOS setting) provides a safe fix for the problem for use
              with earlier kernels.

We observed the IDE and serial interrupt latencies with an
oscilloscope while triggering on IDE interrupts longer than 50us.
There was very little difference between -u0 and -u1.  By the way,
we are using kernel 2.4.22.
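
For a software-side cross-check of the scope readings, the -u0 and
-u1 runs could also be compared by counting UART overruns.  The
sketch below is only an illustration: it assumes the serial driver
exposes per-port counters in /proc/tty/driver/serial with an "oe:"
(overrun error) field, as in the sample line in the comment.  I have
not verified that format on this 2.4.22 build, so check it on the
target first.  The helper names are made up, and since a Python
interpreter on the target is also an assumption, the file contents
can just as well be copied off and parsed elsewhere.

```python
import re

# Assumed line format in /proc/tty/driver/serial, for example:
#   0: uart:16550A port:3F8 irq:4 baud:921600 tx:120 rx:4500 oe:17 RTS|DTR
# The "oe:" field is assumed to appear only once overruns have occurred.
PORT_RE = re.compile(r"^(\d+):\s+uart:(\S+)")
OE_RE = re.compile(r"\boe:(\d+)")


def overrun_counts(text):
    """Return {port_number: overrun_count} parsed from the proc file text."""
    counts = {}
    for line in text.splitlines():
        m = PORT_RE.match(line.strip())
        if not m or m.group(2) == "unknown":
            continue  # header line or unpopulated port
        oe = OE_RE.search(line)
        counts[int(m.group(1))] = int(oe.group(1)) if oe else 0
    return counts


def overrun_delta(before, after):
    """Per-port increase in overruns between two snapshots."""
    return {port: after[port] - before.get(port, 0) for port in after}
```

Take one snapshot before and one after a fixed-length transfer with
-u0, repeat with -u1, and compare the two deltas.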

2. Retrofit thousands of embedded devices with an IDE compactflash
adapter that enables DMA.  WAY too expensive, and we don't have
them all here anyway.  It would take years, or hundreds of thousands
of dollars, to touch them all.  Software upgrades are much easier
since we have network access to them via satellite links.

3. Hack the kernel's IDE device driver.  This seems like it would
be a bit of work.  If hdparm -u1 doesn't have any effect, maybe
this is a hardware bug, not a software bug.  Does anyone know how to
do this or what we could look for?  Any easy tweaks we could make?

4. Look at realtime Linux (RTLinux) or different IRQ scheduling
algorithms.  I don't know anything about this; does anyone on the
list?  Is it likely we could work around this problem with a
real-time kernel?

Thanks,
Jim

-- 
Jim Ockers, P.Eng. (ockers at ockers.net)
Contact info: please see http://www.ockers.net/
_______________________________________________
CLUE-tech mailing list
CLUE-tech at cluedenver.org
http://cluedenver.org/mailman/listinfo/clue-tech


