[CLUE-Tech] RAID 1 on Linux

Nate Duehr nate at natetech.com
Wed Oct 20 03:16:45 MDT 2004


On Oct 19, 2004, at 7:41 PM, Carl Schelin wrote:

> Ok, question on RAIDing an existing linux system.
>
> A little background first. I'm mainly a Solaris guy.
> I've installed RAIDs on Sun boxes many times. The
> procedure is simple. Create the raid using the
> existing slices and add the new slices to the new
> raid. They synchronize and it's working.

Yeah, I was mainly a non-Solaris guy until the last couple of years, 
and I must say, software RAID on Solaris is pretty simple to deal with.

> In a misguided attempt, and because I didn't see the
> RAID option the previous times I've installed Mandrake
> (under the expert menu), I installed Mandrake 9.1 on
> an 80 gig seagate.

Done that.

> After futzing around with mdadm and raidtools, the
> second disk was so fscked up, I had to use dd to fix
> it (I dd'd the good hda over hdc). Finally I was able
> to do a fresh install and found the RAID options under
> the expert menu.

Done that too.  ;-)

> All of the documentation I've seen appears to show that
> the only way to make an existing system into RAID 1
> is to back it off, install to RAID and restore the
> data.

Nahh, you can make a RAID1 out of an existing partition.  See below.

> Does anyone have a pointer to a document that debunks
> this? Can I, in fact, add a second disk and make the
> system RAID 1, or do I have to back it off and
> reinstall?

I finally figured out most of this from an article in Sysadmin magazine 
about it.  Unfortunately I don't think this particular article is 
available online anywhere.

> Just so you know, I've read the Managing RAID on LINUX
> book (three years out of date), the Software Raid
> HOWTO over at unthought, the Quick Software RAID over
> at linuxhomenetworking, the kernel raid list (just
> poking around in the archives) and even the various
> man pages for mdadm and mkraid.

I'd have to agree here -- I had some questions early on, and no one 
seems to have been able to find time to update much in the way of docs. 
It'd be a good project for a Hacking Society meeting if I weren't 
working until 9 PM every night on my new schedule.  (And getting me up 
early in the morning to write docs just isn't ever going to happen.  
Heh.)

> Of course if it's in one of these documents, please
> point me at the right section.
>
> Thanks for any pointers.

Carl, one of the other folks is right in their hunch.  You can create a 
RAID1 with a failed member directly.

So you create a new "RAID'ed" filesystem on the new disk, configured 
with your original partition on the good disk as a failed member.  You 
mount the "RAID" (in quotes because it's really only the new disk at 
this point), copy the files over, and edit fstab to use the RAID for 
that filesystem.  Then you either remount or reboot (depending on which 
filesystem you're talking about here), and finally "repair" the RAID by 
hot-adding the original partition back in.
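
Roughly, with raidtools, the setup looks something like this.  These 
are just example devices based on my hda/hdc naming -- hda1 is the 
original partition on the good disk, hdc1 the matching partition on the 
new disk -- and the mount point and paths are placeholders, so adjust 
for your own layout:

  # /etc/raidtab -- /dev/md0 built degraded; the original partition
  # is listed as the failed member for now
  raiddev /dev/md0
      raid-level              1
      nr-raid-disks           2
      nr-spare-disks          0
      persistent-superblock   1
      chunk-size              4

      device                  /dev/hdc1    # new disk
      raid-disk               0
      device                  /dev/hda1    # original data, added later
      failed-disk             1

  mkraid /dev/md0               # builds the degraded array; only the
                                # new disk is actually written to
  mke2fs -j /dev/md0            # new filesystem on the array
  mount /dev/md0 /mnt/newfs
  cp -ax /home/. /mnt/newfs/    # or whatever filesystem you're moving
  # ...then point that filesystem's fstab entry at /dev/md0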

When you first boot/remount, the system uses the "RAID" and it comes up 
in degraded mode.  You stop and check everything carefully at this 
point, and after you're darn sure all your data is there (you of COURSE 
made backups before starting all this, right?  GRIN...) you can then 
hot-add the "failed" partition (your original data partition) to the 
RAID1, and it'll synchronize up and be happy.  Once the sync has 
started, you also go into the raidtab configuration and tell it that 
disk is no longer a failed member.  The key to this is that when 
setting up the RAID initially, you use the "failed-disk" nomenclature 
in your raidtab instead of the "raid-disk" tag: "raid-disk" for the new 
drive, "failed-disk" for the old.  Kinda scary the first time you do 
it, because you're not sure if it's going to fiddle with that good disk 
you're running from.  Best to practice first on an unused but mounted 
and formatted partition with some data in it on the "good" disk, if you 
have one.
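
The hot-add itself is a one-liner.  With the same example devices, 
either of these does it (raidhotadd is from raidtools; mdadm's manage 
mode does the same thing):

  raidhotadd /dev/md0 /dev/hda1     # raidtools
  mdadm /dev/md0 --add /dev/hda1    # or the mdadm equivalent
  cat /proc/mdstat                  # watch the resync kick off
  # then flip "failed-disk" to "raid-disk" for hda1 in /etc/raidtab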

One word of caution here: if the two physical disks are not the exact 
same geometry, make VERY sure your new partitions are ever so slightly 
smaller than the partitions you're starting with.  If you attempt to 
hot-add the original partition and it's even a few blocks smaller than 
the "degraded RAID" partition, you'll immediately get a failure message 
saying the partition you're adding is too small.
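
A quick sanity check before you build anything is to compare block 
counts on the two partitions (example devices again):

  sfdisk -s /dev/hda1    # size of the original partition, in 1K blocks
  sfdisk -s /dev/hdc1    # this one should be equal or slightly smaller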

The article in Sysadmin also showed how to layer LVM on top of the 
RAIDs.  I didn't really feel the need to go that far, but it was a 
nifty idea: you could resize everything on the fly, at the cost of huge 
overhead.
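
For reference, the layering is just the usual LVM commands pointed at 
the md device instead of a raw partition.  A minimal sketch, with a 
made-up volume group and logical volume name:

  pvcreate /dev/md0
  vgcreate vg0 /dev/md0
  lvcreate -L 10G -n home vg0
  mke2fs -j /dev/vg0/home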

Once your sync is going, you can cat /proc/mdstat to see how it's doing 
and do reboot tests or whatever when it's all done.
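
During the resync/recovery the output looks roughly like this (numbers 
made up):

  Personalities : [raid1]
  md0 : active raid1 hda1[2] hdc1[0]
        40064192 blocks [2/1] [U_]
        [===>.........]  recovery = 18.3% (7332864/40064192)
        finish=26.1min speed=20841K/sec
  unused devices: <none>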

This works beautifully for non-"/" filesystems.  "/" is a bit harder -- 
you have to reconfigure your bootloader to use the md device, make sure 
your kernel supports it, etc.  And ultimately you're really only 
booting off of one disk, so you need to add entries to your boot menu 
for booting from the other disk for times when you have a real disk 
failure.  And if you're using an initrd with your kernel, you have to 
make sure it's remade too, so everything uses the md device at boot.  
You also MUST edit the partition table with fdisk or your other 
favorite partition tool and change the partition type to Linux RAID 
autodetect if you want the kernel to use it at boot time.
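
The type change itself is quick with fdisk -- type "fd" is "Linux raid 
autodetect".  Again using hdc as the example disk:

  fdisk /dev/hdc
    t     # change a partition's type
    1     # partition number
    fd    # Linux raid autodetect
    w     # write the table and exit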

Here's the rub, though -- from testing I read on some of the Debian 
mailing lists by folks like Russell Coker (who wrote bonnie++), there's 
NO intelligence about read performance in the Linux kernel's software 
RAID-1 on 2.4 kernels.  It *always* reads from a single disk, and 
writes to both.  It gives you zero performance gain for reads, which a 
lot of Solaris admins would expect to see from their much more mature 
software RAID implementation.  You just get the redundancy and a speed 
penalty on writes.  At one point he did some really wacky tests, like 
RAID1 across an internal IDE disk and an external USB v1 disk -- the 
kernel would sometimes pick the hideously slower external disk as the 
one it was mainly working from and do all reads from the USB disk, even 
though a much faster DMA-enabled disk was sitting there doing virtually 
nothing in the RAID1 array.  That's how I read his test data, anyway.

So you make your system slower and gain some data redundancy.  As 
someone put it recently -- anyone who wants to be a Linux kernel 
superstar and make a name for themselves could fix Linux RAID 1 in the 
kernel right now.  That's paraphrased from a quote I saw in a magazine 
from one of the kernel developers about RAID-1 support.

Personally, I found the performance hit on one of my busier machines 
wasn't worth it, so I switched from software RAID-1 back to rsync'ing 
periodically to the second drive and to another machine across the 
network.
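
The rsync setup is nothing fancy -- a cron job along these lines, with 
the paths and hostname obviously being placeholders:

  # local copy to the second drive
  rsync -aHx --delete / /mnt/disk2/
  # and a copy across the network to another box
  rsync -aHx --delete -e ssh / backuphost:/backups/thisbox/

The -x keeps rsync on one filesystem, so it won't wander into other 
mounts or try to copy the backup into itself.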

My experience with disk failures and Linux software RAID was not good 
either.  I sat and watched a drive fail in my RAID-1 server one night -- 
the kernel messages clearly showed it throwing hardware errors -- yet 
software RAID never tagged it as "bad" in any way, and during the next 
system reboot (bad idea on my part) software RAID somehow decided the 
disk with the ERRORS was the good disk and started syncing the bad data 
to the good disk.  (Definitely my fault; I forgot to tag the drive bad 
myself.)  Thank goodness for backups.  I'm definitely NOT impressed 
with Linux kernel RAID-1.
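
For what it's worth, tagging the dying member bad by hand before any 
reboot is just a couple of commands.  With hdc1 as the example failing 
member:

  raidsetfaulty /dev/md0 /dev/hdc1     # raidtools
  raidhotremove /dev/md0 /dev/hdc1
  # or the mdadm way:
  mdadm /dev/md0 --fail /dev/hdc1 --remove /dev/hdc1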

Supposedly the kernel RAID-5 code is much more mature and gets more 
effort from developers -- that's what I found out while researching 
that lovely "let's sync the bad data to the good disk" episode to see 
if it was common.

I don't keep up on the "latest and greatest" Linux kernels, so my 
experience was on a late 2.4 series kernel.  Perhaps someone 
kindhearted has been working on this in the later 2.6 kernels and the 
performance issues are better.  Your best bet would be to do some 
performance tests in your environment with your kernel, if possible.
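
Even a crude read-timing comparison will tell you something, though a 
single sequential stream won't show much benefit even if read balancing 
works; bonnie++ on the mounted filesystem is a better test.  The mount 
point and sizes here are just placeholders:

  # raw sequential reads: mirror vs. one member
  dd if=/dev/md0  of=/dev/null bs=1M count=512
  dd if=/dev/hda1 of=/dev/null bs=1M count=512

  # more realistic numbers from bonnie++
  bonnie++ -d /mnt/raidtest -s 1024 -u nobody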

Hope this helps.

--
Nate Duehr, nate at natetech.com



