[clue-tech] [Fwd: [clue-admin] Advice admin newb on doing a System
backup]
Nate Duehr
nate at natetech.com
Mon Nov 6 14:00:28 MST 2006
mike havlicek wrote:
> It seems that Sun Microsystems fixed this problem in
> June 2006. The example that I am alluding to is the bit
> layout of UFS. Speaking of Sun, the bits are "flipped"
> between x86 and 64-bit UltraSPARC UFS filesystems.
It isn't just Sun, and not just backups. Big-endian and little-endian
problems abound in various ports of Linux on different hardware.
:-)
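For anyone who hasn't hit it firsthand, here's a minimal Python sketch
(nothing Sun ships, just an illustration) of what "flipped" bits really
means: the same four bytes on disk decode to two different numbers
depending on which byte order the reading machine assumes. Filesystem
metadata written on one architecture and read raw on the other goes
wrong in exactly this way.

    import struct

    # The same four bytes, as they might sit in some on-disk metadata field.
    raw = bytes([0x00, 0x01, 0x02, 0x03])

    # How a big-endian (SPARC) machine interprets them...
    big = struct.unpack(">I", raw)[0]
    # ...versus a little-endian (x86) machine.
    little = struct.unpack("<I", raw)[0]

    print(hex(big))     # 0x10203
    print(hex(little))  # 0x3020100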
> It is no problem unless you are doing bare metal
> restore. It is not isolated to Sun. At least Sun
> recognizes the problem and are working on it with ZFS.
Lots of people *recognize* the problem; very few have solutions that
work every time and are completely hands-off.
:-)
> I agree with one of the earlier posters that "a backup
> is only as good as the restore". I have argued this
> for years. You can backup until the cows come home.
> But if you can't restore ... you are still shit out of
> luck.
Uh-huh. Getting the bosses to agree to downtime to TEST the recovery
plan can be harder than actually doing it, too. Or getting them to buy
off-line (i.e. NOT MAKING MONEY) hardware to do the tests on.
:-)
Layers 8 and 9 of the OSI model, Religion and Politics! Those layers
have screwed up more good technical projects over the years than I can
count on two hands.
> It might seem reversed but you do have to look at the
> restore plan before you design the backup.
You can look at them as a whole system... no need to do anything in any
particular order, as long as you design and test the system as a whole,
so you're confident that at 2AM you can recover from X, with X being
whatever types of failures you wish to survive. Most large
organizations wimp out on backup and recovery technique and run so much
physical disk redundancy with RAID that recovering from a backup is
nearly a moot point nowadays.
NAS, RAID, etc... all developed to scratch the itch of "zero-downtime".
Then it's all a cost/benefit analysis to determine just how much
redundancy you can afford.
Of course, having a written plan (even if only for yourself) to make
sure you cover all the basics of backup and recovery for your
environment might drive you to design each part in a particular order...
and skipping any particular part (backing up but never actually
attempting a restore to test it) leaves you with your butt hanging in
the wind later, sometimes more than people realize.
Real-world example: This weekend was "saved" for one of my customers by
simple "metadisk" Solaris mirrors -- the box and physical disks were in
such serious trouble that I was amazed the box was even running, but it was.
The data was moved over the network to a "new" box (actually dug out of
physical storage in a closet, I've heard), and the "production" traffic
was swung to the "new" machine later, when they could take the
downtime. In between, they had the "warm fuzzy" that they could (in
theory) swing over at any time; luck held out, the original box stayed
up, and they were able to wait a couple of days until they could
schedule the downtime and do the cut-over. Good for them!
Would it have been easier/faster with NAS or a SAN or perhaps even
AUTOMATED? Yeah.
Was it worth that for the single major failure in about six to seven
years? Probably not... but maybe.
That's the type of semi-technical, semi-business-continuity question
that technical managers MUST learn to answer UP-FRONT.
The most common progression: company management ignores the issue until
the first failure, then adds a little redundancy (since it wasn't
budgeted for that year)... then a more serious failure happens and they
add a little more, at even greater expense... then the next failure is
so catastrophic (the backups weren't off-site and weren't working) that
the system is reloaded from scratch, a multi-month meeting schedule and
project timeline are handed to a project manager, and the cycle
continues...
And then there are always "Acts of God"... when the flood waters rise
and fill your data center, or the water heater three floors up in the
high-rise lets go and cascades down through your server room, or the
power goes off for multiple days... there are always the one-offs. The
sysadmin is there to "fly the plane" when the hydraulics fail, an
engine is on fire, and the passengers are screaming...
Today, a good sysadmin with good communication skills and an
understanding of how business and cash flow work can usually convince
management to spend enough money on redundancy that the admin can sleep
through the night for about two years without a failure so bad they
have to get up and go fix it in the middle of the night. That seems to
be the price breakpoint for most big systems.
During those "other" times, you're then on your own with a bottle of
Jolt or coffee, your brain, a keyboard, and a pile of really messed up
data on some seriously broken spindles and hopefully a few tapes that
were shot recently.
:-)
(Replace Jolt/Coffee with a good Single-Malt Scotch if you can get away
with it and you trust yourself that much! I don't, but I've met a few
who do! Those stories will have to wait for my memoirs... yeah, as if
anyone would read THOSE! GRIN...)
> The internal dump and restore work fine for me under
> Red Hat Linux. I also use the same techniques with
> Solaris 9 & 10.
dump and restore have their technical warts too, but if you play with
multiple solutions you learn all of them.
:-)
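If you do go the dump/restore route, at least make the restore side
prove itself. Here's a minimal sketch of the idea in Python -- the
device and archive paths are hypothetical placeholders, and it assumes
the standard Linux dump package is installed:

    import subprocess

    FILESYSTEM = "/dev/sda1"       # hypothetical: the filesystem to back up
    ARCHIVE = "/backup/root.dump"  # hypothetical: where the archive lands

    # Level-0 (full) dump: -u updates /etc/dumpdates, -f names the output.
    subprocess.run(["dump", "-0u", "-f", ARCHIVE, FILESYSTEM], check=True)

    # Cheap sanity check: list the archive's table of contents. If the
    # header can't be read (wrong byte order, bad media, truncated file),
    # this fails loudly now instead of at 2AM during a real restore.
    subprocess.run(["restore", "-t", "-f", ARCHIVE], check=True)

It's no substitute for a periodic full test restore onto scratch
hardware, but it catches the dumbest failure mode: an archive nothing
can read.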
> Again one does have to be careful with the "bit
> flipping" which is actually an endian problem.
>
> I have found that when I directly connect the tape
> drive and dump I can restore bare metal. Both with
> linux and Solaris.
You're ahead of a lot of people on the standard "Unix system backup
learning curve" then! (GRIN)
> I have heard nonsense like the existence of a global
> filesystem that works for all Unix. I think that is
> bullshit, but if it exists I am very interested.
Ahh, there's probably always something out there that gets "close", but
it's better to just set up your own environment to be recoverable, in a
time-frame appropriate to your particular business, with whatever tools
you can afford.
If you're doing real-time data processing to handle real-time cash-flow
activities, then the budget can usually be extended to cover lots of
redundancy. If you're doing off-line data processing that can handle
one, two, three, ten, thirty days of down-time and still catch up with
no loss of business continuity for your organization, then you won't
have a budget other than your paycheck to come in at 2AM and fix it.
Follow the money. :-) It's all about economics and risk/reward...
we're all human.
> I haven't gotten answers from SAN manufacturers, but I
> think they run into the same problem. Only one
> controlling node defines the filesystems on shared
> disk space. If your kernel "understands" read and write
> on many filesystems then it will work.
There are tools to deal with SAN and NAS and the so-called "universal"
filesystem problem -- newer versions of the Veritas filesystem, for one
example.
Not cheap, but it can be read by big- and little-endian machines alike
(because they created their OWN filesystem) and can be "swung" from one
node to another with relative ease if attached via NAS, SAN, etc... or
you can even swing the physical cables of a drive array from one box to
another and just mount up and ride... if the boxes have the right
hardware interfaces to do such a thing.
Seriously, it's about looking at the overall SYSTEM and determining
NEEDS first -- THEN you design your backup/recovery plan. Do you NEED
to never be paged at 2AM? Would the company lose $400,000 an hour in
downtime? How many users (and often more important... is the CEO in
that userbase?) would be affected by X failing?
(That CEO comment is sad, but oh-so-true... really. It's that Layer 8
and 9 thing again, unless your CEO understands that what they do just
MIGHT not be as important as a data center outage... most don't. Their
desktop e-mail takes priority over systems that make real money for the
company, sadly enough. Hell, getting their Palm to sync might...
depending on their attitude.)
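To put rough numbers on that kind of question, the arithmetic fits on a
napkin. A back-of-the-envelope sketch in Python -- every figure below
is a made-up placeholder, including that $400,000:

    # Hypothetical risk arithmetic: expected downtime loss vs. redundancy cost.
    cost_per_hour = 400_000      # revenue lost per hour down (hypothetical)
    outages_per_year = 0.5       # expected major failures per year (hypothetical)
    hours_no_failover = 6.0      # time to recover from backups (hypothetical)
    hours_with_failover = 0.5    # time to swing to a redundant box (hypothetical)
    redundancy_cost = 250_000    # annual cost of the redundant setup (hypothetical)

    loss_without = outages_per_year * hours_no_failover * cost_per_hour
    loss_with = outages_per_year * hours_with_failover * cost_per_hour
    net_benefit = loss_without - loss_with - redundancy_cost

    print(f"Expected annual loss, no redundancy:  ${loss_without:,.0f}")
    print(f"Expected annual loss, with failover:  ${loss_with:,.0f}")
    print(f"Net benefit of the redundant setup:   ${net_benefit:,.0f}")

With those numbers the redundancy pays for itself several times over;
with a cheaper outage or a rarer failure it might not. That's the whole
cost/benefit question in three lines of multiplication.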
And yes, I've worked for places where the management either simply
didn't know what they needed or wouldn't say, because it would blow that
year's budget and make them look bad in the eyes of higher-ups that they
didn't get their numbers right for that year.
Those types of environments become a lesson in patience and persistence
when a particular level of redundancy is needed and the resources to
make it happen are currently unavailable. You just keep trying, or
maybe in severe cases, you move on. I've put up with systems that paged
out 200 messages an hour, and learned to ignore most of the pages, but
today I have enough experience and time-in that I could probably avoid
working on such crappy setups. Maybe not; you never know in the U.S.
job market! (GRIN)
LUCKILY, storage has become one of the most inexpensive things out
there... raw dumb storage and JBODs are cheap, and with a little
brainpower and luck you can at least back up the things that aren't
backed up. That's generally "step one" in "fixing" an environment if
you're not even backed up yet.
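If that's your situation, even something this crude beats nothing. A
minimal Python sketch of "step one" -- the mount point and directory
list are hypothetical, so substitute your own:

    import subprocess
    import time

    DEST = "/mnt/jbod/backups"              # hypothetical cheap-storage mount
    DIRS = ["/etc", "/home", "/var/named"]  # hypothetical: whatever isn't covered

    # One dated archive per run, so older backups aren't overwritten.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = f"{DEST}/backup-{stamp}.tar.gz"
    subprocess.run(["tar", "-czf", archive] + DIRS, check=True)

    # Same rule as always: a backup you haven't read back isn't a backup.
    subprocess.run(["tar", "-tzf", archive], check=True)

Drop that in cron and you've bought yourself a floor to stand on while
you argue for real tooling.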
Tools like Veritas aren't cheap, and rightly so -- someone (a lot of
someones) somewhere else did all the hard work of designing, creating,
and debugging a system that generally works for you.
But you still ultimately have to be smarter than the computer at the end
of the day, to end up with anything useful from a computer and a pile of
backup tools.
(BIG HUGE GRIN...)
Nate