[clue-tech] [Fwd: [clue-admin] Advice admin newb on doing a System backup]

Nate Duehr nate at natetech.com
Mon Nov 6 14:00:28 MST 2006


mike havlicek wrote:

> It seems that Sun Microsystems fixed this problem in
> June 2006. The example that I am alluding to is the bit
> layout of UFS. Speaking of Sun, the bits are "flipped"
> between x86 and ultra 64 UFS filesystems.

It isn't just Sun, and not just backups.  Big-endian and little-endian 
problems abound in various ports of Linux on different hardware.

:-)
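
If you've never actually watched the byte-flip happen, here's a quick
toy illustration (plain Python struct, nothing UFS-specific -- the value
is just made up):

    import struct, sys

    value = 0x01020304

    big    = struct.pack(">I", value)   # big-endian (SPARC-style) byte order
    little = struct.pack("<I", value)   # little-endian (x86-style) byte order

    print("this host is", sys.byteorder)         # 'little' on x86
    print("big-endian bytes:   ", big.hex())     # 01020304
    print("little-endian bytes:", little.hex())  # 04030201

    # Read the big-endian bytes back assuming native x86 order and you get
    # garbage -- the same thing that happens when an on-disk structure
    # written on SPARC is interpreted naively on x86.
    print(hex(struct.unpack("<I", big)[0]))      # 0x4030201, not 0x1020304

Same bytes on the disk or the tape, two different numbers, depending on
who's doing the reading.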

> It is no problem unless you are doing bare metal
> restore. It is not isolated to Sun. At least Sun
> recognizes the problem and is working on it with ZFS.

Lots of people *recognize* the problem; very few have solutions for it 
that always work and are completely hands-off.

:-)

> I agree with one of the earlier posters that "a backup
> is only as good as the restore". I have argued this
> for years. You can back up until the cows come home.
> But if you can't restore ... you are still shit out of
> luck.

Uh-huh.  Getting the bosses to agree to downtime to TEST the recovery 
plan can be harder than actually doing it, too.  Or getting them to buy 
off-line (i.e. NOT MAKING MONEY) hardware to run the tests on.

:-)
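
If a full recovery-plan test is off the table politically, a scratch
restore plus a checksum sweep is still better than nothing.  Here's a
rough sketch of that idea in Python -- the paths are invented, and it
assumes you've already unpacked last night's backup into a scratch
directory:

    import hashlib
    import os

    def checksums(root):
        """Walk a tree and return {relative_path: sha1} for every regular file."""
        sums = {}
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                if not os.path.isfile(path):   # skip sockets, dangling links, etc.
                    continue
                h = hashlib.sha1()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        h.update(chunk)
                sums[os.path.relpath(path, root)] = h.hexdigest()
        return sums

    # Hypothetical paths: the live data vs. a test restore of last night's dump.
    live     = checksums("/export/data")
    restored = checksums("/scratch/restore-test/export/data")

    missing   = sorted(set(live) - set(restored))
    differing = sorted(p for p in live if p in restored and live[p] != restored[p])

    print("files missing from the restore:", len(missing))
    print("files that differ:             ", len(differing))

Files that legitimately changed since the dump will show up as
"differing", so treat it as a smoke test, not proof -- but it catches
the "the tape was blank all along" surprise before 2AM does.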

Layers 8 and 9 of the OSI model, Religion and Politics!  Those layers 
have screwed up more good technical projects I've worked on over the 
years than I can count on two hands.

> It might seem reversed but you do have to look at the
> restore plan before you design the backup.

You can look at them as a whole system... there's no need to do things 
in any particular order, as long as you design the system as a whole and 
test it as a whole, so you're confident that at 2AM you can recover from 
X, with X being whatever types of failure you want to survive.  Most 
large organizations wimp out on backup and recovery techniques and run 
so much physical disk redundancy with RAID that recovering from a backup 
is a moot point nowadays.

NAS, RAID, etc. were all developed to scratch the "zero-downtime" itch. 
Then it's all a cost/benefit analysis to determine just how much 
redundancy you can afford.
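
The cost/benefit arithmetic itself is trivial -- the hard part is 
getting honest numbers out of anyone.  A toy calculation (every figure 
below is invented purely for illustration):

    # Toy numbers, all invented for illustration.
    revenue_per_hour = 400_000.0         # what an hour of downtime costs the business
    outages_per_year = 1.0               # expected serious failures per year
    hours_down_without_redundancy = 8.0  # restore from tape, rebuild, pray
    hours_down_with_redundancy    = 0.5  # fail over to a mirror or standby box

    redundancy_cost_per_year = 250_000.0   # hardware, licenses, admin time

    cost_without = outages_per_year * hours_down_without_redundancy * revenue_per_hour
    cost_with    = outages_per_year * hours_down_with_redundancy    * revenue_per_hour
    savings      = cost_without - cost_with

    print(f"expected downtime cost, no redundancy: ${cost_without:,.0f}/yr")
    print(f"expected downtime cost, redundant:     ${cost_with:,.0f}/yr")
    print(f"redundancy cost:                       ${redundancy_cost_per_year:,.0f}/yr")
    print("worth it" if savings > redundancy_cost_per_year else "not worth it")

With those made-up numbers the redundancy pays for itself several times 
over; plug in a shop that can eat a day of downtime without losing a 
dime and the answer flips.  That's the whole analysis.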

Of course, having a written plan (even if only for yourself) to make 
sure you cover all the basics of backup and recovery for your 
environment might drive you to design each part in a particular order... 
and skipping any particular part (backing up but never attempting a 
restore to test it) leaves your butt hanging in the wind later, 
sometimes more than people realize.

Real-world example:  This weekend was "saved" for one of my customers by 
simple "metadisk" Solaris mirrors -- the box and physical disks were in 
such serious trouble that I was amazed the box was even running, but it was.

The data was moved over the network to a "new" box (which, I've heard, 
was actually dug out of a storage closet) and the "production" traffic 
was swung to the "new" machine later, when they could take the downtime. 
In between they had the "warm fuzzy" that they could swing over (in 
theory) at any time, and luck held out: the original box stayed up and 
let them wait a couple of days until they could schedule the downtime 
and do the cut-over.  Good for them!

Would it have been easier/faster with NAS or a SAN or perhaps even 
AUTOMATED?  Yeah.

Was it worth that for the single major failure in about six to seven 
years?  Probably not... but maybe.

That's the type of semi-technical, semi-business-continuity question 
that technical managers MUST learn to ask UP-FRONT.

The most common progression is that company management ignores the issue 
until the first failure, then adds a little redundancy (since it wasn't 
budgeted for that year)... then a more serious failure happens and they 
add a little more, at even greater expense... then the next failure is 
so catastrophic -- the backups weren't off-site and weren't working -- 
that the system gets reloaded from scratch, a multi-month meeting 
schedule and project timeline get handed to a project manager, and the 
cycle continues...

And then there's always "Acts of God"... when the flood waters rise and 
fill your data center, or the hot water heater three floors up in the 
high-rise lets go and cascades down through your server room, or the 
power goes off for multiple days... well, there are always the one-offs. 
The sysadmin is there to "fly the plane" when the hydraulics fail, an 
engine is on fire, and the passengers are screaming...

Today, a good sysadmin with good communication skills and an 
understanding of how business and cash flow work can usually convince 
management to spend enough money on redundancy that the admin can sleep 
for about two years without a failure so bad they have to get up and go 
fix it in the middle of the night.  That seems to be the price 
breakpoint for most big systems.

During those "other" times, you're on your own with a bottle of Jolt or 
coffee, your brain, a keyboard, a pile of really messed-up data on some 
seriously broken spindles, and hopefully a few tapes that were written 
recently.

:-)

(Replace Jolt/Coffee with a good Single-Malt Scotch if you can get away 
with it and you trust yourself that much!  I don't, but I've met a few 
who do!  Those stories will have to wait for my memoirs... yeah, as if 
anyone would read THOSE!  GRIN...)

> The internal dump and restore work fine for me under
> RedHat linux. I also use the same techniques with
> Solaris 9 & 10. 

dump and restore have their technical warts too, but if you play with 
multiple solutions you learn all of them.

:-)
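
For what it's worth, the Linux dump package's restore has a compare mode 
(restore -C) that diffs a dump against the mounted filesystem without 
writing anything, which makes a cheap "did the backup actually take?" 
check.  A rough sketch of wrapping it -- the tape device and filesystem 
are just examples, and it assumes a directly-attached drive:

    import subprocess

    TAPE = "/dev/nst0"   # hypothetical non-rewinding tape device

    # Level-0 (full) dump of /home; -u records it in /etc/dumpdates,
    # -f names the output device.
    subprocess.run(["dump", "-0", "-u", "-f", TAPE, "/home"], check=True)

    # Rewind, then let restore's compare mode diff the tape against the
    # live filesystem (it chdirs to the dumped filesystem's root itself).
    subprocess.run(["mt", "-f", TAPE, "rewind"], check=True)
    result = subprocess.run(["restore", "-C", "-f", TAPE])

    if result.returncode != 0:
        print("compare found differences (or failed) -- go read the output")

It's no substitute for actually doing a bare-metal restore now and then, 
but it beats finding out at 2AM that the drive has been quietly writing 
nothing for six months.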

> Again one does have to be careful with the "bit
> flipping" which is actually an endian problem. 
> 
> I have found that when I directly connect the tape
> drive and dump I can restore bare metal. Both with
> linux and Solaris.

You're ahead of a lot of people on the standard "Unix system backup 
learning curve" then!  (GRIN)

> I have heard nonsense like the existence of a global
> filesystem that works for all unix. I think that is
> bullshit, but if it exists I am very interested.

Ahh, there's probably always something out there that gets "close", but 
it's better to just set up your own environment to be recoverable in a 
time-frame appropriate to your particular business, with whatever tools 
you can afford.

If you're doing real-time data processing to handle real-time cash flow, 
then the budget can usually be extended to cover lots of redundancy.  If 
you're doing off-line data processing that can handle one, two, three, 
ten, or thirty days of downtime and still catch up with no loss of 
business continuity for your organization, then you won't have a budget 
for much beyond your paycheck to come in at 2AM and fix it.

Follow the money.  :-)  It's all about economics and risk/reward... 
we're all human.

> I haven't gotten answers from SAN manufacturers, but I
> think they run into the same problem. Only one
> controlling node defines the filesystems on shared
> disk space. If your kernel "understands" read and write
> on many filesystems then it will work. 

There are tools that deal with the SAN and NAS and so-called "universal" 
filesystem problems -- newer versions of the Veritas filesystem, for one 
example.

Not cheap, but it can be read by big- and little-endian machines alike 
(because they created their OWN filesystem) and can be "swung" from one 
node to another with relative ease if attached via NAS, SAN, etc... or 
you can even swing the physical cables of a drive array from one box to 
another and just mount up and ride... if the boxes have the right 
hardware interfaces to do such a thing.

Seriously, it's about looking at the overall SYSTEM and determining 
NEEDS first -- THEN you design your backup/recovery plan.  Do you NEED 
to never be paged at 2AM?  Would the company lose $400,000 for every 
hour of downtime?  How many users (and usually more important... is the 
CEO in that userbase?) would be affected by X failing?

(That CEO comment is sad, but oh-so-true... really.  It's that Layer 8 
and 9 thing again, unless your CEO understands that what they do just 
MIGHT not be as important as a data center outage... most don't.  Their 
desktop e-mail takes priority over systems that make real money for the 
company, sadly enough.  Hell, getting their Palm to sync might... 
depending on their attitude.)

And yes, I've worked for places where management either simply didn't 
know what they needed or wouldn't say, because it would blow that year's 
budget and make them look bad to the higher-ups for not getting their 
numbers right that year.

Those types of environments become a lesson in patience and persistence 
when a particular level of redundancy is needed and the resources to 
make it happen aren't currently available.  You just keep trying, or 
maybe, in severe cases, you move on.  I've put up with systems that 
paged out 200 messages an hour, and learned to ignore most of the pages, 
but today I have enough experience and time-in that I could probably 
avoid working on such crappy setups.  Maybe not -- you never know in the 
U.S. job market!  (GRIN)

LUCKILY, storage has become one of the most inexpensive things out 
there... raw dumb storage and JBODs are cheap, and with a little 
brainpower and luck you can at least back up the things that aren't 
backed up at all.  That's generally "step one" in "fixing" an 
environment that isn't even backed up yet.

Tools like Veritas aren't cheap, and rightly so -- someone (a lot of 
someones) somewhere else did the hard work for you of designing, 
creating, and debugging a system that generally works.

But you still ultimately have to be smarter than the computer to end up 
with anything useful from a computer and a pile of backup tools.

(BIG HUGE GRIN...)

Nate


