I'm not a great car mechanic. I've changed brake pads, replaced hoses and belts, and done other
minor stuff like that, but it's not something I enjoy and, fortunately, it's not a big deal to my ego.
In fact, that's extremely fortunate, because, now that I think about it, I'm probably not even a
good car mechanic--the only thing I can say in my favor is that what little I have
done has worked.
It could be worse. Cars have the great advantage that you can work on them while
they're sitting there parked quietly with the engine off. It'd be a lot more difficult if you had
to do the repairs while you were driving down the highway. I figure that's one reason why surgeons often get paid
more than car mechanics, because they do have to complete their repairs while the
engine is still running.
System administration is somewhere in the middle. The fact that you (usually) don't get covered
in blood or grease is a big bonus, but a lot of the time you're treating a "patient" that needs
to keep running. It's not always possible, and, like automobiles, servers crash sometimes, too.
It's been a rough month out here in the server department, and not all of it has been my fault. A lot
of it has been, though, involving subtle complications of my grand plan to upgrade and replace Nyx a piece
at a time. Sometimes it's involuntary, like the recent simultaneous failure of two hard drives in the
same RAID array on the main news server. Unfortunately, fault tolerance and redundancy only get
you so far, but it does get you far enough to save a lot of time and frustration. Usually.
But not always. If it did, I wouldn't have had to recreate this month's blogulation as many times as I did.
I might even have written something interesting...instead of this entry that you're reading now.
One
of the drawbacks of running Linux is that it doesn't crash frequently. I really should put in some
hard drive monitoring software to detect problems with the RAID arrays before they become
critical, but I haven't done that yet. So, sometimes I'll have a drive fail and it'll be a while before
I notice. This only proved to be a problem twice; once was that recent news machine problem I
just mentioned.
The latest series of server adventures started with
what might have been a type of denial-of-service attack that I hadn't previously been familiar
with. I pulled my hair out for a day trying to figure out what was going on, and at one point I
rebooted the webserver to see if that would fix the problem.
It didn't fix the problem, but it did identify one of the drives in its array as having gone bad. It
had been running for nearly a year since the last time I'd rebooted it, so it actually could have
happened some time ago. If I'd been running Windows IIS, I probably wouldn't ever have to worry
about machines going weeks between reboots, let alone a year or more, but I'm not, so I do.
One complication with RAID arrays is that the controllers I have won't allow me to replace a drive
with one that's smaller, even if it's only by a cylinder or two. Unfortunately (ii), hard drive manufacturers
keep twiddling the sizes of a given model of hard drive, and it often seems like they enjoy making
later revisions of a particular model a tiny bit smaller than the earlier ones. Unfortunately (iii), there's
no good way I've found, for the controllers I'm using, to select less than the maximum size in the
setup menu when creating the array, except by the subterfuge I often resort to of deliberately mixing
up the manufacturers and models of drives within a single array so it'll use the size of the smallest drive
when allocating space.
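The arithmetic behind that subterfuge is simple enough to sketch. An array's usable per-member capacity is that of its smallest drive, and a replacement only "fits" if it's at least that large; mixing models caps the array below every individual drive's full size, leaving slack for later revisions that come up a cylinder or two short. A toy illustration (the sector counts and helper names here are made up for the example; real numbers would come from something like `blockdev --getsize64`):

```python
# Hypothetical sketch of the replacement-drive size check a RAID
# controller effectively performs. Sizes are in sectors.

def array_member_size(drive_sizes):
    """Usable size per member: the array only uses as much of each
    drive as its smallest member provides."""
    return min(drive_sizes)

def replacement_fits(array_drives, candidate):
    """A replacement works only if it's at least the usable member size."""
    return candidate >= array_member_size(array_drives)

# A matched array uses every sector, so a revision two sectors
# smaller is rejected:
matched = [71_687_372, 71_687_372, 71_687_372]
print(replacement_fits(matched, 71_687_370))   # False

# Deliberately mismatched drives cap the array at the smallest one,
# so the same slightly-small replacement still fits:
mixed = [71_687_372, 71_132_959, 71_687_340]
print(replacement_fits(mixed, 71_687_370))     # True
```

Which is the whole trick: waste a little space up front so the controller never gets the chance to refuse a replacement over a couple of sectors.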
But in my webserver, which had been running for about four years solid, I hadn't done that. All the drives
matched...and all the replacement drives of that model which I had on hand were a tiny bit smaller, so it
wouldn't let me swap out the failed drive with any of those. On the fourth such attempt, a second
drive also failed, which meant that the array was no more. I spent a few more hours trying to get it back
to life, even temporarily, but no luck.
And thus began the adventure of rebuilding the webserver. I didn't build a new one from the ground up--which
is what I still plan to do--but at least the old one is now running with new drives.
But it was still flaky, so my next adventure was tracking down external sources of flakiness. I don't guarantee
that what I was seeing was the result of a denial-of-service attack (this was much more subtle--and much less
effective--than the time I'd been mistaken for New York City), but when I'd hardened the network against it,
the webserver weirdness cleared up. I'll accept the uncertainty; I'm not going to let down the shields to see
if things start messing up again.
Next it's on to build another webserver, but the new machine started giving me occasional random hard drive
errors. Things got a lot cleaner after I switched back from the 2.6.7 kernel to 2.4.26, but they still weren't behaving
quite right.
After changing out a whole lot of different components, I eventually discovered that while running the AMI/LSI Logic
MegaRAID Elite 1500 in a 3-volt/64-bit PCI slot may work flawlessly when driving arrays mounted in Sun 711
external enclosures, it occasionally locks up under heavy write operations when driving arrays that are operating in LVD
mode. The Sun Microsystems 711 enclosures are very cute and compact and I like them a lot, but they don't support
LVD operation. Yeah, I've even tried disabling automatic termination and using an external LVD terminator; they
still run single-ended.
But, the Enterprise 1500 in the same slot has no such problem with LVD operation. Same PCB, just
two more channels. Go figure. In a 5-volt/64-bit PCI slot, they both seem to work. One untested variable is that
the 3-volt slots tested were capable of 33/66MHz operation, so it's not absolutely clear whether the relevant
problem was the voltage or the speed. (i.e., do the Elite 1500s only support 33MHz operation but don't cause
the bus to downshift appropriately?) In any event, it doesn't give you any errors or warnings, stop you from
configuring your arrays and formatting them, or anything like that; it just causes the system to lock up
mysteriously at some point when it has a large amount of data to write to the array.
But in the process, I pulled parts (including hard drives) that I knew were working from another
machine--the predecessor to my current main workstation. No problem, I'd thought; my new machine hadn't
had a single problem since I'd set it up three months ago.
Remember how I mentioned up above that I'm not much of a car mechanic and I made the comparison
between system administration and medicine? I'd actually started writing this entry some time back; it's just that
the combination of crashing servers and the like has gotten in the way of my actually finishing anything. There's
some irony in that, back a ways in this timeline, I'd managed to come down with what was certainly the worst
case of food poisoning in my life (no surgery involved, however) and then at about this point in the story--while I was getting
the bugs out of the new server--I got into a car accident. The only other car accident I've had in my life was when
I was a teenager and was stopped at a red light. This time was a little messier; the rain was coming down heavily
and I think I must have hit a broad patch of clay that had washed across the road. Until things came to a sudden
stop, it was like neither the steering wheel nor the brakes had any effect whatsoever. Ooops. Server crashes, car
crashes; having had them both in rapid succession, I think I have an easier time coping with the server crashes.
So now I have a squished car. The next thing that happened is that I was checking on which versions I have of some
freeware video format conversion tools and I got a little pop-up window that warned that the file I'd selected was
corrupt or damaged and suggested "please run chkdsk." I worried a bit about this, but doing a random check of
other files and directories on my system, everything else seemed fine. There were no other disk errors in the system
event log, before or since, and all other files and directories checked were readable without error, hesitation, or incident.
So I ran chkdsk. Okay, we know at this point that I'm foolish enough to be running Windows on some of my computers
and now you know I'm foolish enough to do what Windows says, at least occasionally. Bill Gates' ghost
then chose to punish my foolishness by wiping the hard drive. Not completely, but 97% of it. No
correlation between file size, creation date, etc., that I could see. In the directories that weren't deleted, it looked
like two or three files at random were left and the rest deleted. The only time more files than that were left in a
single directory was when it had deleted the directory tree and moved all its files into a directory like
\found.000\dir0024.chk.
Ooops, again. I shouldn't have assumed that "please run chkdsk" meant I should run chkdsk. Apparently that's not
really what it means at all. Doing a search led me to a Microsoft technical bulletin in which they said that when you see
the message, "please run chkdsk," they do not recommend running chkdsk, as data loss may occur.
And so it did. Thanks for the warning.
Fortunately, I just happened to have my old computer which still has reasonably recent copies of all the data that
chkdsk wiped out. Or, rather, I did have my old computer which had reasonably recent copies of
all the data that chkdsk wiped out...up until I'd pulled the drives and controller out of it to figure out what was
wrong with the new webserver. Ooops, again. I guess that should have been "unfortunately" back there at the
beginning of this paragraph.
My grand plan was that, since I'd upgraded the house to gigabit, I'd set up a dedicated fileserver downstairs
with redundant storage and move all the critical files down there. I just haven't gotten around to doing that
yet. Other things have been more pressing in the meantime.
But the truth is that having redundant storage wouldn't have saved me this time anyway, since it wasn't a hardware
failure (subsequent testing has not revealed any problems with the drive or controller). I figure that the only thing
that would have saved me from this heartache (or "headache"...or whatever body part ache would be most appropriate here) would
have been to be running something neither Micro nor Soft for the operating system on the fileserver. Guess that's
just what I'll do in the future, then.
After I do some cleanup, organization, and data recovery. Think I'll give the DLT drives a little more exercise, too.
It's not like I don't have several hundred DLT tapes on hand. I just hadn't used the drives lately because 1) the new
systems I'd set up were fairly minimalist and didn't even have SCSI (this has changed in the last 24 hours) and 2) I'd
been trying to cut down the noise levels up here and the DLT drives and enclosures are pretty noisy.
So, in summary, I've had server crashes, car crashes, workstation crashes, hardware woes, and a whole lot of
throwing up. On the plus side, most of the administrative weirdness that has resulted from my doing some
upgrading and reconfiguring of the Nyx servers seems to have gotten resolved, though we're not entirely sure
at this point what finally did fix the problems. I still have some remaining issues to check on even
though they aren't causing any problems yet; better to run some tests anyway *before* embarking on the next
planned set of changes to the Nyx servers.
But I think I'll run some backups, first.