Trygve.Com > Diary > JournalWeblogDiaryWhatsis - June, 2004
actor bodybuilder geek weightlifter
World Conquest
June, 2004
short walk on a long pier

because ... well ... why the hell not ...?

it's a dirty job, but somebody's got to do it.

Sunday, June 27th

5:27PM

To Serve and Protect:

I'm not a great car mechanic. I've changed brake pads, replaced hoses and belts, and done other minor stuff like that, but it's not something I enjoy and, fortunately, it's not a big deal to my ego. In fact, that's extremely fortunate, because, now that I think about it, I'm probably not even a good car mechanic--the only thing I can say in my favor is that what little I have done has worked.

It could be worse. Cars have the great advantage that you can work on them while they're sitting there parked quietly with the engine off. It'd be a lot more difficult if you had to do the repairs while you were driving down the highway. I figure that's one reason why surgeons often get paid more than car mechanics, because they do have to complete their repairs while the engine is still running.

System administration is somewhere in the middle. The fact that you (usually) don't get covered in blood or grease is a big bonus, but a lot of the time you're treating a "patient" that needs to keep running. It's not always possible, and, like automobiles, servers crash sometimes, too.



It's been a rough month out here in the server department and not all of it has been my fault. A lot of it has been, involving subtle complications of my grand plan to upgrade and replace Nyx a piece at a time. Sometimes it's involuntary, like the recent simultaneous failure of two hard drives in the same RAID array on the main news server. Unfortunately, fault-tolerance and redundancy only get you so far, but they do get you far enough to save a lot of time and frustration. Usually.

But not always. If it did, I wouldn't have had to recreate this month's blogulation as many times. I might even have written something interesting...instead of this entry that you're reading now.

One of the drawbacks of running Linux is that it doesn't crash frequently. I really should put in some hard drive monitoring software to detect problems with the RAID arrays before they become critical, but I haven't done that yet. So, sometimes I'll have a drive fail and it'll be a while before I notice. This only proved to be a problem twice; once was that recent news machine problem I just mentioned.
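For what it's worth, the kind of monitoring I keep meaning to set up doesn't have to be fancy. Here's a minimal sketch, and it rests on a big assumption: it reads the status text that Linux software RAID (md) publishes in /proc/mdstat, which is not what my hardware controllers use; those would need the vendor's own utility. The function name and the sample text are made up for illustration.

```python
import re

def degraded_arrays(mdstat_text):
    """Return names of md arrays whose status map shows a missing member.

    Assumes /proc/mdstat-style text, where each array has a header line
    such as 'md0 : active raid1 ...' followed by a line ending in a map
    like [UU] (all members up) or [U_] (one member missing).
    """
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        status = re.search(r"\[([U_]+)\]", line)
        if current and status and "_" in status.group(1):
            degraded.append(current)
    return degraded

# Hypothetical sample; on a real md system you would read /proc/mdstat.
SAMPLE = """\
Personalities : [raid1] [raid5]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]
md1 : active raid5 sdd1[3] sdc1[2] sda2[0]
      215045376 blocks level 5, 64k chunk, algorithm 2 [4/3] [U_UU]
unused devices: <none>
"""

print(degraded_arrays(SAMPLE))
```

Run from cron and mailed to yourself whenever the list isn't empty, something like this would at least catch the first dead drive before the second one goes.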

The latest series of server adventures started with what might have been a type of denial-of-service attack that I hadn't previously been familiar with. Pulled my hair out for a day, trying to figure out what was going on and at one point I rebooted the webserver to see if that would fix the problem.

It didn't fix the problem, but it did identify one of the drives in its array as having gone bad. It had been running for nearly a year since the last time I'd rebooted it, so it actually could have happened some time ago. If I'd been running Windows IIS, I probably wouldn't ever have to worry about machines going weeks between reboots, let alone a year or more, but I'm not, so I do.

One complication with RAID arrays is that the controllers I have won't allow me to replace a drive with one that's smaller, even if it's only by a cylinder or two. Unfortunately (ii), hard drive manufacturers keep twiddling the sizes of a given model of hard drive, and it often seems like they enjoy making later revisions of a particular model a tiny bit smaller than the earlier ones. Unfortunately (iii), there's no good way I've found in the setup menu when creating the array to select less than the maximum size for the controllers that I'm using, except by the subterfuge that I often do of deliberately mixing up the manufacturers and models of drives within a single array so it'll use the size of the smallest drive when allocating space.
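The arithmetic behind that subterfuge is simple enough to sketch. This is only an illustration of the rule as I've described it, not anything the controller firmware actually exposes; the function names and the sector counts are hypothetical.

```python
def allocated_per_drive(member_sectors):
    """Space the array uses on each member: the size of the smallest drive."""
    return min(member_sectors)

def can_replace(member_sectors, replacement_sectors):
    """A replacement is accepted only if it is at least as large as the
    space the array allocated on each member when it was created."""
    return replacement_sectors >= allocated_per_drive(member_sectors)

# Hypothetical sector counts, chosen only to show the effect.
matched = [71_132_959] * 4                                # all the same model
mixed = [71_132_959, 71_132_959, 71_100_000, 71_132_959]  # one smaller drive mixed in
later_revision = 71_120_000                               # "same" model, slightly smaller

print(can_replace(matched, later_revision))  # False
print(can_replace(mixed, later_revision))    # True
```

With matched drives the array claims every last sector, so a slightly smaller later revision gets rejected. Mixing in one smaller drive up front leaves some slack on all the others.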




But in my webserver, which had been running for about four years solid, I hadn't done that. All the drives matched...and all the replacement drives of that model which I had on hand were a tiny bit smaller, so it wouldn't let me swap out the failed drive with any of those. On the fourth such attempt, a second drive also failed, which meant that the array was no more. I spent a few more hours trying to get it back to life, even temporarily, but no luck.

And thus began the adventure of rebuilding the webserver. I didn't build a new one from the ground up--which is what I still plan to do--but at least the old one is now running with new drives.

But it was still flaky, so my next adventure was tracking down external sources of flakiness. I don't guarantee that what I was seeing was the result of a denial-of-service attack (this was much more subtle--and much less effective--than the time I'd been mistaken for New York City), but when I'd hardened the network against it, the webserver weirdness cleared up. I'll accept the uncertainty; I'm not going to let down the shields to see if things start messing up again.

Next, it was on to building another webserver, but the new machine started giving me occasional random hard drive errors. Things got a lot cleaner after I switched back from the 2.6.7 kernel to 2.4.26, but they still weren't behaving quite right. After changing out a whole lot of different components, I eventually discovered that, while the AMI/LSI Logic MegaRAID Elite 1500 in a 3-volt/64-bit PCI slot may work flawlessly when driving arrays mounted in Sun 711 external enclosures, it occasionally locks up under heavy write operations when driving arrays that are operating in LVD mode. The Sun Microsystems 711 enclosures are very cute and compact and I like them a lot, but they don't support LVD operation. Yeah, I've even tried disabling automatic termination and using an external LVD terminator; they still run single-ended.

But, the Enterprise 1500 in the same slot has no such problem with LVD operation. Same PCB, just two more channels. Go figure. In a 5-volt/64-bit PCI slot, they both seem to work. One untested variable is that the 3-volt slots tested were capable of 33/66MHz operation, so it's not absolutely clear whether the relevant problem was the voltage or the speed. (i.e., do the Elite 1500s only support 33MHz operation but don't cause the bus to downshift appropriately?) In any event, it doesn't give you any errors or warnings, stop you from configuring your arrays and formatting them, or anything like that; it just causes the system to lock up mysteriously at some point when it has a large amount of data to write to the array.

But in the process, I pulled parts (including hard drives) that I knew were working from another machine--the predecessor to my current main workstation. No problem, I'd thought; my new machine hadn't had a single problem since I'd set it up three months ago.



Remember how I mentioned up above that I'm not much of a car mechanic and I made the comparison between system administration and medicine? I'd actually started writing this entry some time back; it's just that the combination of crashing servers and the like has gotten in the way of my actually finishing anything. There's some irony in that, back a ways in this timeline, I'd managed to come down with what was certainly the worst case of food poisoning in my life (no surgery involved, however) and then at about this point in the story--while I was getting the bugs out of the new server--I got into a car accident. The only other car accident I've had in my life was when I was a teenager and was stopped at a red light. This time was a little more messy; the rain was coming down heavily and I think I must have hit a broad patch of clay that had washed across the road. Until things came to a sudden stop, it was like neither the steering wheel nor the brakes had any effect whatsoever. Ooops. Server crashes, car crashes; having had them both in rapid succession, I think I have an easier time coping with the server crashes.

So now I have a squished car. The next thing that happened is that I was checking on which versions I have of some freeware video format conversion tools and I got a little pop-up window that warned that the file I'd selected was corrupt or damaged and suggested "please run chkdsk." I worried a bit about this, but doing a random check of other files and directories on my system, everything else seemed fine. There were no other disk errors in the system event log, before or since, and all other files and directories checked were readable without error, hesitation, or incident.

So I ran chkdsk. Okay, we know at this point that I'm foolish enough to be running Windows on some of my computers and now you know I'm foolish enough to do what Windows says, at least occasionally. Bill Gates' ghost then chose to punish my foolishness by wiping the hard drive. Not completely, but 97% of it. No correlation between file size, creation date, etc., that I could see. In the directories that weren't deleted, it looked like two or three files at random were left and the rest deleted. The only time more files than that were left in a single directory was when it had deleted the directory tree and moved all its files into a directory like \found.000\dir0024.chk.

Ooops, again. I shouldn't have assumed that "please run chkdsk" meant I should run chkdsk. Apparently that's not really what it means at all. Doing a search led me to a Microsoft technical bulletin in which they said that when you see the message, "please run chkdsk," they do not recommend running chkdsk, as data loss may occur.

And so it did. Thanks for the warning.

Fortunately, I just happened to have my old computer which still has reasonably recent copies of all the data that chkdsk wiped out. Or, rather, I did have my old computer which had reasonably recent copies of all the data that chkdsk wiped out...up until I'd pulled the drives and controller out of it to figure out what was wrong with the new webserver. Ooops, again. I guess that should have been "unfortunately" back there at the beginning of this paragraph.

My grand plan was that, since I'd upgraded the house to gigabit, I'd set up a dedicated fileserver downstairs with redundant storage and move all the critical files down there. I just haven't gotten around to doing that yet. Other things have been more pressing in the meantime.

But the truth is that having redundant storage wouldn't have saved me this time anyway, since it wasn't a hardware failure (subsequent testing has not revealed any problems with the drive or controller). I figure that the only thing that would have saved me from this heartache (or "headache"...or whatever body part ache would be most appropriate here) would have been to be running something neither Micro nor Soft for the operating system on the fileserver. Guess that's just what I'll do in the future, then.

After I do some cleanup, organization, and data recovery. Think I'll give the DLT drives a little more exercise, too. It's not like I don't have several hundred DLT tapes on hand. I just hadn't used the drives lately because 1) the new systems I'd set up were fairly minimalist and didn't even have SCSI (this has changed in the last 24 hours) and 2) I'd been trying to cut down the noise levels up here and the DLT drives and enclosures are pretty noisy.

So, in summary, I've had server crashes, car crashes, workstation crashes, hardware woes, and a whole lot of throwing up. On the plus side, most of the administrative weirdness that has resulted from my doing some upgrading and reconfiguring of the Nyx servers seems to have gotten resolved, though we're not entirely sure at this point what finally did fix the problems. I still have some remaining issues to check on even though they aren't causing any problems yet; better to run some tests anyway *before* embarking on the next planned set of changes to the Nyx servers.

But I think I'll run some backups, first.


