Hello,
- The kernel is not panicking. - The machine locks up entirely. At the console too. I will look into the debug.log idea, though it will grow pretty fast with what we have going through the system. Maintainability might be a concern. The actual time of the hang is 4:02, based on various logs and activity. Since the whole machine hangs, we feel pretty confident that this is an accurate time. As for the actual second of the hang, we think it might be at 4:02:49. Still checking on that one. :) Although not normally a huge problem, from our searching it is looking more like something that tmpwatch is doing. Thanks again for all the ideas, we will keep everyone posted! Your friendly (and cold) ISP Plattapuss |
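One way to pin down the exact second of a hard lockup, as an alternative to a fast-growing debug.log, is a tiny heartbeat log: the last line on disk after the reboot is the last second the machine was alive. This is only a sketch of that idea and not something anyone in the thread ran; the function names and the log path in the usage note are made up for illustration.

```sh
#!/bin/sh
# Hypothetical helper: append one timestamp to a log and flush it to disk.
heartbeat_once() {
    log_file=$1
    date '+%Y-%m-%d %H:%M:%S' >> "$log_file"
    sync   # flush buffers so the line survives a hard lockup
}

# Hypothetical helper: write a heartbeat every second until killed.
heartbeat_loop() {
    log_file=$1
    while :; do
        heartbeat_once "$log_file"
        sleep 1
    done
}
```

It could be started shortly before the cron window with something like `heartbeat_loop /var/log/heartbeat.log &` (path assumed) and stopped afterwards, so the file stays small.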
Interesting update.
Yesterday we had a similar hang. This time while running slocate in the cron jobs. This cron also works with the tmp directories. We are still investigating. Plattapuss |
Hmmm...
slocate.cron on my box reads:
#!/bin/sh
renice +19 -p $$ >/dev/null 2>&1
/usr/bin/updatedb -f "nfs,smbfs,ncpfs,proc,devpts" -e "/tmp,/var/tmp,/usr/tmp,/afs,/net"
Which is explicitly ignoring any of the tmp directories and also various net filesystems. What it's not excluding, I notice, is iso9660 and /mnt. Do you possibly have a tripwire database on a CD mounted somewhere? (Yeah, it's a shot in the dark...) |
Okay, so I looked through slocate when I was still asleep :)
Yes it ignores those directories. I will cross-reference the ignored ones with the ones that tmpwatch looks at and see what shows. I have no tripwire db on a cd, nope. Good thought though. Again, thanks for all the suggestions. Plattapuss |
Yup, again...
Last night, same time ... argh. We crashed at 4:06am this time -- see below, as all crons ran this time.
[edit - cron is not causing the problem, at least not obviously] -rob. |
All cron jobs ran successfully this time. The server went down 4 minutes after the last cron job, and the last cron job was a simple script to email me, telling me all the cron jobs ran successfully.
So tonight, we are moving the time of the cron jobs to a decent hour when Rob and I are up so we can monitor the events closely. That hint, thanks to Rob :) Plattapuss |
Update...
So last night, we set cron to run at 10:02pm. And it did. All jobs completed. Then the server crashed.
After the reboot, Plattapuss re-ran each of the 16 crons by hand. All worked fine and the server stayed up. Quite odd. Somehow during all our back and forth discussion, he decided to disable the ensim (the server management software) backup cron job. When we then re-ran the cron task (the automated way), it worked fine. It also worked fine again this morning in a second round of testing. We'll see what happens tonight at 10:00pm, but it looks like this backup cron script was somehow the culprit. Now to figure out why... -rob. |
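For the "re-run each cron by hand" step, a wrapper that logs before and after every job makes it easier to see which job was in flight if the box dies mid-run. The sketch below is an assumption about how that could be done, not the script actually used on the server; `run_jobs_logged` and its two arguments are hypothetical names.

```sh
#!/bin/sh
# Hypothetical helper: run every executable file in job_dir, one at a time,
# logging a START and END line (with exit status) around each job.
run_jobs_logged() {
    job_dir=$1
    log_file=$2
    for job in "$job_dir"/*; do
        # Skip anything that is not a regular, executable file.
        [ -f "$job" ] && [ -x "$job" ] || continue
        echo "$(date '+%H:%M:%S') START $job" >> "$log_file"
        sync   # make sure the START line is on disk before the job runs
        "$job"
        status=$?
        echo "$(date '+%H:%M:%S') END   $job (exit $status)" >> "$log_file"
        sync
    done
}
```

After a crash, a START line with no matching END line points at the job that was running when the machine went away.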
Was there another event this morning? About 5:04 AM Central time, Feb. 4th? Or do things go offline briefly as part of the normal cron tasks?
|
Yep, it was offline again last night. But this time, we had all sorts of extra monitoring stuff set up, so hopefully we'll get better information on the problem...
-rob. |
As you have probably all noticed, the MacOSXHints sites were down for a few hours tonight. Well, 8 hours to be precise. This downtime was needed to, hopefully once and for all, put an end to our server issues.
After much debugging we think we have found the problem, though we will not be entirely sure for a couple of weeks. I do apologize for such a long down time, and hopefully we won't have to endure any more down time for a long long long time :( Good night Plattapuss |
Considering www.microsoft.de (the German Microsoft site) was down this morning for more than 3 hours, you're doing a great job, given that they probably have 30 times the manpower you have for troubleshooting. ;)
|
To follow up, our server was swapped out for a complete hardware change last week. Since then we had no crashes...until this morning at about 4:04. Same problem again.
So we can pretty safely rule out a hardware issue at this point. We are now investigating the issues that our version of Redhat might have with the Dual Xeon machines. As always, if anyone has any thoughts... Plattapuss |
Check the RedHat driver for the NIC. MacFixIt had a similar problem and traced it to needing an ethernet driver update from Intel.
|
The problems seem to have returned; cron has crashed the machine each of the last two nights. I talked to the MacFixIt folks; unfortunately, that's not our problem (they were getting log messages about the card prior to failure; our machine just vanishes).
Beginning tonight, we're going to start a form of a "binary tree" analysis by splitting the cron jobs in half. If we can figure out which half crashes the box, we'll split them again, etc. If they don't crash the machine when split the first time, I guess we'll just leave it that way for a while. I know it's frustrating for the readers, and very frustrating for Plattapuss and me -- this is now a 100% new machine (motherboard, RAM, hard drives, cards, etc.) and yet the problem persists, so one would now assume it's a software problem... Sorry for the downtime, and as always, troubleshooting suggestions are welcome. -rob. |
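The halving plan above can be sketched as a small helper that prints one half of the job list, so each half can be scheduled separately and the crashing half split again the next night. This is only an illustration: `list_half` is a hypothetical name, and it assumes the jobs are plain files in one directory with no newlines in their names.

```sh
#!/bin/sh
# Hypothetical helper: print the "first" or "second" half of the file
# names in job_dir (sorted by ls). With an odd count, the first half
# gets the extra job.
list_half() {
    job_dir=$1
    half=$2
    total=$(ls -1 "$job_dir" | grep -c '')
    mid=$(( (total + 1) / 2 ))
    case $half in
        first)  ls -1 "$job_dir" | head -n "$mid" ;;
        second) ls -1 "$job_dir" | tail -n "+$(( mid + 1 ))" ;;
        *) echo "usage: list_half DIR first|second" >&2; return 1 ;;
    esac
}
```

With 16 jobs, that means at most four rounds of splitting to get down to a single job, assuming one job alone triggers the crash.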
I second what he said :)
Plattapuss |
I have had a similar problem with HP hardware, at backup time one night one machine just up and hung. Panic ensued, the machine was power cycled, came up all OK. A few nights later it did it again. It was determined to be the drive firmware. It was updated and all was fine. A few days later another server did the exact same thing (before we could schedule down time and check all our machines of course), same old firmware....
All up three machines 'disappeared' like yours is doing. Firmware updates fixed two of three problem machines (one would not accept upgrade so HW was replaced) and all our other machines were updated before any trouble occurred. Now I know you have replaced disk hardware, but could it be the same firmware on those new drives?? Our problem came out of the blue as well, these machines were running for 6-12-18 months with no problems, one day Ack! and all within two weeks! Hope you find it soon though.... |
Great! I am looking into it as we speak.
Plattapuss |
Something to look for
I did some research yesterday based on a post to another list and discovered that RH 7.3 has a bug in logrotate.
It gets tickled by code like this in /etc/logrotate.d/mailman: Code:
/var/log/mailman/* {
I imagine it would also cause you to run out of inodes on /var if this went on long enough. I'd look at all the files in /etc/logrotate.d to see if you've got a rogue wildcard. Breen |
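A quick way to do the check suggested above is to grep the logrotate config directory for path lines that contain a wildcard. The sketch below is an assumption about how one might do it; `find_wildcards` is a hypothetical name, and the pattern only catches lines that start with an absolute path and contain a `*`.

```sh
#!/bin/sh
# Hypothetical helper: list file:line:text for every line in conf_dir that
# begins with an absolute path and contains a "*" wildcard.
find_wildcards() {
    conf_dir=$1
    # -H forces the file name to be printed even if only one file is searched.
    grep -Hn '^[[:space:]]*/.*\*' "$conf_dir"/* 2>/dev/null
}
```

Running `find_wildcards /etc/logrotate.d` and getting no output would suggest there is no wildcard entry of this form to worry about.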
Good thought. I wish that were the solution. But alas, we do not have any wild cards in any of our logrotate.d conf files.
We do have another new machine and hard drive again. This time a completely different brand and size of drive. The thought still being that this is caused by an incompatibility between the Dual Xeon and the drive make and size we were using. So far we are excited over: 12:18pm up 3 days, 18:01 :rolleyes: Plattapuss |
We'll keep our fingers crossed... Breen |