#1 |
|
MVP
Join Date: Dec 2001
Location: Portland, OR
Posts: 1,472
|
Help debug a macosxhints.com crash...
In case you didn't notice (I certainly did!), macosxhints.com and the forum site were offline for a few hours this morning. My ISP has managed to deduce the following by looking at the log files:
We figured this out as it was exactly 10 days ago when we had our last outage, so we're very suspicious of this script. The bit that cleans up the tmp directory looks like this:
Code:
% less tmpwatch
/usr/sbin/tmpwatch 240 /tmp
/usr/sbin/tmpwatch 720 /var/tmp
for d in /var/{cache/man,catman}/{cat?,X11R6/cat?,local/cat?}; do
    if [ -d "$d" ]; then
        /usr/sbin/tmpwatch -f 720 "$d"
    fi
done
thanks; -rob. |
#2 |
|
Prospect
Join Date: Jan 2002
Location: San Diego, CA
Posts: 3
|
I'd be interested in seeing the indications in the logfiles that led your ISP to suspect tmpwatch. Are there specs on your server posted somewhere? Hardware? OS?
#3 |
|
Site Admin
Join Date: Jan 2002
Location: Montreal
Posts: 32,473
|
monitor your tmp files
I don't have much to offer here, but I would suggest that you start monitoring your tmp files (those created by Geeklog) yourself to see if any of them stick around, i.e. are not so temporary. You could do this with a script that gets called periodically via cron and does a find for files older than a certain age, emailing or logging the results. If there are any files which stick around, you can then try to figure out why.
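A minimal sketch of the kind of cron-driven check described above. The function name, the 7-day threshold and the log path are all illustrative assumptions, not anything taken from the server in question:

```shell
#!/bin/sh
# Hypothetical helper: append a list of regular files under DIR that have not
# been modified for more than DAYS days to LOGFILE.
# Usage: report_stale_files DIR DAYS LOGFILE
report_stale_files() {
    dir=$1
    days=$2
    log=$3
    {
        echo "== $(date): files under $dir older than $days days =="
        find "$dir" -type f -mtime +"$days" -print
    } >> "$log"
}

# Example call (paths are assumptions), e.g. from a script run by cron:
# report_stale_files /tmp 7 /var/log/tmp-stale.log
```

Logging to a file rather than emailing keeps the sketch free of any mailer dependency; the log could be mailed or inspected by hand afterwards.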
#4 |
|
Moderator
Join Date: Jan 2002
Location: Singapore
Posts: 4,237
|
What about using xtail in a script to monitor the tmp files?
#5 |
|
MVP
Join Date: Dec 2001
Location: Portland, OR
Posts: 1,472
|
Regarding how we zeroed in on the tmp files and the cleanup script ... we had this crash one previous time since the site move, and it was almost exactly 10 days ago ... within the realm of rounding, the cron job last night ran just after the 10 day mark was reached.
That's what pointed us in that direction ... and I'll certainly keep an eye on the directory in the future to see what (if anything) is accumulating there. Maybe we could modify the script to specifically ignore select filenames if we can identify them... -rob.
#6 |
|
Prospect
Join Date: Jan 2002
Location: San Diego, CA
Posts: 3
|
From what I've been able to gather, looks like you're running a Red Hat box. Assuming you've run up2date and everything is current? I run several RH boxes myself with this same script in place and significant activity in the tmp dirs (PHP sessions). Have yet to see it cause a problem, let alone a crash. Not sure what distro you have installed, but this may be of interest:
https://rhn.redhat.com/errata/RHBA-2001-104.html |
#7 |
|
Major Leaguer
Join Date: Jan 2003
Location: Bay Area
Posts: 327
|
troubleshooting
Have you tried running that script manually? At least you'll be able to watch what happens...
#8 |
|
MVP
Join Date: Dec 2001
Location: Portland, OR
Posts: 1,472
|
Hmm, not sure how long I'd be able to watch it as it seems to shut down Apache at some point!
At the moment, the /tmp directory is completely empty; I think I'll add a script that sends a directory list to me a couple times a day just so I can see what's in it at various times. As far as Redhat goes, I believe we're up to date, but I'll certainly double-check tonight! Thanks for the ideas thus far... -rob. |
#9 |
|
Prospect
Join Date: Jan 2002
Posts: 4
|
Take a look at the apache error log around the time you think the server went down.
It'd be useful to know if the server died or was shut down.
#10 |
|
MVP
Join Date: Dec 2001
Location: Portland, OR
Posts: 1,472
|
The last entry before five hours of downtime is this:
Code:
[Wed Jan 22 04:02:18 2003] [error] [client xxx.xx.xx.xx]
request failed: error reading the headers
-rob. |
#11 |
|
Prospect
Join Date: Jan 2002
Location: San Diego, CA
Posts: 3
|
The time is definitely incriminating. 4:02AM is when Red Hat's daily cron jobs run. This could mean any script residing in the /etc/cron.daily directory though... not just tmpwatch.
Are we talking just an Apache crash? Or the whole machine? If it's just Apache, I've seen something like this when the logs get rotated. The logrotate script sends a 'kill -HUP' to apache, which should cause it to reread its config and create new logfiles (since the old ones have been rolled). This process doesn't always work, though. What I've done on systems exhibiting this problem is to change a line in the file /etc/logrotate.d/apache. You can see the line in there that sends the 'kill -HUP' to apache... change the line to:
/etc/init.d/httpd restart 2> /dev/null || true
This will make apache do a full restart, which might result in a few seconds of 'downtime', but is probably safer anyway.
#12 |
|
MVP
Join Date: Dec 2001
Location: Portland, OR
Posts: 1,472
|
It's not just Apache, as the server basically vanishes from existence -- ftp, ssh, etc. all fail after the crash.
-rob. |
#13 |
|
Major Leaguer
Join Date: Aug 2002
Location: Montreal
Posts: 373
|
You might want to try this:
# begin
# find all files not accessed within the last 7 days in /var/adm, /var/tmp and /tmp
find /var/adm -type f -atime +7 -print > /var/tmp/deadfiles-varadm &
find /var/tmp -type f -atime +7 -print > /var/tmp/deadfiles-vartmp &
find /tmp -type f -atime +7 -print > /var/tmp/deadfiles-tmp &
# wait for the background finds to finish before reading their output
wait
# remove all files not accessed within the last 7 days in /var/adm, /var/tmp, and /tmp
rm -f `cat /var/tmp/deadfiles-varadm`
rm -f `cat /var/tmp/deadfiles-vartmp`
rm -f `cat /var/tmp/deadfiles-tmp`
# clear "deadfiles-*" repositories (only the lists, not everything in /var/tmp)
rm -f /var/tmp/deadfiles-*
# the following for Solaris only
# delete core files
find . -name core -exec rm {} \;
# delete crash dump files
rm -r /var/crash/<system>/*
# end
<system> is the hostname of the box that made the crash dump files...
Also, instead of -atime +n (finds files that haven't been accessed within the last n days), you could use -mtime +n (finds files that haven't been modified within the last n days); this would leave any files that had been modified within the last n days, and may give you a clue as to which file or files are the culprit in your crashes. If that doesn't work, then you'll have to check your processes to see if any of them lock or use files in either /var/tmp or /tmp.... a very slim possibility.
Last edited by Glanz; 01-22-2003 at 02:49 PM.
#14 |
|
Prospect
Join Date: Jan 2002
Posts: 4
|
Ok, now we're getting somewhere. Might be a hardware issue. Have you tried running the cron scripts manually and seeing what happens? I've seen crashes like the ones you're describing under high disk I/O and memory usage.
#15 |
|
MVP
Join Date: Dec 2001
Location: Portland, OR
Posts: 1,472
|
The scripts run every night (I believe), and they seem to succeed every night except the 10th in a row ... so is it possible that one script would trigger a hardware glitch when the others do not?
Perhaps we'll do some experimenting this weekend... -rob. |
#16 |
|
Prospect
Join Date: Jan 2002
Posts: 4
|
Correct, something is throwing it over the edge. Since the current theory is that something happens every 10 days, it's assumed that it's tmpwatch because it's the only script that has something with that time limit. One thing to try is to change the 10 days (240 hours) to a few hours while you're testing, so that it actually removes something. You may not actually see anything go wrong until the conditions are just right (i.e. lots of files in /tmp/). I'm assuming each time the server crashed the ISP rebooted the machine. Was there anything on the console (if there was a monitor attached)? Also, it'd be helpful to have the full specs of the server.
#17 |
|
Major Leaguer
Join Date: Jan 2003
Location: Bay Area
Posts: 327
|
Right. At 4:02 as an earlier poster pointed out.
Hmm. Let me guess -- /etc/cron.daily/tmpwatch checks files under /tmp and /var. I bet that one or both of those is running on its own partition. That could be the source of a hw glitch. You might want to fsck those partitions. |
#18 |
|
MVP
Join Date: Jan 2002
Location: Brisbane, Australia
Posts: 1,108
|
Re: Help debug a macosxhints.com crash...
If we look at the script above, it appears that all directories above would be searched every single night, so hardware 'read' problems should be out. It still could be hardware, as there would be much more thrashing when removing a bunch of files. tmpwatch says it only removes regular files and empty directories. So, it should leave pipes and special files alone. It could remove an empty directory that something assumes is going to be there, and that could cause application crashes when it is not (but an OS crash, hmm, I wouldn't think so). What I would do would be to run another tmpwatch script an hour before this one runs with options:
Code:
--test  Doesn't remove files, but goes through the motions of removing them. This implies -v.
-v      Print a verbose display. Two levels of verboseness are available -- use this option twice to get the most verbose output.
Actually you should be able to do this a day after the machine came back up and use 24 for hours, as the offending file, if there is one, should already be there.
An fsck wouldn't be a bad idea, but I would think one would be being run after every crash anyway....
Good luck
Edit: Moderator added line breaks to "code" tags to narrow display width...
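A rough sketch of such a dry run, using only the --test and -v options quoted from the man page above. The function name, the TMPWATCH override and the log directory are assumptions made for illustration:

```shell
#!/bin/sh
# Sketch: run tmpwatch in test mode and keep its output in a timestamped log.
# TMPWATCH defaults to the path used in the cron script; override if needed.
TMPWATCH=${TMPWATCH:-/usr/sbin/tmpwatch}

# Usage: dry_run_tmpwatch HOURS DIR LOGDIR
dry_run_tmpwatch() {
    hours=$1
    dir=$2
    logdir=$3
    log="$logdir/tmpwatch-test-$(date +%Y%m%d-%H%M%S).log"
    # --test goes through the motions without removing anything;
    # -v twice asks for the most verbose output.
    "$TMPWATCH" --test -v -v "$hours" "$dir" > "$log" 2>&1
    sync    # flush the log to disk in case the machine goes down afterwards
    echo "$log"
}

# Example call (log directory is an assumption):
# dry_run_tmpwatch 24 /tmp /var/log
```

Writing a new file per run means an earlier listing is never overwritten if the machine hangs partway through a later one.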
__________________
Douglas G. Stetner UNIX Live Free Or Die Last edited by stetner; 01-22-2003 at 07:39 PM. |
#19 |
|
Site Admin
Join Date: Jan 2002
Location: Montreal, Quebec
Posts: 34
|
Thanks for all the ideas on the server. Here is some more info so far:
- first and foremost, the system is up-to-date as of today.
- the logs show each cron job as it runs, tmpwatch is the last one to run before the system hangs.
- fsck is run on reboot as it does on any reboot from a crash on Redhat
I am putting in a small cron job called 'zz' to simply echo a word. This will let me see if that cron job gets run at all or if the system goes down on tmpwatch. I will also include the hint from 'stetner' to run tmpwatch in test mode before the real tmpwatch cron job, to get a file listing. I will have it dump to an incremental file so 'if' the machine goes down again, I will have some more information.
We also ran a memory check, it all looks good.
To help us all out I have implemented a monitoring system to notify both myself and Rob if the server goes down again. So if I miss the notice, Rob can drag me out of bed in our lovely -25C (-35C with windchill) weather at 4:02 in the morning.
The machine these sites are on is a Dual 2.0GHz Xeon with 1GB of ECC DDR and two 120GB IDE drives (yes I have heard the arguments about IDE versus SCSI).
Thanks again for the help
Your friendly ISP guy
Plattapuss
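For illustration, a marker job along the lines of the 'zz' idea might look like the sketch below. It assumes the scripts in /etc/cron.daily are run in name order, so a file named zz would come after tmpwatch; the log path and the marker text are made up for the example:

```shell
#!/bin/sh
# Sketch of a marker job: append a timestamped word to a log so it is clear
# whether the daily cron run got past the scripts that sort before this one.
# Usage: log_marker WORD LOGFILE
log_marker() {
    word=$1
    log=$2
    echo "$(date '+%Y-%m-%d %H:%M:%S') $word" >> "$log"
}

# Example call (log path is an assumption):
# log_marker "cron.daily-finished" /var/log/cron-marker.log
```

If the marker for a given night is missing from the log, the system went down before the end of the daily run.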
#20 |
|
Prospect
Join Date: Jan 2003
Posts: 4
|
I'd like to be clear about what is happening. I've read crash, hang, etc. Some questions:
- Is the machine kernel panicking? If so, there should be an indication on the console.
- If the machine is not panicking, what is happening? Is it just disappearing from the network (that is, not even pingable)? Or does it become unresponsive on the console as well?
- I'm not sure what your syslog.conf looks like, but you might consider adding an entry to catch everything:
*.* /var/log/debug.log
then touch /var/log/debug.log and HUP syslog. Depending upon how fast this grows, you might want to rotate it via logrotate.
You might also want to use the syslog mark facility to pin down the time the system is crashing/hanging, though it looks like you get enough HTTP traffic that the apache logs are sufficient to indicate the time of the crash/hang. Of course, if the machine is just disappearing from the network but not actually hanging, then the HTTP logs will obviously stop when there is no more network, even though the server may still be up.
Good luck,
j.
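The syslog steps above could be sketched roughly as follows. The function name is hypothetical, and the file locations in the commented example (including the pid file) are assumed typical values that should be checked on the actual machine:

```shell
#!/bin/sh
# Sketch: add a catch-all rule to a syslog.conf-style file (only once) and
# create the target log file.
# Usage: add_catch_all CONF DEBUG_LOG
add_catch_all() {
    conf=$1         # e.g. /etc/syslog.conf
    debug_log=$2    # e.g. /var/log/debug.log
    # Skip the append if the log file is already mentioned in the config.
    grep -qF "$debug_log" "$conf" 2>/dev/null ||
        printf '*.*\t%s\n' "$debug_log" >> "$conf"
    touch "$debug_log"
}

# As root (paths and pid file location are assumptions):
# add_catch_all /etc/syslog.conf /var/log/debug.log
# kill -HUP "$(cat /var/run/syslogd.pid)"
```

The HUP is left as a separate, commented step so the function can be tried against a copy of the config first.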