Old 01-22-2003, 10:11 AM   #1
griffman
MVP
 
Join Date: Dec 2001
Location: Portland, OR
Posts: 1,472
Help debug a macosxhints.com crash...

In case you didn't notice (I certainly did!), macosxhints.com and the forum site were offline for a few hours this morning. My ISP has managed to deduce the following by looking at the log files:
Quote:
The cron job that seems to end the server's life is tmpwatch, which cleans up files that are 240 and 720 hours old in various directories. So my logic is if it cleans up files in /tmp that are 240 hours old, then it would coincide with the last crash.

We figured this out as it was exactly 10 days ago when we had our last outage, so we're very suspicious of this script. The bit that cleans up the tmp directory looks like this:
Code:
% less tmpwatch
/usr/sbin/tmpwatch 240 /tmp
/usr/sbin/tmpwatch 720 /var/tmp
for d in /var/{cache/man,catman}/{cat?,X11R6/cat?,local/cat?}; do
    if [ -d "$d" ]; then
        /usr/sbin/tmpwatch -f 720 $d
    fi
done
Anyone have any idea why that might be bringing our server down? Or if we're even looking in the right spot?

thanks;
-rob.
Old 01-22-2003, 12:19 PM   #2
jwigdahl
Prospect
 
Join Date: Jan 2002
Location: San Diego, CA
Posts: 3
I'd be interested in seeing the indications in the logfiles that led your ISP to suspect tmpwatch. Are there specs on your server posted somewhere? Hardware? OS?
Old 01-22-2003, 12:29 PM   #3
hayne
Site Admin
 
Join Date: Jan 2002
Location: Montreal
Posts: 32,473
monitor your tmp files

I don't have much to offer here but I would suggest that you might start monitoring your tmp files (those created by Geeklog) yourself to see if there are any which stick around - i.e. are not so temporary. You could do this with a script that gets called periodically via cron and does a find for files older than a certain amount, emailing or logging the results. If there are any files which stick around, you can then try to figure out why.
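
A rough sketch of that kind of check, in case it helps. The function name, the 7-day default and the log path are placeholders I picked for illustration, not anything Geeklog or Red Hat ships:

```shell
#!/bin/sh
# Sketch only: list regular files under a directory that have not been
# modified for more than N days, and append the list to a log file.
# report_stale_files, the 7-day default and the log path are assumptions.

report_stale_files() {
    dir=$1     # directory to scan, e.g. /tmp
    days=$2    # age threshold in days
    log=$3     # file the report is appended to
    {
        echo "== $(date) : files in $dir older than $days days =="
        find "$dir" -type f -mtime +"$days" -print
    } >> "$log"
}

# Only do anything when called with at least a directory argument.
if [ "$#" -ge 1 ]; then
    report_stale_files "$1" "${2:-7}" "${3:-/var/tmp/stale-files.log}"
fi
```

Run from cron once or twice a day against /tmp, the log should show whether anything is hanging around long enough to reach the 240-hour mark.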
Old 01-22-2003, 12:45 PM   #4
sao
Moderator
 
Join Date: Jan 2002
Location: Singapore
Posts: 4,237
What about using xtail in a script to monitor the tmp files?
Old 01-22-2003, 12:56 PM   #5
griffman
MVP
 
Join Date: Dec 2001
Location: Portland, OR
Posts: 1,472
Regarding how we zeroed in on the tmp files and the cleanup script ... we had this crash one previous time since the site move, and it was almost exactly 10 days ago ... within the realm of rounding, the cron job last night ran just after the 10 day mark was reached.

That's what pointed us in that direction ... and I'll certainly keep an eye on the directory in the future to see what (if anything) is accumulating there. Maybe we could modify the script to specifically ignore select filenames if we can identify them...

-rob.
Old 01-22-2003, 01:05 PM   #6
jwigdahl
Prospect
 
Join Date: Jan 2002
Location: San Diego, CA
Posts: 3
From what I've been able to gather, it looks like you're running a Red Hat box. I assume you've run up2date and everything is current? I run several RH boxes myself with this same script in place and significant activity in the tmp dirs (PHP sessions). I have yet to see it cause a problem, let alone a crash. Not sure what distro you have installed, but this may be of interest:

https://rhn.redhat.com/errata/RHBA-2001-104.html
Old 01-22-2003, 01:45 PM   #7
breen
Major Leaguer
 
Join Date: Jan 2003
Location: Bay Area
Posts: 327
troubleshooting

Have you tried running that script manually? At least you'll be able to watch what happens...
Old 01-22-2003, 01:51 PM   #8
griffman
MVP
 
Join Date: Dec 2001
Location: Portland, OR
Posts: 1,472
Hmm, not sure how long I'd be able to watch it as it seems to shut down Apache at some point!

At the moment, the /tmp directory is completely empty; I think I'll add a script that sends a directory list to me a couple times a day just so I can see what's in it at various times.

As far as Red Hat goes, I believe we're up to date, but I'll certainly double-check tonight!

Thanks for the ideas thus far...

-rob.
Old 01-22-2003, 02:11 PM   #9
batmanppc
Prospect
 
Join Date: Jan 2002
Posts: 4
Take a look at the apache error log around the time you think the server went down.

It'd be useful to know if the server died or was shut down.
Old 01-22-2003, 02:16 PM   #10
griffman
MVP
 
Join Date: Dec 2001
Location: Portland, OR
Posts: 1,472
The last entry before five hours of downtime is this:
Code:
[Wed Jan 22 04:02:18 2003] [error] [client xxx.xx.xx.xx]
    request failed: error reading the headers
That's it ... then five hours later, new error.log entries.

-rob.
Old 01-22-2003, 02:30 PM   #11
jwigdahl
Prospect
 
Join Date: Jan 2002
Location: San Diego, CA
Posts: 3
The time is definitely incriminating. 4:02AM is when Red Hat's daily cron jobs run. This could mean any script residing in the /etc/cron.daily directory though... not just tmpwatch.

Are we talking just an Apache crash? Or the whole machine? If it's just Apache, I've seen something like this when the logs get rotated. The logrotate script sends a 'kill -HUP' to apache, which should cause it to reread its config and create new logfiles (since the old ones have been rolled). This process doesn't always work, though. What I've done on systems exhibiting this problem is to change a line in the file /etc/logrotate.d/apache.

You can see the line in there that sends the 'kill -HUP' to apache... change the line to:

/etc/init.d/httpd restart 2> /dev/null || true

This will make apache do a full restart which might result in a few seconds of 'downtime', but is probably safer anyway.
Old 01-22-2003, 02:39 PM   #12
griffman
MVP
 
Join Date: Dec 2001
Location: Portland, OR
Posts: 1,472
It's not just Apache, as the server basically vanishes from existence -- ftp, ssh, etc. all fail after the crash.

-rob.
Old 01-22-2003, 02:46 PM   #13
Glanz
Major Leaguer
 
Join Date: Aug 2002
Location: Montreal
Posts: 373
You might want to try this:

# begin
# find all files not accessed within the last 7 days in /var/adm, /var/tmp and /tmp
# (the finds run in the background, so wait for them to finish before removing anything)
find /var/adm -type f -atime +7 -print > /var/tmp/deadfiles-varadm &
find /var/tmp -type f -atime +7 -print > /var/tmp/deadfiles-vartmp &
find /tmp -type f -atime +7 -print > /var/tmp/deadfiles-tmp &
wait
# remove all files not accessed within the last 7 days in /var/adm, /var/tmp, and /tmp
# (note: filenames containing spaces will not survive the backticks)
rm -f `cat /var/tmp/deadfiles-varadm`
rm -f `cat /var/tmp/deadfiles-vartmp`
rm -f `cat /var/tmp/deadfiles-tmp`
# clear "deadfiles-*" repositories (only the lists, not everything in /var/tmp)
rm -f /var/tmp/deadfiles-*

# the following for Solaris only

# delete core files
find . -name core -exec rm {} \;
# delete crash dump files
rm -r /var/crash/<system>/*
#end

<system> is the hostname of the box that made the crash dump files...

Also, instead of -atime +n (finds files that haven't been accessed within the last n days), you could use -mtime +n (finds files that haven't been modified within the last n days); this would leave any files that had been modified within the last n days, and may give you a clue as to which file or files are the culprit behind your crashes.

If that doesn't work, then you'll have to check your processes to see if any of them lock or use files in either /var/tmp or /tmp... a very slim possibility.

Last edited by Glanz; 01-22-2003 at 02:49 PM.
Old 01-22-2003, 02:53 PM   #14
batmanppc
Prospect
 
Join Date: Jan 2002
Posts: 4
Quote:
Originally posted by griffman
It's not just Apache, as the server basically vanishes from existence -- ftp, ssh, etc. all fail after the crash.

-rob.

Ok, now we're getting somewhere. Might be a hardware issue.

Have you tried running the cron scripts manually to see what happens?

I've seen crashes as you're describing on high disk I/O and memory usage.
Old 01-22-2003, 03:04 PM   #15
griffman
MVP
 
Join Date: Dec 2001
Location: Portland, OR
Posts: 1,472
The scripts run every night (I believe), and they seem to succeed every night except the 10th in a row ... so is it possible that one script would trigger a hardware glitch when the others do not?

Perhaps we'll do some experimenting this weekend...

-rob.
Old 01-22-2003, 03:15 PM   #16
batmanppc
Prospect
 
Join Date: Jan 2002
Posts: 4
Quote:
Originally posted by griffman
The scripts run every night (I believe), and they seem to succeed every night except the 10th in a row ... so is it possible that one script would trigger a hardware glitch when the others do not?

Perhaps we'll do some experimenting this weekend...

-rob.

Correct, something is throwing it over the edge. Since the current theory is that something happens every 10 days, it's assumed that it's tmpwatch because it's the only script that has something with that time limit. One thing to try is to change the 10 days (240 hours) to a few hours while you're testing, so that it actually has something to remove.

You may not actually see anything go wrong until the conditions are just right (i.e. lots of files in /tmp/).

I'm assuming each time the server crashed the ISP rebooted the machine. Was there anything on the console (if there was a monitor attached)?

Also it'd be helpful to have the full specs of the server.
Old 01-22-2003, 03:22 PM   #17
breen
Major Leaguer
 
Join Date: Jan 2003
Location: Bay Area
Posts: 327
Quote:
Originally posted by griffman
The scripts run every night (I believe),

Right. At 4:02 as an earlier poster pointed out.
Quote:
and they seem to succeed every night except the 10th in a row ... so is it possible that one script would trigger a hardware glitch when the others do not?

Hmm. Let me guess -- /etc/cron.daily/tmpwatch checks files under /tmp and /var. I bet that one or both of those is running on its own partition. That could be the source of a hw glitch.

You might want to fsck those partitions.
Old 01-22-2003, 07:36 PM   #18
stetner
MVP
 
Join Date: Jan 2002
Location: Brisbane, Australia
Posts: 1,108
Re: Help debug a macosxhints.com crash...

Quote:
Originally posted by griffman
The bit that cleans up the tmp directory looks like this:
Code:
% less tmpwatch
/usr/sbin/tmpwatch 240 /tmp
/usr/sbin/tmpwatch 720 /var/tmp
for d in /var/{cache/man,catman}/{cat?,X11R6/cat?,local/cat?}; do
    if [ -d "$d" ]; then
        /usr/sbin/tmpwatch -f 720 $d
    fi
done

If we look at the script above, it appears that all directories above would be searched every single night, so hardware 'read' problems should be out.

It still could be hardware, as there would be much more thrashing when removing a bunch of files.

tmpwatch says it only removes regular files and empty directories. So, it should leave pipes and special files alone.

It could remove an empty directory that something assumes is going to be there and that could cause application crashes when it is not (but an OS crash, hmm, I wouldn't think so).

What I would do would be to run another tmpwatch script an hour before this one runs with options:
Code:
--test
Doesn't remove files, but goes through the motions of
  removing them. This implies -v. 

-v
Print a verbose display. Two levels of verboseness are available --
  use this option twice to get the most verbose output.
and output that list to a file you can peruse (after the next crash 8-) to see if anything jumps out at you.

Actually, you should be able to do this a day after the machine came back up and use 24 for the hours, as the offending file, if there is one, should already be there.
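
Something along these lines might do for the dry run. It is only a sketch: the function name, the TMPWATCH variable and the output directory are my own placeholders, and the only tmpwatch option it relies on is the --test one quoted above.

```shell
#!/bin/sh
# Sketch only: run tmpwatch in --test mode and keep its output in a
# timestamped file so there is something to read after the next crash.
# TMPWATCH, tmpwatch_dry_run and the output directory are assumptions.

TMPWATCH=${TMPWATCH:-/usr/sbin/tmpwatch}

tmpwatch_dry_run() {
    hours=$1    # age threshold in hours, e.g. 240 (or 24 while testing)
    dir=$2      # directory tmpwatch would clean, e.g. /tmp
    outdir=$3   # where the listing is saved
    out="$outdir/tmpwatch-test.$(date +%Y%m%d-%H%M%S)"
    # --test goes through the motions without removing anything
    "$TMPWATCH" --test "$hours" "$dir" > "$out" 2>&1
    # print the name of the file that was written
    echo "$out"
}
```

Scheduled an hour before the real job with something like `tmpwatch_dry_run 240 /tmp /root/tmpwatch-logs` (the log directory being whatever suits you), each night leaves its own listing behind.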

An fsck wouldn't be a bad idea, but I would think one is already being run after every crash anyway....

Good luck

Edit: Moderator added line breaks to "code" tags to narrow display width...
__________________
Douglas G. Stetner
UNIX Live Free Or Die

Last edited by stetner; 01-22-2003 at 07:39 PM.
Old 01-22-2003, 09:51 PM   #19
plattapuss
Site Admin
 
Join Date: Jan 2002
Location: Montreal, Quebec
Posts: 34
Thanks for all the ideas on the server. Here is some more info so far:

- first and foremost, the system is up-to-date as of today.

- the logs show each cron job as it runs, tmpwatch is the last one to run before the system hangs.

- fsck is run on reboot, as it is after any crash on Red Hat.

I am putting in a small cron job called 'zz' to simply echo a word. This will let me see if that cron job gets run at all or if the system goes down on tmpwatch.
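
Roughly what I have in mind for 'zz', as a sketch. It assumes the cron.daily scripts are run in alphabetical order so that 'zz' comes after tmpwatch; the function name and the log path are placeholders:

```shell
#!/bin/sh
# Sketch only: a marker job meant to live at /etc/cron.daily/zz.
# Assumption: the daily scripts run in alphabetical order, so this one
# runs after tmpwatch. write_marker and the log path are placeholders.

write_marker() {
    # append a timestamped line to the marker log given as $1
    echo "zz reached at $(date '+%Y-%m-%d %H:%M:%S')" >> "$1"
    # flush to disk in case the machine goes down right afterwards
    sync
}

# In the real cron script the last line would be something like:
# write_marker /var/log/cron-zz.log
```

If the marker line for a given night is missing after a crash, the machine went down before the daily run finished.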

I will also include the hint from 'stetner' to run tmpwatch in test mode before the real tmpwatch cron job, to get a file listing. I will have it dump to an incremental file so 'if' the machine goes down again, I will have some more information.

We also ran a memory check; it all looks good.

To help us all out I have implemented a monitoring system to notify both myself and Rob if the server goes down again. So if I miss the notice, Rob can drag me out of bed in our lovely -25C (-35C with windchill) weather at 4:02 in the morning.

The machine these sites are on is a Dual 2.0GHz Xeon with 1GB of ECC DDR and two 120GB IDE drives (yes, I have heard the arguments about IDE versus SCSI).

Thanks again for the help

Your friendly ISP guy
Plattapuss
Old 01-23-2003, 01:09 AM   #20
jaysoffian
Prospect
 
Join Date: Jan 2003
Posts: 4
Quote:
Originally posted by plattapuss
Thanks for all the ideas on the server. Here is some more info so far:

- the logs show each cron job as it runs, tmpwatch is the last one to run before the system hangs.

I'd like to be clear about what is happening. I've read "crash", "hang", etc. Some questions:

- Is the machine kernel panicking? If so, there should be an indication on the console.

- If the machine is not panicking, what is happening? Is it just disappearing from the network (that is, not even pingable)? Or does it become unresponsive on the console as well?

- I'm not sure what your syslog.conf looks like, but you might consider adding an entry to catch everything:

*.* /var/log/debug.log

then touch /var/log/debug.log and HUP syslog. Depending upon how fast this grows, you might want to rotate it via logrotate.
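
In script form that could look something like the sketch below. The function name is made up, and the location of the syslogd pid file in the last comment is an assumption, so check it on your box before using it:

```shell
#!/bin/sh
# Sketch only: add a catch-all rule to a syslog.conf-style file (once)
# and make sure the target log file exists.
# add_catchall_rule is a made-up name; paths are passed in by the caller.

add_catchall_rule() {
    conf=$1      # e.g. /etc/syslog.conf
    logfile=$2   # e.g. /var/log/debug.log
    rule="*.* $logfile"
    # -F: fixed string, -x: whole line, so "*.*" is not read as a regex
    if ! grep -qFx "$rule" "$conf" 2>/dev/null; then
        printf '%s\n' "$rule" >> "$conf"
    fi
    touch "$logfile"
}

# Afterwards syslogd still has to be told to reread its config, e.g.
#   kill -HUP "$(cat /var/run/syslogd.pid)"   # pid file path is an assumption
```

Running it twice leaves only one copy of the rule in the file, so it is safe to keep in a setup script.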

You might also want to use the syslog mark facility to pin down the time the system is crashing/hanging, though it looks like you get enough HTTP traffic that the apache logs are sufficient to indicate the time of the crash/hang. Of course, if the machine is just disappearing from the network but not actually hanging, then the HTTP logs will obviously stop when there is no more network, even though the server may still be up.


Good luck,

j.
All times are GMT -5.