The macosxhints Forums

The macosxhints Forums (http://hintsforums.macworld.com/index.php)
-   UNIX - General (http://hintsforums.macworld.com/forumdisplay.php?f=16)
-   -   Help debug a macosxhints.com crash... (http://hintsforums.macworld.com/showthread.php?t=8789)

griffman 01-22-2003 10:11 AM

Help debug a macosxhints.com crash...
 
In case you didn't notice (I certainly did!), macosxhints.com and the forum site were offline for a few hours this morning. My ISP has managed to deduce the following by looking at the log files:
Quote:

The cron job that seems to end the server's life is tmpwatch, which cleans up files older than 240 and 720 hours in various directories. So my logic is: if it cleans up files in /tmp that are more than 240 hours old, that would coincide with the last crash.
We figured this out because our last outage was exactly 10 days ago, so we're very suspicious of this script. The bit that cleans up the tmp directories looks like this:
Code:

% less tmpwatch
/usr/sbin/tmpwatch 240 /tmp
/usr/sbin/tmpwatch 720 /var/tmp
for d in /var/{cache/man,catman}/{cat?,X11R6/cat?,local/cat?}; do
    if [ -d "$d" ]; then
        /usr/sbin/tmpwatch -f 720 $d
    fi
done

Anyone have any idea why that might be bringing our server down? Or if we're even looking in the right spot?

thanks;
-rob.

jwigdahl 01-22-2003 12:19 PM

I'd be interested in seeing the indications in the logfiles that led your ISP to suspect tmpwatch. Are there specs on your server posted somewhere? Hardware? OS?

hayne 01-22-2003 12:29 PM

monitor your tmp files
 
I don't have much to offer here but I would suggest that you might start monitoring your tmp files (those created by Geeklog) yourself to see if there are any which stick around - i.e. are not so temporary. You could do this with a script that gets called periodically via cron and does a find for files older than a certain amount, emailing or logging the results. If there are any files which stick around, you can then try to figure out why.
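A minimal sketch of such a check, written as a shell function (the function name and the cron/mail line in the comment are mine for illustration, not part of hayne's suggestion):

```shell
# report_stale: print regular files under directory $1 that have NOT
# been modified in the last $2 days (hypothetical helper name)
report_stale() {
    find "$1" -type f -mtime +"$2" -print
}

# A cron entry could then mail or log the output, e.g. (illustrative only):
#   0 3 * * * /usr/local/bin/report_stale /tmp 7 | mail -s "stale tmp files" root
```

Any file that keeps showing up in the output is a candidate for "not so temporary" and worth investigating.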

sao 01-22-2003 12:45 PM

What about using xtail in a script to monitor the tmp files?

griffman 01-22-2003 12:56 PM

Regarding how we zeroed in on the tmp files and the cleanup script ... we had this crash one previous time since the site move, and it was almost exactly 10 days ago ... within the realm of rounding, the cron job last night ran just after the 10 day mark was reached.

That's what pointed us in that direction ... and I'll certainly keep an eye on the directory in the future to see what (if anything) is accumulating there. Maybe we could modify the script to specifically ignore select filenames if we can identify them...

-rob.

jwigdahl 01-22-2003 01:05 PM

From what I've been able to gather, it looks like you're running a Red Hat box. I assume you've run up2date and everything is current? I run several RH boxes myself with this same script in place and significant activity in the tmp dirs (PHP sessions), and have yet to see it cause a problem, let alone a crash. Not sure which release you have installed, but this may be of interest:

https://rhn.redhat.com/errata/RHBA-2001-104.html

breen 01-22-2003 01:45 PM

troubleshooting
 
Have you tried running that script manually? At least you'll be able to watch what happens...

griffman 01-22-2003 01:51 PM

Hmm, not sure how long I'd be able to watch it as it seems to shut down Apache at some point!

At the moment, the /tmp directory is completely empty; I think I'll add a script that sends a directory list to me a couple times a day just so I can see what's in it at various times.

As far as Redhat goes, I believe we're up to date, but I'll certainly double-check tonight!

Thanks for the ideas thus far...

-rob.

batmanppc 01-22-2003 02:11 PM

Take a look at the apache error log around the time you think the server went down.

It'd be useful to know if the server died or was shutdown.

griffman 01-22-2003 02:16 PM

The last entry before five hours of downtime is this:
Code:

[Wed Jan 22 04:02:18 2003] [error] [client xxx.xx.xx.xx]
    request failed: error reading the headers

That's it ... then five hours later, new error.log entries.

-rob.

jwigdahl 01-22-2003 02:30 PM

The time is definitely incriminating. 4:02AM is when Red Hat's daily cron jobs run. This could mean any script residing in the /etc/cron.daily directory though... not just tmpwatch.

Are we talking just an Apache crash, or the whole machine? If it's just Apache, I've seen something like this when the logs get rotated: the logrotate script sends a 'kill -HUP' to Apache, which should make it reread its config and create new logfiles (since the old ones have been rolled), but this process doesn't always work. What I've done on systems exhibiting this problem is to change a line in the file /etc/logrotate.d/apache:

You can see the line in there that sends the 'kill -HUP' to Apache... change it to:

/etc/init.d/httpd restart 2> /dev/null || true

This will make apache do a full restart which might result in a few seconds of 'downtime', but is probably safer anyway.
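For reference, the stanza in /etc/logrotate.d/apache would then look roughly like this (the log paths are typical for Red Hat of that era and may differ on your install):

```
/var/log/httpd/access_log /var/log/httpd/error_log {
    missingok
    postrotate
        /etc/init.d/httpd restart 2> /dev/null || true
    endscript
}
```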

griffman 01-22-2003 02:39 PM

It's not just Apache, as the server basically vanishes from existence -- ftp, ssh, etc. all fail after the crash.

-rob.

Glanz 01-22-2003 02:46 PM

You might want to try this:

# begin
# find all files not accessed within the last 7 days in /var/adm, /var/tmp, and /tmp
find /var/adm -type f -atime +7 -print > /var/tmp/deadfiles-varadm &
find /var/tmp -type f -atime +7 -print > /var/tmp/deadfiles-vartmp &
find /tmp -type f -atime +7 -print > /var/tmp/deadfiles-tmp &
# wait for the background finds to finish before removing anything
wait
# remove the files found above; xargs copes with an empty list,
# unlike rm `cat file`, which errors out when the list is empty
xargs rm -f < /var/tmp/deadfiles-varadm
xargs rm -f < /var/tmp/deadfiles-vartmp
xargs rm -f < /var/tmp/deadfiles-tmp
# clear the "deadfiles-*" lists themselves
rm -f /var/tmp/deadfiles-*

# the following is for Solaris only

# delete core files
find . -name core -exec rm {} \;
# delete crash dump files
rm -r /var/crash/<system>/*
# end

<system> is the hostname of the box that made the crash dump files...

Also, instead of -atime +n (files that haven't been accessed within the last n days), you could use -mtime +n (files that haven't been modified within the last n days); this would leave alone any files modified within the last n days, and may give you a clue as to which file or files are the culprit behind your crashes.

If that doesn't work, then you'll have to check your processes to see if any of them lock or use files in either /var/tmp or /tmp... a very slim possibility.

batmanppc 01-22-2003 02:53 PM

Quote:

Originally posted by griffman
It's not just Apache, as the server basically vanishes from existence -- ftp, ssh, etc. all fail after the crash.

-rob.
Ok, now we're getting somewhere. Might be a hardware issue.

Have you tried running the scripts manually and seeing what happens?

I've seen crashes as you're describing on high disk I/O and memory usage.

griffman 01-22-2003 03:04 PM

The scripts run every night (I believe), and they seem to succeed every night except the 10th in a row ... so is it possible that one script would trigger a hardware glitch when the others do not?

Perhaps we'll do some experimenting this weekend...

-rob.

batmanppc 01-22-2003 03:15 PM

Quote:

Originally posted by griffman
The scripts run every night (I believe), and they seem to succeed every night except the 10th in a row ... so is it possible that one script would trigger a hardware glitch when the others do not?

Perhaps we'll do some experimenting this weekend...

-rob.
Correct, something is pushing it over the edge. Since the current theory is that something happens every 10 days, it's assumed to be tmpwatch, because it's the only script with a threshold on that timescale. One thing to try is changing the 240-hour threshold to a few hours while you're testing, so that files actually get removed.

You may not actually see anything go wrong until the conditions are just right (i.e. lots of files in /tmp/).

I'm assuming each time the server crashed the ISP rebooted the machine. Was there anything on the console (if there was a monitor attached)?

Also it'd be helpful to have the full specs of the server.

breen 01-22-2003 03:22 PM

Quote:

Originally posted by griffman
The scripts run every night (I believe),
Right. At 4:02 as an earlier poster pointed out.
Quote:

and they seem to succeed every night except the 10th in a row ... so is it possible that one script would trigger a hardware glitch when the others do not?
Hmm. Let me guess -- /etc/cron.daily/tmpwatch checks files under /tmp and /var. I bet that one or both of those is running on its own partition. That could be the source of a hw glitch.

You might want to fsck those partitions.

stetner 01-22-2003 07:36 PM

Re: Help debug a macosxhints.com crash...
 
Quote:

Originally posted by griffman
The bit that cleans up the tmp directory looks like this:
Code:

% less tmpwatch
/usr/sbin/tmpwatch 240 /tmp
/usr/sbin/tmpwatch 720 /var/tmp
for d in /var/{cache/man,catman}/{cat?,X11R6/cat?,local/cat?}; do
    if [ -d "$d" ]; then
        /usr/sbin/tmpwatch -f 720 $d
    fi
done


Looking at the script above, all of those directories would be searched every single night, so hardware 'read' problems should be ruled out.

It still could be hardware, though, as there would be much more thrashing when removing a bunch of files.

tmpwatch says it only removes regular files and empty directories. So, it should leave pipes and special files alone.

It could remove an empty directory that something assumes is going to be there and that could cause application crashes when it is not (but an OS crash, hmm, I wouldn't think so).

What I would do would be to run another tmpwatch script an hour before this one runs with options:
Code:

--test
Doesn't remove files, but goes through the motions of
  removing them. This implies -v.

-v
Print a verbose display. Two levels of verboseness are available --
  use this option twice to get the most verbose output.

and output that list to a file you can peruse (after the next crash 8-) to see if anything jumps out at you.

Actually, you should be able to do this the day after the machine comes back up, using 24 as the hours argument, since the offending file (if there is one) should already be there.
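If it helps, that dry run could be scheduled an hour before the 4:02 cron.daily run with a crontab entry along these lines (the log path is a placeholder):

```
# log what tmpwatch *would* remove, an hour before the real run
02 3 * * * root /usr/sbin/tmpwatch --test -v -v 240 /tmp >> /var/log/tmpwatch-test.log 2>&1
```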

An fsck wouldn't be a bad idea, but I would think one would be run after every crash anyway...

Good luck

Edit: Moderator added line breaks to "code" tags to narrow display width...

plattapuss 01-22-2003 09:51 PM

Thanks for all the ideas on the server. Here is some more info so far:

- first and foremost, the system is up-to-date as of today.

- the logs show each cron job as it runs, tmpwatch is the last one to run before the system hangs.

- fsck is run on reboot, as it is after any reboot from a crash on Red Hat

I am putting in a small cron job called 'zz' to simply echo a word. This will let me see if that cron job gets run at all or if the system goes down on tmpwatch.

I will also include the hint from 'stetner' to run tmpwatch in test mode before the real tmpwatch cron job, to get a file listing. I will have it dump to an incremental file so 'if' the machine goes down again, I will have some more information.

We also ran a memory check, it all looks good.

To help us all out I have implemented a monitoring system to notify both myself and Rob if the server goes down again. So if I miss the notice, Rob can drag me out of bed in our lovely -25C (-35C with windchill) weather at 4:02 in the morning.

The machine these sites are on is a Dual 2.0GHz Xeon with 1GB of ECC DDR and two 120GB IDE drives (yes I have heard the arguments about IDE versus SCSI :) )

Thanks again for the help

Your friendly ISP guy
Plattapuss

jaysoffian 01-23-2003 01:09 AM

Quote:

Originally posted by plattapuss
Thanks for all the ideas on the server. Here is some more info so far:

- the logs show each cron job as it runs, tmpwatch is the last one to run before the system hangs.
I'd like to be clear about what is happening. I've read, crash, hang, etc. Some questions:

- Is the machine kernel panicking? If so, there should be an indication on the console.

- If the machine is not panicking, what is happening? Is it just disappearing from the network (that is, not even pingable)? Or does it become unresponsive on the console as well?

- I'm not sure what your syslog.conf looks like, but you might consider adding an entry to catch everything:

*.* /var/log/debug.log

then touch /var/log/debug.log and HUP syslog. Depending upon how fast this grows, you might want to rotate it via logrotate.
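A matching logrotate stanza for that debug log might look like this (a sketch; the rotation schedule and counts are arbitrary):

```
/var/log/debug.log {
    daily
    rotate 7
    missingok
    compress
}
```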

You might also want to use the syslog mark facility to pin down the time the system is crashing/hanging, though it looks like you get enough HTTP traffic that the Apache logs are sufficient to indicate the time. Of course, if the machine is just disappearing from the network but not actually hanging, the HTTP logs will stop when the network goes away even though the server may still be up.


Good luck,

j.

plattapuss 01-23-2003 07:52 AM

Hello,

- The kernel is not panicking.
- The machine locks up entirely, at the console too.

I will look into the debug.log idea, though the log will grow pretty fast with what we have going through the system, so maintainability might be a concern.

The actual time of the hang is 4:02, based on various logs and activity. Since the whole machine hangs, we feel pretty confident this is accurate. As for the actual second of the hang, we think it might be 4:02:49. Still checking on that one. :)

From our searching, it is looking more and more like something tmpwatch is doing, even though it is not normally a huge problem.

Thanks again for all the ideas, we will keep everyone posted!

Your friendly (and cold) ISP
Plattapuss

plattapuss 01-27-2003 09:11 AM

Interesting update.

Yesterday we had a similar hang. This time while running slocate in the cron jobs. This cron also works with the tmp directories. We are still investigating.

Plattapuss

breen 01-27-2003 12:28 PM

Hmmm...
 
slocate.cron on my box reads:
Code:

#!/bin/sh
renice +19 -p $$ >/dev/null 2>&1
/usr/bin/updatedb -f "nfs,smbfs,ncpfs,proc,devpts" -e "/tmp,/var/tmp,/usr/tmp,/afs,/net"

Which is explicitly ignoring any of the tmp directories and also various net filesystems.

What it's not excluding, I notice, is iso9660 and /mnt.

Do you possibly have a tripwire database on a CD mounted somewhere?

(Yeah, it's a shot in the dark...)

plattapuss 01-27-2003 12:34 PM

Okay, so I looked through slocate when I was still asleep :)

Yes it ignores those directories. I will cross-reference the ignored ones with the ones that tmpwatch looks at and see what shows.

I have no tripwire db on a cd, nope. Good thought though.

Again, thanks for all the suggestions.

Plattapuss

griffman 01-28-2003 08:18 AM

Yup, again...
 
Last night, same time ... argh. We crashed at 4:06am this time -- see below, as all crons ran this time.

[edit - cron is not causing the problem, at least not obviously]

-rob.

plattapuss 01-28-2003 08:35 AM

All cron jobs ran successfully this time. The server went down 4 minutes after the last cron job, and the last cron job was a simple script that emails me to say all the cron jobs ran successfully.

So tonight, we are moving the cron jobs to a decent hour when Rob and I are up, so we can monitor the events closely. That hint comes thanks to Rob :)

Plattapuss

griffman 01-29-2003 12:27 PM

Update...
 
So last night, we set cron to run at 10:02pm. And it did. All jobs completed. Then the server crashed.

After the reboot, Plattapuss re-ran each of the 16 crons by hand. All worked fine and the server stayed up. Quite odd.

Somewhere during all our back-and-forth discussion, he decided to disable the ensim (server management software) backup cron job. When we then re-ran the cron tasks the automated way, everything worked fine. It also worked fine again this morning in a second round of testing.

We'll see what happens tonight at 10:00pm, but it looks like this backup cron script was somehow the culprit. Now to figure out why...

-rob.

Craig R. Arko 02-04-2003 03:08 PM

Was there another event this morning? About 5:04 AM Central time, Feb. 4th? Or do things go offline briefly as part of the normal cron tasks?

griffman 02-04-2003 03:48 PM

Yep, it was offline again last night. But this time, we had all sorts of extra monitoring stuff set up, so hopefully we'll get better information on the problem...

-rob.

plattapuss 02-12-2003 04:19 AM

As you have probably all noticed, the MacOSXHints sites were down for a few hours tonight. Well, 8 hours to be precise. This downtime was needed to, hopefully once and for all, put an end to our server issues.

After much debugging we think we have found the problem, though we will not be entirely sure for a couple of weeks.

I do apologize for such a long down time, and hopefully we won't have to endure any more down time for a long long long time :(

Good night

Plattapuss

wiseguy 02-14-2003 06:05 AM

Considering that www.microsoft.de (the German Microsoft site) was down this morning for more than 3 hours, you're doing a great job, given that they probably have 30 times the manpower you have for troubleshooting. ;)

plattapuss 02-19-2003 09:00 AM

To follow up: our server was swapped out for a complete hardware change last week. Since then we'd had no crashes... until this morning at about 4:04. Same problem again.

So we can pretty safely rule out a hardware issue at this point. We are now investigating the issues that our version of Redhat might have with the Dual Xeon machines.

As always, if anyone has any thoughts...

Plattapuss

Craig R. Arko 02-19-2003 09:06 AM

Check the RedHat driver for the NIC. MacFixIt had a similar problem and traced it to needing an ethernet driver update from Intel.

griffman 02-20-2003 11:33 AM

The problems seem to have returned; cron has crashed the machine each of the last two nights. I talked to the MacFixIt folks; unfortunately, that's not our problem (they were getting log messages about the card prior to failure; our machine just vanishes).

Beginning tonight, we're going to start a form of "binary tree" analysis by splitting the cron jobs in half. If we can figure out which half crashes the box, we'll split that half again, etc. If they don't crash the machine when split the first time, I guess we'll just leave it that way for a while.
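On Red Hat the nightly jobs are driven by a run-parts line in /etc/crontab, so one way to do the split is two directories on separate schedules (the directory names here are invented for illustration):

```
# /etc/crontab sketch: run each half of the former cron.daily separately
02 4 * * * root run-parts /etc/cron.daily-A
02 5 * * * root run-parts /etc/cron.daily-B
```

Whichever time slot precedes the next crash points at the guilty half.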

I know it's frustrating for the readers, and very frustrating for Plattapuss and me -- this is now a 100% new machine (motherboard, RAM, hard drives, cards, etc.) and yet the problem persists, so one would now assume it's a software problem...

Sorry for the downtime, and as always, troubleshooting suggestions are welcome.

-rob.

plattapuss 02-20-2003 12:30 PM

I second what he said :)

Plattapuss

stetner 02-21-2003 01:52 AM

I have had a similar problem with HP hardware: at backup time one night, one machine just up and hung. Panic ensued, the machine was power cycled, and it came up OK. A few nights later it did it again. It was determined to be the drive firmware; once that was updated, all was fine. A few days later another server did the exact same thing (before we could schedule downtime and check all our machines, of course), same old firmware...

In all, three machines 'disappeared' like yours is doing. Firmware updates fixed two of the three problem machines (one would not accept the upgrade, so its hardware was replaced), and all our other machines were updated before any trouble occurred.

Now, I know you have replaced the disk hardware, but could the new drives be running the same firmware?

Our problem came out of the blue as well: these machines had been running for 6, 12, 18 months with no problems, then one day, Ack! And all within two weeks!

Hope you find it soon though....

plattapuss 02-21-2003 09:13 AM

Great! I am looking into it as we speak.

Plattapuss

breen 02-28-2003 11:47 AM

Something to look for
 
I did some research yesterday based on a post to another list and discovered that RH 7.3 has a bug in logrotate.

It gets tickled by code like this in /etc/logrotate.d/mailman:

Code:

/var/log/mailman/* {
    missingok
}

The wildcard causes logrotate to go into a loop rotating the logfiles and, according to the person who posted the problem report, to consume all available cycles.

I imagine it would also cause you to run out of inodes on /var if this went on long enough.

I'd look at all the files in /etc/logrotate.d to see if you've got a rogue wildcard.

Breen

plattapuss 02-28-2003 12:19 PM

Good thought. I wish that were the solution, but alas, we do not have any wildcards in any of our logrotate.d conf files.

We do have another new machine and hard drive again, this time a completely different brand and size of drive. The working theory is still an incompatibility between the Dual Xeon and the drive make and size we were using.

So far we are excited over:

12:18pm up 3 days, 18:01

:rolleyes:

Plattapuss

breen 02-28-2003 01:00 PM

Quote:

Originally posted by plattapuss
Good thought. I wish that were the solution. But alas, we do not have any wildcards in any of our logrotate.d conf files.
Well, it was worth a try...

Quote:

So far we are excited over:

12:18pm up 3 days, 18:01

Plattapuss
So it's umm... six days, six hours to the moment of reckoning.

We'll keep our fingers crossed...

Breen

KING_PEACH 05-25-2003 01:03 AM

Systems Crash Auto Reboot Script
 
I have kind of the same problem: my Red Hat system seems to crash right on time, every time -- around 4:20 am every Saturday morning and around 8:20 pm Saturday night. I believe this crash is due to a mail cron that sends out member mail and reminders to the users on my site. My logs show no reason for a crash, and no errors are logged during crash time. I did just now add the catch-all log to my syslog (thank you), hoping it will help me troubleshoot my problem. But what I would really like is some type of script that would reboot my PC when this crash happens. Right now I have to run into town every time just to flick a switch, and then it runs fine all week. Does anyone have a script that can reboot a PC on a crash? I was thinking wake-on-LAN, but that basically only boots the PC from a cold shutdown. Since ssh no longer works when the PC crashes, an auto-reboot script would be great.

Please help.. Peach :confused:
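One approach for unattended recovery is a watchdog (assuming a kernel with the softdog driver and the watchdog daemon package installed; untested on your setup): the kernel reboots the box if the daemon stops writing to /dev/watchdog, which is exactly what happens when the machine wedges.

```
# /etc/rc.d/rc.local fragment (a sketch):
modprobe softdog          # kernel software watchdog driver
/usr/sbin/watchdog        # daemon that periodically writes /dev/watchdog
```

A caveat: if the kernel itself hangs hard, a software watchdog can hang with it; a hardware watchdog card is more robust for that case.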

plattapuss 05-25-2003 09:20 AM

Just realized that we never did post our findings about the crashes. It turns out it WAS a hard drive problem: we had an incompatibility between RH and the drives we were using.

Since my servers are at a colo, I never did know exactly which drives we had; suffice it to say they were two 120GB drives, not in a RAID configuration.

Apparently the thrashing that occurs during the cron hours pushed the inode count way too high, and RH would crap out once the drives were accessed with the inode count up that high.

So we dropped in two 80GB drives from a different manufacturer, and all has been well, with over 89 days of uptime.

Thanks to everyone who contributed.

Plattapuss

KING_PEACH 05-25-2003 07:56 PM

Thanks
 
Thanks plattapuss, it was interesting reading your article, and I did get help from it as well... hard drive, hmmmm... :rolleyes:



Powered by vBulletin® Version 3.8.7
Copyright ©2000 - 2014, vBulletin Solutions, Inc.
Site design © IDG Consumer & SMB; individuals retain copyright of their postings
but consent to the possible use of their material in other areas of IDG Consumer & SMB.