
Re: kjournald panic in 2.4.20 RedHat 7.2

Hi, Thanks for the reply. I received another response directly and want
to include my reply here:

> > Did you use the ext3 patches for the vanilla 2.4.20?
> >
> > http://www.zip.com.au/~akpm/linux/ext3/
> ...
> I have not applied those patches, just bare
> 2.4.20 with the cyclades and ipsec patches. It will be difficult to do
> since the machine is in a colocation cabinet and will be down at least
> one hour should something go wrong. Do you recommend I stick with ext2
> for now? Is the switchover really as easy as it sounds? I love the quick
> recovery times in ext3 but since it is a production machine with millions of
> users' files I cant manage applying kernel updates regularly, as cop out as
> that may sound... In this case do you recommend I just stick with ext2?

Any comments?

> > mofo kernel: Assertion failure in journal_stop() at transaction.c:1384: "journal_current_handle() == handle"
> Odd.  That is one assert failure I have _never_ seen reported.  Handle
> mismatches in journal_start have happened from time to time when there
> has been illegal recursion in the VM, but not in journal_stop.
> The most likely cause would seem to me to be a stack overflow --- the
> per-process field which holds the journal handle is right at the end of
> the task struct, so it's one of the first fields to be clobbered in the
> event of a stack overflow.
> If that has happened, it's not due to ext3 --- the stack here isn't
> close to being that sort of size --- but it's entirely possible that
> there were IRQ routines operating during the function which overflowed
> the stack.
> In particular, we've seen that happen before with heavy network
> activity, especially with multiple NICs, because the random sampling
> that occurs for /dev/random during NIC activity was a heavy stack user.
> There's a patch to address that in the very latest Marcelo kernel
> trees.  It reduces the stack usage of the random sampling by several
> hundred bytes.  The fix is in the 2.4.21-pre7 kernel.

No surprise you have not seen it. I found nothing about it on Google Groups
searching with various strings. The whole issue is possibly moot since I
do not have the patch mentioned above applied to the kernel.

I think my machine matches your profile for heavy network activity, with two
fast ethernet interfaces and a T1 link. There is a lot piled on the
machine at the moment as we are in the process of migrating colocations
and are short a few machines. At the moment this fine server is acting as
a firewall, a zebra/bgp router on two interfaces, a mysql server, a file server
to NT clients (smbd), and it runs some heavy cpu processes for document
conversions. Granted, that is a lot of stuff, but at the time of the crash
the machine was not doing much other than copying the file. Its network
activity peaks at about 2 Mb/sec on one ethernet and 1.5 Mb/sec on its T1
and runs all day like this. Though there could have been a burst in activity,
overall it was very quiet at the time in terms of network and processor load.

I am not totally clear on the network load theory. Is the susceptibility to
stack overflow something particular to kjournald, or is it that the network
load could cause a crash in any kernel process? Could it crash a non-kernel
process, like mysqld? This really is the first time the machine has crashed,
and I have seen it run smoothly every afternoon at load averages of 5 or 6
during peak periods and saturating its T1. Relatively speaking, the fast
ethernet loads are not very high. I am a little skeptical of this since we have
been running a dozen servers with similar setups (i.e. mysql, samba, external
scsi raid, firewalling on 2 interfaces, less the T1 interface) for several
years with ext2 and have never seen a crash that was not due to hardware
failure. Granted, those machines have no kjournald running to crash :) It's
just that I have never seen network loading cause a crash in much more heavily
loaded servers. It could be some external interference that caused the stack
overflow but the network activity was really low. Maybe the scsi activity? I
might also suspect the pc300 driver as I have not used the card before, but
then again 1.544 Mbps with 1500-byte packets doesn't generate many interrupts.
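
As a rough back-of-the-envelope check on the T1 interrupt rate, here is a small
Python sketch. It assumes a nominal T1 line rate of 1.544 Mbps, full-size
1500-byte packets, and at most one interrupt per packet; all three are my
assumptions, and smaller packets would push the rate higher.

```python
# Rough upper bound on packets per second over a saturated T1.
# Assumptions (not measured on the machine in question):
#   - nominal T1 line rate of 1.544 Mbps
#   - every packet is a full 1500 bytes
#   - roughly one interrupt per packet

T1_BITS_PER_SEC = 1_544_000
PACKET_BYTES = 1500


def packets_per_second(link_bps, packet_bytes):
    """Return how many packets of the given size fit through the link per second."""
    return link_bps / (packet_bytes * 8)


pps = packets_per_second(T1_BITS_PER_SEC, PACKET_BYTES)
print(f"{pps:.1f} packets/sec")  # prints "128.7 packets/sec"
```

So even a saturated T1 would be on the order of a hundred-odd packets per second
under these assumptions, which is why I doubt the pc300 card alone is to blame.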

One note on the scsi activity... With the machine online one night 2 weeks back
I copied about one million files consuming about 60 GB from one of the ext2 scsi
partitions to sda4 (ext3) without a problem.

> > The file I was moving as you can see is a 2 GB file, ie. right at the limit of
> > ext2 capacity, and I am wondering if this is the culprit.
> No, ext2/3 can both operate beyond 2GB quite safely.

ext2 definitely can only handle files less than 2GB in size. If you script
something to write past this limit to a file (or, ahem, forget to truncate a
large table and mysqld does it for you) you will see the file gets to
2147483647 bytes and any new writes will block or fail. This is the size of
the file I was copying.
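
For what it's worth, that byte count is exactly 2^31 - 1, the largest value a
signed 32-bit integer can hold, which is what one would expect if a 32-bit file
offset is involved somewhere. The Python sketch below only checks the
arithmetic. It does not show which layer (filesystem, libc, or application)
imposes the limit, and the helper name is made up for illustration.

```python
import struct

# Largest value representable in a signed 32-bit integer.
INT32_MAX = 2**31 - 1


def fits_in_signed_32bit(offset):
    """Return True if `offset` can be packed into a signed 32-bit integer.

    Hypothetical helper for illustration only.
    """
    try:
        struct.pack("<i", offset)
        return True
    except struct.error:
        return False


print(INT32_MAX)                             # prints 2147483647
print(fits_in_signed_32bit(INT32_MAX))       # prints True
print(fits_in_signed_32bit(INT32_MAX + 1))   # prints False
```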

So getting back to my self-interest here... what would you recommend I do?
It sounds like you believe using ext2 will not improve things, i.e. that the
network ISR activity is a likely culprit. Should I try the patches here


or try 2.4.21-pre7? In the latter case, should I still apply the 2.4.20 patches?

Also, do you have a ballpark figure on how time consuming it would be to
convert my ext2 partitions to ext3, with them unmounted? One is 150 GB and
the other 190 GB. Each partition has between 500k and 3 million files across
maybe 15 directories, if that's a factor. Are we talking 20 minutes or 5+ hours?
ext2 fscks on the 150 GB partition can take 4 hours. I may opt for using ext2
for now and switching back to ext3 when I can physically mess with the server
to do the kernel updates, as much as I hate to do that. The uptime benefits of
ext3 are too good to ignore.

I really appreciate your help.

