RE: Nasty ext3 errors 2.4.18


Thanks for the reply.

> The assertion is easy enough to explain - ext3 reserves a fixed number
> of journal blocks to do a transaction, but as a result of the many
> errors that happen it runs out of reserved blocks while correcting the
> errors before it can complete the operation.  I'm not sure whether that
> should be "fixed" or not (e.g. we could try to extend the transaction
> if we hit such an error case), because having so many errors is just
> asking for further corruption down the road.

Granted, although the hanging process is a major nuisance. I could
turn on reboot on panic, but then I'd probably not notice that
anything had happened!
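
To check my understanding of the reserved-block explanation above, here
is a toy model in Python. It is only a sketch and not the kernel code:
the Handle class, run_operation() and all of the block counts are made
up for illustration. The one assumption it encodes is that every
modified block, including the extra writes done while handling an
error, uses up one block of a reservation that was fixed up front.

```python
class CreditsExhausted(Exception):
    """Raised when an operation needs more journal blocks than it reserved."""


class Handle:
    """Hypothetical stand-in for a journal handle with a fixed reservation."""

    def __init__(self, reserved_blocks):
        self.credits = reserved_blocks

    def dirty_block(self, what):
        # Every modified block consumes one reserved journal block.
        if self.credits == 0:
            raise CreditsExhausted(f"no reserved journal blocks left for {what}")
        self.credits -= 1


def run_operation(reserved_blocks, normal_updates, error_fixups):
    """Return True if the operation fits in its reservation, else False."""
    handle = Handle(reserved_blocks)
    try:
        for i in range(normal_updates):
            handle.dirty_block(f"update {i}")
        # Error handling was not budgeted for, but it costs blocks too.
        for i in range(error_fixups):
            handle.dirty_block(f"error fixup {i}")
    except CreditsExhausted:
        return False
    return True


# Illustrative numbers only.
clean = run_operation(reserved_blocks=8, normal_updates=6, error_fixups=0)
damaged = run_operation(reserved_blocks=8, normal_updates=6, error_fixups=5)
print(clean, damaged)
```

In the toy model the damaged case simply returns False. On the real
box the equivalent condition is the assertion, which is what leaves the
process hanging.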

> Well, any kind of prior error is usually a bad sign, because it could
> mean memory corruption, not enough free memory to do operations, etc.

Yeah, I agree, and they do worry me, although I've seen them (very
occasionally) on other boxes that then carried on for months. I don't
know if this is a 2.4.18 bug - I've seen others report it on
linux.kernel. I'm loath to change kernels to 2.4.20 as I've seen
reports of ext3 errors being introduced there (however minor), so I was
going to wait for 2.4.21 (which I see is in pre form at the moment)
before trying that out on a customer. I have been using 2.4.21-pre11 on
non-customer sites and it has appeared to be OK, so that is a
possibility. I don't like trying out new kernels on customers just to
see if they 'cure' a problem, though, unless I have proven to myself
that they do - and so far I've not reproduced this disk problem in the
office!

> One option is always to disable DMA on the IDE chipset in case that is a
> source of problems.  Not to deny the possibility that the error is in
> ext3, but it is also possible that the problem is in the capture cards
> or drivers, or bad interaction on the PCI bus or something.

Yeah, it's a bit of a nightmare pinning the problem down. I've got boxes
on other sites running under the same load (and higher), on the same
hardware, which have not shown any problems at all. It is possible that
we've got a run of bad boards - we bought 15 from the same supplier.

> It may be load related, if you are not stressing the box as much as the
> customers are...  Is it possible to configure only a single capture card
> in a box for some period of time?

It's really only possible to do that in our office. Generally we push
the boxes just as hard as the customer will (harder in some cases) and
try to let them run for as long as possible before shipping them out
(usually 1 week).

Thanks again,


