[rhelv6-list] Problems with XFS leaving 0-length files in RHEL 6.2 ... ?

Thu Nov 8 21:24:53 UTC 2012

On Sat, 2012-10-13 at 10:27 -0400, Paul Smith wrote:
> On Sat, 2012-10-13 at 05:53 +0000, Edmund White wrote:
> > What are the names of these files? Can you see when these files are
> > created? This may not be an XFS-specific issue.

Hi all.  Back again with this problem.  For a while we were busy with
other things and we didn't see this problem so much but now it's back
with a vengeance and I need to work out the fix or a workaround.

I've managed to determine that this problem happened on one of my
systems (back on Oct 23!).  It appears that all of these failures (that
I've seen so far) are associated with a kernel panic or reboot of some
kind.  I created a cron job that runs hourly looking for 0-length files,
and I discovered the introduction of a whole bunch of them between 8am
and 9am on Oct 23:

# wc -l findzero.20121023080101 findzero.20121023090102
   400 findzero.20121023080101
   789 findzero.20121023090102

Within that hour, 389 new 0-length files appeared.  I looked at these
files and they included programs (executables) that had been installed
on the system the day before, Oct 22 at about 11am, over 18 hours
previously.  The timestamps on these files were still set to the Oct 22
time.  Obviously, as programs, they will not be opened for writing (just
opened for read by the runtime linker when executed).  I checked the
logs for these programs and they were successfully invoked AFTER they
were installed, but were not being used anywhere close to the hour in
question (these are command line tools, not daemons, and they write to a
log every time they are used).
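
A minimal Python sketch of this kind of hourly check follows.  It is not
the actual cron job (which isn't shown in this message); the function
names and the idea of passing in a root directory are assumptions made
for illustration.  Only the snapshot file naming (findzero.<timestamp>)
is taken from the listing above.

```python
import os
import stat
import time


def find_zero_length(root):
    """Return the sorted paths of regular files under root with size 0."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode) and st.st_size == 0:
                hits.append(path)
    return sorted(hits)


def write_snapshot(paths, directory):
    """Write one path per line to findzero.<YYYYmmddHHMMSS>; return its name."""
    name = os.path.join(directory, time.strftime("findzero.%Y%m%d%H%M%S"))
    with open(name, "w") as out:
        for path in paths:
            out.write(path + "\n")
    return name


def new_zero_length(before, after):
    """Paths that are zero-length in the later scan but not the earlier one."""
    return sorted(set(after) - set(before))
```

Comparing two consecutive scans with new_zero_length() gives the same
kind of delta as the wc -l counts above, but also names the affected
files so their timestamps and install dates can be checked.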

I looked at /var/log/messages and sure enough, at 8:09am I see that my
system started to have some trouble:

Oct 23 08:09:12 kernel: eth1: Link down
Oct 23 08:09:12 kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Oct 23 08:09:42 kernel: eth1: Link up
Oct 23 08:09:42 kernel: bond0: link status definitely up for interface eth1, 10000 Mbps full duplex.
Oct 23 08:09:48 kernel: eth0: Link down
Oct 23 08:09:48 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Oct 23 08:10:01 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Oct 23 08:10:01 kernel: Do you have a strange power saving mode enabled?
Oct 23 08:10:01 kernel: Dazed and confused, but trying to continue

Then the next thing in the log is the beginning of a new kernel boot, so
obviously the system went down here:

Oct 23 08:15:57 kernel: imklog 4.6.2, log source = /proc/kmsg started.
Oct 23 08:15:57 rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="2512" x-info="http://www.rsyslog.com"] (re)start

etc.
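
To line up the snapshots with crashes, the restart points can be pulled
out of the log mechanically.  The sketch below is only an illustration
under stated assumptions: it treats an "imklog ... started" line as the
marker of a fresh boot (as in the excerpt above), assumes each line
starts with a 15-character syslog timestamp, and takes the year as a
parameter because syslog lines don't carry one.  The function names are
hypothetical.

```python
from datetime import datetime


def parse_stamp(line, year=2012):
    """Parse the leading 'Mon DD HH:MM:SS' stamp; the year is supplied."""
    return datetime.strptime("%d %s" % (year, line[:15]),
                             "%Y %b %d %H:%M:%S")


def find_restarts(lines, year=2012):
    """Return (last_line_before, boot_line, gap_seconds) for each restart.

    Assumption: a line containing both 'imklog' and 'started' marks the
    start of logging after a boot.
    """
    restarts = []
    previous = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if "imklog" in line and "started" in line and previous is not None:
            gap = parse_stamp(line, year) - parse_stamp(previous, year)
            restarts.append((previous, line, gap.total_seconds()))
        previous = line
    return restarts
```

Run over the lines quoted above, this would report a single restart
with a gap of a little under six minutes between the last NMI message
and the first line of the new boot.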

This seems bad to me.  I can accept that in the event of a kernel panic
or system crash, some files that are "in flight" might be zero'd out.
But I can't accept that it's OK for almost 400 files that were
written over 18 hours before and, all evidence shows, were not being
actively used when the crash happened, to be zero'd out like this.
Surely this cannot be "just the price we pay"?

I'm looking at the kernel panic separately (seems to be a common thing
as well) but I need the filesystem to be more stable than this in any
event.

Anyone have any ideas?