forced fsck (again?)

Bryan Kadzban bryan at kadzban.is-a-geek.net
Sat Jan 26 02:02:56 UTC 2008


-----BEGIN PGP SIGNED MESSAGE-----
Hash: RIPEMD160

Andreas Dilger wrote:
> On Jan 24, 2008  22:20 -0500, Bryan Kadzban wrote:
>> #  Run this from cron each night.
> 
> Probably once a week is enough, and "/etc/cron.weekly" (anacron) exists
> on most systems and will ensure that if the system was off for more than
> a week it will still be run on the next boot.

Yeah, it's probably true that once per week is enough.  Do you think it
would still make sense to try and parse out the last-check time from the
LV if this gets run each week, or just unconditionally check everything
(if on AC)?  Checking everything weekly might be too often (especially
if the extra disk usage ends up exposing bad bits on a disk), but maybe
not.

> I would recommend also using "logger" to log something in /var/log/messages.

Yeah, that makes sense.  logger is part of util-linux{,-ng}, so that's
not a huge extra dependency either.

>> echo "Don't know how to set the last-check time on $fstype..." >&2
> 
> These error messages are incorrect, namely "set the last-check time" should
> be replaced with "force a check".

That's true.  I was trying to get the errors to refer to what specific
information needed to be added to the script (in this case, it needs to
know how to set the last-check time), but "force a check" is probably
safer anyway.  Setting the last-check time may not be the method that
every FS uses.

> Since there isn't any reason to special
> case reiserfs here, you may as well remove it.

That's what I get for deciding to handle reiser separately everywhere,
and then changing my mind later -- I forgot to go back and remove this
case.  Oops...  :-)

> I suspect that a nice email to the XFS and JFS folks would get them to add
> some mechanism to force a filesystem check on the next reboot.

Is the issue that those FSes don't have any such mechanism today, or is
it just that I don't know how to do this on them?

(I'll have to go look up the XFS/JFS lists, too, but that's not terribly
difficult.)

>> nice logsave -as "${tmpfile}" fsck.${fstype} -p -C 0 "$dev" &&
>> 	nice logsave -as "${tmpfile}" fsck.${fstype} -fy -C 0 "$dev"
> 
> Hmm, I'm not sure I understand what it is you want to do?

Well, neither do I, necessarily -- those arguments were copied from the
initial script that I hacked the extra stuff into (the one that Ted
posted at the start of this whole thing).  :-)

I see that your script just uses -fn; that's probably simpler anyway.
What it doesn't determine is whether fsck would be able to automatically
repair the damage that it finds; I guess the question is whether this
condition should be treated as a fsck failure (requiring a reboot to
fix) or not.  It probably depends on the severity of the fixes that fsck
makes...

OTOH, if you give e2fsck the -fy option, and it does make changes, its
exit status will not be zero, so it will already be treated as a failure
by this script.  So the only difference is that -fn stops it from
writing to the snapshot just to have the writes thrown away; that's
probably actually good.

> and "-p" without "-f" will just check the superblock.

Yeah, I think the idea was to check the superblock first, and then check
the rest of the FS.  But I think -fn is probably more explicit about
what we want fsck to do, too.

(Plus, even if we do take a read-write snapshot with LVM2, there's no
point in taking up extra space by writing to the snapshot itself, if
it's just going to get thrown away.)

> For the log file it probably makes sense to keep this around with a
> timestamp if there is a failure.

And let e.g. logrotate get rid of older versions; yeah, that makes
sense.
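
A rough sketch of keeping the log around after a failure (not taken from
the attached script; the function name and the file naming scheme are
made up for illustration):

```shell
# Hypothetical helper: move the temporary fsck log to a timestamped file
# so that something like logrotate can expire old copies later.
#   $1 = temporary log file, $2 = destination directory, $3 = LV name
keep_log() {
    dest="$2/lvcheck-$3-$(date +%Y%m%d-%H%M%S).log"
    # Print the final path only if the move worked.
    mv "$1" "$dest" && echo "$dest"
}
```

This would only be called when the check fails; on success the temporary
file can simply be removed.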

> To find free space, use "vgs -o vg_size --noheadings ${vg}", and the
> LV size can be had from "lvs -o lv_size --noheadings ${vg}/${lv}".

Free space can also be retrieved with -o vg_free, but yeah.

> You can strip the size suffixes with "--units M --nosuffix" to get
> units of MB.

Ah, that was the bit I was missing yesterday (further down in the
script): --nosuffix.  Thanks!

I also just got your message from yesterday about the guess behind the
<LV size/500> (based on the frequency of writes to the main LV); that
makes sense.  And since I can get the size out of lvs, that makes that
much easier, too, so I'll just use 1/500th the LV size.
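
For illustration, the sizing could be sketched like this. The function
names are hypothetical, and the inputs are assumed to be the strings
printed by lvs/vgs with "--units M --nosuffix --noheadings" (padded with
spaces, with a fractional part such as "10240.00"):

```shell
# Hypothetical helpers, not the attached script.
# Example inputs (need LVM and root, so they are not run here):
#   lv_size="$(lvs -o lv_size --noheadings --units M --nosuffix "$vg/$lv")"
#   vg_free="$(vgs -o vg_free --noheadings --units M --nosuffix "$vg")"

# Snapshot size in MB: 1/500th of the LV size, rounded down.
snap_size_mb() {
    size="$(echo "$1" | tr -d ' ')"   # strip the padding
    size="${size%%.*}"                # drop the fractional part
    echo $(( size / 500 ))
}

# Succeeds if the VG free space ($1) covers the snapshot size ($2, whole MB).
has_room() {
    free="$(echo "$1" | tr -d ' ')"
    [ "${free%%.*}" -ge "$2" ]
}
```

The integer division rounds down, so a very small LV would come out as 0;
a real script would probably want a minimum size.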

> Also good to create a more unique name than "${lv}-snap", since that
> might conflict with an existing snapshot, and if the script crashes
> the user might be wondering if that LV using 100% of the free space is
> safe to delete or not.

Yeah, that was left over from the original script as well.  Changing it
makes sense.
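
One possible naming scheme, as a sketch only (the "lvcheck-temp" tag and
the function name are assumptions, not what the attached script uses):

```shell
# Hypothetical helper: build a snapshot name that says where it came from
# and is unlikely to clash with an existing LV.
#   $1 = name of the LV being checked
snap_name() {
    # Tag + timestamp + PID of the running script.
    echo "$1-lvcheck-temp-$(date +%Y%m%d%H%M%S)-$$"
}
```

A name like this also tells the user that a leftover snapshot belongs to
the check script and is safe to delete.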

> Please also add XFS support here,

Done, I think.  I assume xfs_check doesn't need any args?

(Should fsck.xfs perhaps just exec xfs_check and pass it all the args?
That's a whole separate discussion, probably.)

> For JFS it can also use "fsck.jfs -fn $dev" to check the filesystem.

Done.
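
Put together, the per-filesystem dispatch might look roughly like the
sketch below. The function name is made up, and the commands are only the
ones mentioned in this thread (xfs_check is assumed to take just the
device):

```shell
# Hypothetical helper: print the read-only check command for a filesystem.
#   $1 = filesystem type, $2 = device (the snapshot)
check_command() {
    fstype="$1"
    dev="$2"
    case "$fstype" in
        ext2|ext3) echo "fsck.$fstype -fn $dev" ;;
        jfs)       echo "fsck.jfs -fn $dev" ;;
        xfs)       echo "xfs_check $dev" ;;
        *)
            echo "Don't know how to check $fstype filesystems." >&2
            return 1
            ;;
    esac
}
```

Printing the command instead of running it keeps the sketch harmless; a
real script would run the checker directly and look at its exit status.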

>> echo 'Background scrubbing succeeded!'
>> echo 'Background scrubbing failed! Reboot to fsck soon!'
> 
> Printing the device name in these messages, and sending them to the syslog
> via logger would probably be more useful.

True; done.  The severity may need a bit of tweaking, but hopefully not
much.
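
As a sketch of what the reporting could look like (the function names,
the "lvcheck" tag, and the chosen priorities are assumptions, not the
attached script):

```shell
# Hypothetical helper: build the message, including the device name.
#   $1 = device, $2 = exit status of the check
result_message() {
    if [ "$2" -eq 0 ]; then
        echo "Background scrubbing of $1 succeeded."
    else
        echo "Background scrubbing of $1 failed (fsck status $2)! Reboot to fsck soon!"
    fi
}

# Hypothetical helper: send the message to syslog, or to stderr if logger
# is not installed.
log_result() {
    if [ "$2" -eq 0 ]; then prio=user.info; else prio=user.err; fi
    msg="$(result_message "$1" "$2")"
    if command -v logger >/dev/null 2>&1; then
        logger -t lvcheck -p "$prio" "$msg"
    else
        echo "$msg" >&2
    fi
}
```

The priorities here (user.info and user.err) are just a starting point
and are exactly the part that may need tweaking.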

>> set -e
> 
> Have you verified that the script doesn't exit if an fsck fails with an
> error?

No, the script exits if fsck fails with an error.  That's obviously bad
-- I wasn't thinking that far ahead when I added that.  It's gone now.
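
Without "set -e", the failure has to be handled explicitly. A minimal
sketch of capturing the status instead of aborting (the function name is
hypothetical):

```shell
# Hypothetical helper: run a command and report its exit status without
# ever terminating the calling script.
run_check() {
    if "$@"; then
        echo "ok"
    else
        status=$?    # exit status of the command tested by "if"
        echo "failed with status $status"
    fi
    return 0
}
```

A command tested by "if" does not trigger "set -e" either, so this
pattern keeps working even if that option comes back later.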

>> . /etc/lvcheck.conf
> 
> You should check that this file exists before sourcing it, or the script will
> exit with an error

That was intended; I figured the config file would be required (back
when I first added it).  But since we have decent default values for the
settings in it, it probably makes sense to make it optional now.

>> FSTYPE="`/lib/udev/vol_id -t "$DEV"`"
> 
> Please use "blkid", since that is part of e2fsprogs already and avoids
> an extra dependency.

True.  Looking at the manpages, it appears that vol_id does some extra
checks to try to detect RAID members as RAID members, instead of
partitions containing a filesystem.  But that would only affect this
script if someone had multiple LVs RAIDed together, and I doubt that's
well-supported elsewhere, so blkid is fine.
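
For reference, the blkid replacement could be as small as this sketch
(the function name is made up):

```shell
# Hypothetical helper: print only the filesystem type of a device.
#   $1 = device
detect_fstype() {
    # -s TYPE selects the TYPE tag; -o value prints just its value.
    blkid -s TYPE -o value "$1"
}
```

If blkid does not recognise the device the output is empty, which the
caller would have to treat as an unknown filesystem.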

>> # if the date is unknown, run fsck every day.  sigh.
> 
> Better to write "run fsck each time the script is run".

Yeah, that makes more sense.
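
The decision itself is simple arithmetic. A sketch, assuming the
last-check time has already been converted to seconds since the epoch
(for ext2/ext3, for example, from the "Last checked" line of
"dumpe2fs -h"); the function name is hypothetical:

```shell
# Hypothetical helper: succeed if a check is due.
#   $1 = last check in seconds since the epoch (empty if unknown)
#   $2 = current time in seconds since the epoch
#   $3 = interval in days
check_due() {
    if [ -z "$1" ]; then
        return 0    # date unknown: run fsck each time the script is run
    fi
    [ $(( $2 - $1 )) -ge $(( $3 * 86400 )) ]
}
```

It would be called along the lines of
check_due "$last" "$(date +%s)" "$INTERVAL".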

>> #  ??? -- can lvs print vg_free in plain numbers, or do I have to
>> #  figure out what a suffix of "m" means?  skip the check for now.
> 
> "vgs", and --nosuffix, per above.

Yep, done.

>> EMAIL='root'
>> INTERVAL=30
>> AC_UNKNOWN="ABORT"
> 
> I would also make these all be defaults in the script (before this file is
> parsed), so it works as expected if /etc/lvcheck.conf doesn't exist.

Since it's now optional, yes, that makes sense.
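
A sketch of that ordering: defaults first, then the optional file on top.
The function wrapper is only there for illustration, and the AC_UNKNOWN
value is the default settled on later in this message:

```shell
# Hypothetical helper: set the defaults, then let the config file (if it
# exists and is readable) override them.
#   $1 = path to the config file, e.g. /etc/lvcheck.conf
load_config() {
    EMAIL='root'
    INTERVAL=30
    AC_UNKNOWN='CONTINUE'
    if [ -r "$1" ]; then
        . "$1"
    fi
}
```

A missing file is then not an error, and a config file only needs to
list the settings it wants to change.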

> I'd also recommend that the default for AC_UNKNOWN be CONTINUE (or possibly
> leave it unset by default and have the script not error out in this case),
> so that the script does something useful for the majority of users.

Well, it depends on whether the majority of users have laptops, or some
other hardware type (desktops, servers, etc.).  I was thinking that
laptops would be more prevalent, but since this is Linux, it's probably
actually servers.  OK -- CONTINUE it is, by default.

> If we are worried about the laptop case, we could add checks to see
> if the system has a PC card, since very few desktop systems have them.
> Both the commands "pccardctl info" and "cardctl info" produce no output
> on stdout if there is no PC card slot, and this could be used to decide
> between "CONTINUE" for desktops and "ABORT" for laptops.

Or stuff it into comments in the config file.  Pushing the decision back
onto the user makes me a bit uncomfortable, but fuzzy decisions (ones
that aren't necessarily based on the right info) make me even less
comfortable.  Hmm.  And depending how the power_supply sysfs class ends
up working, maybe this is all a moot point anyway: if it always has
devices under it on >=2.6.24, then the setting won't even matter.

For now, I'll just leave the default CONTINUE, but with comments in the
config file aimed at laptop users.
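
For the power_supply case, the check might look something like the sketch
below. It assumes each device directory has a "type" file (containing
"Mains" for an AC adapter) and an "online" file (1 or 0); the function
name and the return codes are made up:

```shell
# Hypothetical helper: 0 = on AC, 1 = on battery, 2 = cannot tell.
#   $1 = sysfs directory to scan (defaults to /sys/class/power_supply)
on_ac_power() {
    root="${1:-/sys/class/power_supply}"
    result=2
    for dir in "$root"/*; do
        [ -r "$dir/type" ] && [ -r "$dir/online" ] || continue
        [ "$(cat "$dir/type")" = "Mains" ] || continue
        if [ "$(cat "$dir/online")" = "1" ]; then
            return 0
        fi
        result=1    # an AC adapter exists but is not online
    done
    return "$result"
}
```

AC_UNKNOWN would only be consulted when this returns 2, that is, when no
AC adapter shows up under the class at all.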

----

Create a script to transparently run fsck in the background on any
active LVM logical volumes, as long as the machine is on AC power, and
that LV has been last checked more than a configurable number of days
ago.  Also create an optional configuration file to set various options
in the script.

Signed-Off-By: Bryan Kadzban <bryan at kadzban.is-a-geek.net>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHmpTOS5vET1Wea5wRA2XXAKCZzt9SEOSBVs4EkrI4gt3Ztl0v5wCg3gq5
1ChmnEccT+hFVo/2B/RpU8U=
=D4HV
-----END PGP SIGNATURE-----
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: lvcheck
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20080125/12ce214b/attachment.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: lvcheck.conf
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20080125/12ce214b/attachment.conf>

