forced fsck (again?)

Fri Jan 25 08:55:57 UTC 2008

On Jan 24, 2008  22:20 -0500, Bryan Kadzban wrote:
> #  Run this from cron each night.  If the machine is on AC power, it
> #  will run the checks; otherwise they will all be skipped.  (If the
> #  script can't tell whether the machine is on AC power, a setting in
> #  the configuration file (/etc/lvcheck.conf) decides whether it will
> #  continue with the checks, or abort.)

Probably once a week is enough, and "/etc/cron.weekly" (anacron) exists
on most systems and will ensure that if the system was off for more than
a week it will still be run on the next boot.

> #  Any LV that passes fsck will have its last-check time updated (in
> #  the real superblock, not the snapshot's superblock); any LV whose
> #  fsck fails will send an email notification to a configurable user
> #  ($EMAIL).  This $EMAIL setting is optional, but its use is highly
> #  recommended, since if any LV fails, it will need to be checked
> #  manually, offline.

I would recommend also using "logger" to log something in /var/log/messages.
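
Something along these lines would do, as an untested sketch.  The helper
name "log_msg" and the "lvcheck" tag are my own invention, not something
the script already has:

```shell
# Hypothetical helper: report a message on stderr and in the syslog.
#  $1 = syslog priority (e.g. user.err), remaining args = message text
log_msg() {
	local prio="$1"
	shift
	echo "lvcheck: $*" >&2
	logger -t lvcheck -p "$prio" "$*"
}
```

It would be called as 'log_msg user.err "fsck of $dev failed"', so the
device name ends up in the log as well.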

> # attempt to force a check of $1 on the next reboot
> function try_force_check() {
> 	local dev="$1"
> 	local fstype="$2"
> 
> 	case "$fstype" in
> 		ext2|ext3)
> 		tune2fs -C 16000 -T "19000101" "$dev"
> 		;;
> 		reiserfs)
> 		# ???
> 		echo "Don't know how to set the last-check time on reiserfs..." >&2
> 		;;
> 		*)
> 		echo "Don't know how to set the last-check time on $fstype..." >&2
> 		;;
> 	esac
> }

These error messages are incorrect: "set the last-check time" should be
replaced with "force a check".  Since there isn't any reason to
special-case reiserfs here, you may as well remove that branch.
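
With both changes the function would look roughly like this (same
tune2fs call as in your version, only the fallback branch differs):

```shell
# attempt to force a check of $1 on the next reboot
try_force_check() {
	local dev="$1"
	local fstype="$2"

	case "$fstype" in
	ext2|ext3)
		tune2fs -C 16000 -T "19000101" "$dev"
		;;
	*)
		echo "Don't know how to force a check on $fstype..." >&2
		;;
	esac
}
```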

I suspect that a nice email to the XFS and JFS folks would get them to add
some mechanism to force a filesystem check on the next reboot.

> # check the FS on $1 passively, printing output to $3.
> function perform_check() {
> 	case "$fstype" in
> 		ext2|ext3)
> 		# the only point in fixing anything is just to see if fsck can.
> 		nice logsave -as "${tmpfile}" fsck.${fstype} -p -C 0 "$dev" &&
> 			nice logsave -as "${tmpfile}" fsck.${fstype} -fy -C 0 "$dev"

Hmm, I'm not sure I understand what you want to do here.  The fsck should
be run as 'e2fsck -fn "$dev"' (since we already know this is ext2|ext3).
Using "-C 0" isn't useful because we don't want progress in the output log,
and "-p" without "-f" will just check the superblock.  We don't want to be
fixing anything (since this should be a read-only snapshot), so "-fy" is
also not so great.
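
As a rough sketch of what I mean (untested).  The argument order follows
your comment above the function, and returning 1 for an unknown
filesystem type is my assumption, since I can't see the rest of your
case statement:

```shell
# check the FS on $1 (of type $2) read-only, saving the output in $3
perform_check() {
	local dev="$1"
	local fstype="$2"
	local tmpfile="$3"

	case "$fstype" in
	ext2|ext3)
		# -f: check even if the fs looks clean; -n: read-only, answer "no"
		nice logsave -as "${tmpfile}" e2fsck -fn "$dev"
		;;
	*)
		echo "Don't know how to check $fstype filesystems" >&2
		return 1
		;;
	esac
}
```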

> # do everything needed to check and reset dates and counters on /dev/$1/$2.
> function check_fs() {
> 	local tmpfile=`mktemp -t e2fsck.log.XXXXXXXXXX`
> 	trap "rm $tmpfile ; trap - RETURN" RETURN

For the log file it probably makes sense to keep this around with a
timestamp if there is a failure.  That means it is fine to generate a
random filename temporarily, but it should be renamed to something
meaningful (e.g. /var/log/lvfsck.$dev.$(date +%Y%m%d) or similar).
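
For example, something like the following untested helper.  The name
"save_failed_log" and the optional log directory argument are made up
for illustration:

```shell
# Hypothetical helper: keep the log of a failed check under a dated name.
#  $1 = temporary log file, $2 = device path (e.g. /dev/vg0/home),
#  $3 = log directory (assumed to default to /var/log)
save_failed_log() {
	local tmpfile="$1"
	local dev="$2"
	local logdir="${3:-/var/log}"
	local name dest

	# /dev/vg0/home -> vg0-home, so the result is a valid file name
	name=$(echo "${dev#/dev/}" | tr / -)
	dest="${logdir}/lvfsck.${name}.$(date +%Y%m%d)"

	mv "$tmpfile" "$dest" && echo "$dest"
}
```

Note that the RETURN trap would then need "rm -f", since the temporary
file is gone once it has been renamed.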

> 	# only one check happens at a time; using all the free space in the VG
> 	#  at least won't prevent other checks from happening...
> 	lvcreate -s -l "100%FREE" -n "${lv}-snap" "${vg}/${lv}"

To find free space, use "vgs -o vg_free --noheadings ${vg}", and the
LV size can be had from "lvs -o lv_size --noheadings ${vg}/${lv}".
You can strip the size suffixes with "--units M --nosuffix" to get
plain numbers in units of MB.
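
Roughly like this (untested; the function names are mine).  It asks for
the vg_free field, since that is the one that reports unallocated space
in the VG, and uses awk to drop the leading blanks and the fraction:

```shell
# Print the free space in volume group $1, in whole MB.
vg_free_mb() {
	vgs -o vg_free --noheadings --units M --nosuffix "$1" |
		awk '{ printf "%d\n", $1 }'
}

# Print the size of the LV $1 (given as vg/lv), in whole MB.
lv_size_mb() {
	lvs -o lv_size --noheadings --units M --nosuffix "$1" |
		awk '{ printf "%d\n", $1 }'
}
```

The results are plain integers, so they can be compared directly with
"[ ... -lt ... ]" in the script.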

It would also be good to use a more unique name than "${lv}-snap",
since that might conflict with an existing snapshot, and if the script
crashes the user might be wondering if that LV using 100% of the free
space is safe to delete or not.
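
One possible scheme, purely as an illustration (the "lvcheck" marker and
the helper name are hypothetical), is to include the script's name, a
timestamp and the PID:

```shell
# Print a snapshot name for LV $1, e.g. home-lvcheck-20080125085557-1234
snap_name() {
	echo "$1-lvcheck-$(date +%Y%m%d%H%M%S)-$$"
}
```

The lvcreate call would then use '-n "$(snap_name "$lv")"', and a
leftover snapshot is easy to recognise as belonging to this script.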

Please also add XFS support here, having it call "xfs_check", since
fsck.xfs is an empty shell...

For JFS it can also use "fsck.jfs -fn $dev" to check the filesystem.
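
One way to keep this manageable is a small lookup function, sketched
below (untested; "check_cmd" is a made-up name).  It only prints the
command, and the caller appends the device:

```shell
# Print the read-only check command for filesystem type $1,
# or fail if we don't know how to check that type.
check_cmd() {
	case "$1" in
	ext2|ext3)	echo "e2fsck -fn" ;;
	xfs)		echo "xfs_check" ;;
	jfs)		echo "fsck.jfs -fn" ;;
	*)		return 1 ;;
	esac
}
```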

> 	if perform_check "/dev/${vg}/${lv}-snap" "${fstype}" "${tmpfile}" ; then
> 		echo 'Background scrubbing succeeded!'
> 		try_delay_checks "/dev/${vg}/${lv}" "$fstype"
> 	else
> 		echo 'Background scrubbing failed! Reboot to fsck soon!'

Printing the device name in these messages, and sending them to the syslog
via logger would probably be more useful.

> 		try_force_check "/dev/${vg}/${lv}" "$fstype"
> 
> 		if test -n "$EMAIL"; then
> 			mail -s "Fsck of /dev/${vg}/${lv} failed!" $EMAIL < $tmpfile
> 		fi
>
> set -e

Have you verified that the script doesn't exit if an fsck fails with an
error?
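
A quick way to convince yourself is a toy script like the one below,
where "fake_fsck" is just a stand-in that returns 4 (the e2fsck exit
code for errors left uncorrected):

```shell
set -e

# stand-in for an fsck run that found uncorrected errors
fake_fsck() { return 4; }

# as the condition of an "if", a non-zero exit status is not fatal
if fake_fsck; then
	result="passed"
else
	result="failed"
fi
echo "check $result, script still running"

# a bare "fake_fsck" here would terminate the script, so the status
# has to be captured with "||" if it is needed later
status=0
fake_fsck || status=$?
echo "fsck exit status: $status"
```

So the "if perform_check ..." call is safe, but any fsck that is run
outside of an "if", "&&" or "||" will end the script on failure.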

> # pull in configuration -- don't bother with a parser, just use the shell's
> . /etc/lvcheck.conf

You should check that this file exists before sourcing it, or the script will
exit with an error:

[ -r /etc/lvcheck.conf ] && . /etc/lvcheck.conf

> # parse up lvscan output
> lvscan 2>&1 | grep ACTIVE | awk '{print $2;}' | \
> while read DEV ; do
> 	# remove the single quotes around the device name
> 	DEV="`echo "$DEV" | tr -d \'`"
> 
> 	# get the FS type
> 	FSTYPE="`/lib/udev/vol_id -t "$DEV"`"

Please use "blkid", since that is part of e2fsprogs already and avoids
an extra dependency.
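
For example (untested; the wrapper name is mine, and treating an
unrecognised device as an empty type instead of a fatal error is my
assumption about what you want under "set -e"):

```shell
# Print the filesystem type of device $1, or nothing if blkid can't tell.
get_fstype() {
	blkid -s TYPE -o value "$1" || true
}
```

The loop would then use 'FSTYPE="$(get_fstype "$DEV")"'.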

> 	# if the date is unknown, run fsck every day.  sigh.

Better to write "run fsck each time the script is run".

> 	# get the free space
> 	SPACE="`lvs --noheadings -o vg_free "$DEV"`"
> 
> 	# ensure that some free space exists, at least
> 	#  ??? -- can lvs print vg_free in plain numbers, or do I have to
> 	#  figure out what a suffix of "m" means?  skip the check for now.

"vgs", and --nosuffix, per above.

> #!/bin/sh
> 
> # e2check configuration variables:
> #
> #  EMAIL
> #   Address to send failure notifications to.  If empty,
> #   failure notifications will not be sent.
> #
> #  INTERVAL
> #   Days to wait between checks.  All LVs use the same
> #   INTERVAL, but the "days since last check" value can
> #   be different per LV, since that value is stored in
> #   the ext2/ext3 superblock.
> #
> #  AC_UNKNOWN
> #   Whether to run the e2fsck checks if the script can't
> #   determine whether the machine is on AC power.  Laptop
> #   users will want to set this to ABORT, while server and
> #   desktop users will probably want to set this to
> #   CONTINUE.  Those are the only two valid values.
> 
> EMAIL='root'
> INTERVAL=30
> AC_UNKNOWN="ABORT"

I would also make these all be defaults in the script (before this file is
parsed), so it works as expected if /etc/lvcheck.conf doesn't exist.

I'd also recommend that the default for AC_UNKNOWN be CONTINUE (or possibly
leaving it unset by default and having the script not error out in this
case), so that the script does something useful for the majority of users.
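
In the script that would be something like this, using the values from
your sample file except for AC_UNKNOWN:

```shell
# built-in defaults, used if the config file is missing or leaves a
# variable unset
EMAIL='root'
INTERVAL=30
AC_UNKNOWN="CONTINUE"

# the config file, if present, overrides the defaults above
if [ -r /etc/lvcheck.conf ]; then
	. /etc/lvcheck.conf
fi
```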

If we are worried about the laptop case, we could add checks to see
if the system has a PC card, since very few desktop systems have them.
Both the commands "pccardctl info" and "cardctl info" produce no output
on stdout if there is no PC card slot, and this could be used to decide
between "CONTINUE" for desktops and "ABORT" for laptops.
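
A sketch of that heuristic (untested, and the function names are
hypothetical).  It treats any output on stdout from either tool as "has
a PC card slot", and a missing tool as "no slot":

```shell
# Succeed if this machine appears to have a PC card slot.
has_pccard_slot() {
	local out
	out="$( (pccardctl info || cardctl info) 2>/dev/null )" || true
	[ -n "$out" ]
}

# Print the AC_UNKNOWN default to use on this machine.
default_ac_unknown() {
	if has_pccard_slot; then
		echo "ABORT"		# probably a laptop
	else
		echo "CONTINUE"		# probably a desktop or server
	fi
}
```

The script default would then be 'AC_UNKNOWN="$(default_ac_unknown)"',
still overridden by whatever the config file sets.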

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.