forced fsck (again?)

Mon Jan 28 17:48:04 UTC 2008

On Jan 25, 2008  21:02 -0500, Bryan Kadzban wrote:
> > I suspect that a nice email to the XFS and JFS folks would get them to add
> > some mechanism to force a filesystem check on the next reboot.
> 
> Is the issue that those FSes don't have any such mechanism today, or is
> it just that I don't know how to do this on them?

I don't think they have any such mechanism (at least not one that I know
about), but I think they will find it useful to add.

> (Should fsck.xfs perhaps just exec xfs_check and pass it all the args?
> That's a whole separate discussion, probably.)

Right...

> Create a script to transparently run fsck in the background on any
> active LVM logical volumes, as long as the machine is on AC power, and
> that LV has been last checked more than a configurable number of days
> ago.  Also create an optional configuration file to set various options
> in the script.
> 
> Signed-Off-By: Bryan Kadzban <bryan at kadzban.is-a-geek.net>

> #!/bin/sh
> #
> # lvcheck
> 
> # send $2 to syslog, with severity $1
> # severities are emerg/alert/crit/err/warning/notice/info/debug
> function log() {
> 	local sev="$1"
> 	local msg="$2"
> 	local arg=
> 
> 	# log warning-or-higher messages to stderr as well
> 	case "$sev" in
> 		emerg|alert|crit|err|warning) arg=-s ;;
> 	esac
> 
> 	logger $arg -p user."$sev" -- "$msg"
> }

This should use "-t lvcheck" so that it reports what program is generating
the message.
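
A minimal sketch of what that could look like, assuming the usual
util-linux/bsdutils logger, where -t sets the tag and -s copies the message
to stderr.  The case statement for the severity test is my own wording, not
something taken from the patch:

```shell
#!/bin/sh
# Sketch only: send $2 to syslog with severity $1, tagged "lvcheck".
log() {
	sev="$1"
	msg="$2"
	arg=

	# warning-or-higher messages go to stderr as well (-s)
	case "$sev" in
	emerg|alert|crit|err|warning) arg=-s ;;
	esac

	# -t sets the tag, so each syslog line names the program
	logger $arg -t lvcheck -p user."$sev" -- "$msg"
}
```

With that, something like: log "err" "snapshot failed" should show up in
syslog prefixed with "lvcheck:".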

> # attempt to force a check of $1 on the next reboot
> function try_force_check() {
> 	local dev="$1"
> 	local fstype="$2"
> 
> 	case "$fstype" in
> 		ext2|ext3)
> 		tune2fs -C 16000 -T "19000101" "$dev"

I'm a tiny bit reluctant to overwrite the "last checked" date, since this
might be useful information for the administrator (i.e. it bounds the
interval in which the corruption appeared).  Setting the "mount count"
is enough to force a check, and the real mount count can be reconstructed
from the "reboot" entries in the "last" log.
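
Roughly what I have in mind, as a sketch only.  It assumes the filesystem
still has a maximum mount count enabled (i.e. nobody ran "tune2fs -c 0" or
"-c -1" on it), and the fallback branch for other filesystems is my
assumption, not part of the patch:

```shell
#!/bin/sh
# Sketch only: force a check of $1 on the next reboot by raising the
# mount count, leaving the "last checked" timestamp untouched.
try_force_check() {
	dev="$1"
	fstype="$2"

	case "$fstype" in
	ext2|ext3|ext4)
		# 16000 is the value from the patch; it only needs to
		# exceed the filesystem's maximum mount count
		tune2fs -C 16000 "$dev"
		;;
	*)
		# no known way to force a check on other filesystems
		return 1
		;;
	esac
}
```

The non-zero return lets the caller log that no check could be forced for
that filesystem type.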

> # attempt to set the last-check time on $1 to now, and the mount count to 0.
> function try_delay_checks() {
> 	local dev="$1"
> 	local fstype="$2"
> 
> 	case "$fstype" in
> 		ext2|ext3)

It is a lot clearer if the "cases" (ext2|ext3|ext4) are aligned with the
"case" statement, like below, since that provides a better separation:

	case "$fstype" in
	ext2|ext3|ext4)
		tune2fs -C 0 -T now "$dev"
		;;

> 	reiserfs)
> 		# do nothing?
> 		;;

I thought you were going to remove the empty reiserfs cases?

> # check the FS on $1 passively, saving output to $3.
> function perform_check() {
> 	local dev="$1"
> 	local fstype="$2"
> 	local tmpfile="$3"
> 
> 	case "$fstype" in
> 		ext2|ext3)

Ditto on indenting the cases.

> # do everything needed to check and reset dates and counters on /dev/$1/$2.
> function check_fs() {
> 	local vg="$1"
> 	local lv="$2"
> 	local fstype="$3"
> 	local snapsize="$4"
> 
> 	local tmpfile=`mktemp -t e2fsck.log.XXXXXXXXXX`

This shouldn't be "e2fsck.log".  Maybe "lvcheck.log.XXXXXXXXXX"?

> 	local errlog="/var/log/lvcheck-${vg}@${lv}-`date +'%Y%m%d'`"
> 	local snaplvbase="${lv}-lvcheck-temp"
> 	local snaplv="${snaplvbase}-`date +'%Y%m%d'`"
> 
> 	# clean up any left-over snapshot LVs
> 	for lvtemp in /dev/${vg}/${snaplvbase}* ; do
> 		if [ -e "$lvtemp" ] ; then
> 			# Assume the script won't run more than one instance at a time?
> 			lvremove -f "${lvtemp##/dev}"

This should check the return status of lvremove and bail out of the script
if there is an error.
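
Something along these lines, as a sketch.  The helper name
remove_stale_snapshots and the optional third argument (device directory,
defaulting to /dev) are made up for illustration, and I pass the LV to
lvremove in "vg/lv" form rather than the "${lvtemp##/dev}" form the patch
uses:

```shell
#!/bin/sh
# Sketch only: remove left-over snapshot LVs for LV base name $2 in VG $1,
# stopping at the first lvremove failure.
remove_stale_snapshots() {
	vg="$1"
	snaplvbase="$2"
	devdir="${3:-/dev}"	# hypothetical override, handy for testing

	for lvtemp in "$devdir/$vg/$snaplvbase"* ; do
		# the glob stays literal when nothing matches
		[ -e "$lvtemp" ] || continue

		if ! lvremove -f "$vg/${lvtemp##*/}" ; then
			echo "lvcheck: could not remove $lvtemp" >&2
			return 1
		fi
	done
	return 0
}
```

The caller can then skip this LV (or exit) when the function returns
non-zero, instead of trying to create a snapshot on top of a stale one.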

> # parse up lvscan output
> lvscan 2>&1 | grep ACTIVE | awk '{print $2;}' | \
> while read DEV ; do
> 
> 	if [ "$SNAPSIZE" -gt "$SPACE" ] ; then
> 		log "err" "Can't take a snapshot of $DEV: not enough free space in the VG."
> 		continue

Well, the 1/500 rule is only a guideline.  For example, I have a huge
filesystem for TV shows, but it doesn't change that often, so it would
make more sense to just reduce $SNAPSIZE to $SPACE (assuming some minimum
amount of free space is available).

Add defaults that can be overridden in the .conf file:
	MINFREE=0	# megabytes to leave free in each volume group
	MINSNAP=256	# megabytes for minimum snapshot size.

	# make snapshot large enough to handle e.g. journal and other updates
	[ $SNAPSIZE -lt $MINSNAP ] && SNAPSIZE=$MINSNAP

	# limit snapshot to available space
	[ $SNAPSIZE -gt $((SPACE - MINFREE)) ] && SNAPSIZE=$((SPACE - MINFREE))

	# if we don't have enough space, skip this check
	if [ $SNAPSIZE -lt $MINSNAP ]; then
		log "warning" "Check of $LV can't get ${SNAPSIZE}MB, skipping"
		continue
	fi

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.