forced fsck (again?)

Andreas Dilger adilger at sun.com
Tue Jan 29 23:56:27 UTC 2008


On Jan 28, 2008  19:56 -0500, Bryan Kadzban wrote:
> >> # Assume the script won't run more than one instance at a time?
> >> lvremove -f "${lvtemp##/dev}"
> > 
> > Should check the error return and bail out of script if there is an error.
> 
> Will that catch the "more than one instance at a time" case (e.g. if
> another script run is still running e2fsck on this snapshot)?  Assuming
> lvremove can fail (and it probably can), it's probably a good idea to
> check it in any case, but if running e2fsck makes lvremove fail (until
> e2fsck finishes), that's a decent way to get rid of the comment too.
> 
> Also, I think it'd be better to skip just the current FS, rather than an
> "exit 1" type bail-out, right?

It's a hard call...  If there is an error we may leave a string of stale
snapshot LVs behind that fill up the VG.  On the other hand, the presence
of the snapshot LV (which hopefully can't be removed while e2fsck is
running on it) also serves as a "locking" mechanism in case some e2fsck
takes a very long time to run.

I guess as long as we print something to syslog, and the LV remains
in place with a suitably clear "this isn't very useful" name, then
eventually the user will notice it and delete it.
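
As a rough sketch of the "skip just this filesystem" approach: the helper
below tries to remove a stale snapshot and, if that fails, logs the
problem and reports failure so the caller can move on to the next LV.
The names remove_stale_snapshot and LVREMOVE are made up for illustration
and are not part of the posted script; whether lvremove really refuses
to remove a snapshot that e2fsck has open is an assumption that would
need testing.

```shell
#!/bin/sh
# Sketch only: remove_stale_snapshot and LVREMOVE are hypothetical names.
# LVREMOVE can be overridden so the logic can be tried without LVM.
LVREMOVE="${LVREMOVE:-lvremove}"

# Try to remove the left-over snapshot LV named in $1 (e.g. vg/lv).
# Returns 0 if it was removed, 1 otherwise.  On failure the snapshot is
# left in place, so it keeps acting as a crude lock.
remove_stale_snapshot() {
	snap="$1"

	if $LVREMOVE -f "$snap" >/dev/null 2>&1 ; then
		return 0
	fi

	msg="Could not remove stale snapshot $snap; skipping this filesystem"
	# fall back to stderr if logger is unavailable
	logger -t lvcheck -p user.err -- "$msg" 2>/dev/null || \
		echo "lvcheck: $msg" >&2
	return 1
}
```

A caller looping over filesystems could then write
"remove_stale_snapshot "$snap" || continue" instead of exiting.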

> - -----
> 
> Create a script to transparently run fsck in the background on any
> active LVM logical volumes, as long as the machine is on AC power, and
> that LV has been last checked more than a configurable number of days
> ago.  Also create an optional configuration file to set various options
> in the script.
> 
> Signed-Off-By: Bryan Kadzban <bryan at kadzban.is-a-geek.net>

You can add a Signed-Off-By: Andreas Dilger <adilger at sun.com> here,
as I think the script does everything that is needed at this point...

It would probably be good to put a version number in the script, along
with your name/email, so it is clear which version a user is running.
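
Something along these lines would do, as a minimal sketch.  The variable
names, the print_version helper and the version string "0.1" are all
placeholders of mine, not anything from the posted script:

```shell
#!/bin/sh
# Placeholder values: substitute the real version number and contact.
LVCHECK_VERSION="0.1"
LVCHECK_AUTHOR="Bryan Kadzban <bryan at kadzban.is-a-geek.net>"

# Print a one-line identification string for bug reports.
print_version() {
	echo "lvcheck version ${LVCHECK_VERSION} (${LVCHECK_AUTHOR})"
}

# In the main script, before doing any work, something like:
#   case "$1" in -V|--version) print_version; exit 0 ;; esac
```

The same string could also be written to syslog at startup so the
version shows up next to any failure messages.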

> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.7 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
> 
> iD8DBQFHnnnRS5vET1Wea5wRAw0iAJ9wcLyfBSaH5FSIJNH0YakzDCUvjwCgnJEH
> lPScP39vBYIIjOQPiftgDs8=
> =XjFF
> -----END PGP SIGNATURE-----

> #!/bin/bash
> #
> # lvcheck
> 
> # Released under the GNU General Public License, either version 2 or
> #  (at your option) any later version.
> 
> # Overview:
> #
> #  Run this from cron periodically (e.g. once per week).  If the
> #  machine is on AC power, it will run the checks; otherwise they will
> #  all be skipped.  (If the script can't tell whether the machine is
> #  on AC power, it will use a setting in the configuration file
> #  (/etc/lvcheck.conf) to decide whether to continue with the checks,
> #  or abort.)
> #
> #  The script will then decide which logical volumes are active, and
> #  can therefore be checked via an LVM snapshot.  Each of these LVs
> #  will be queried to find its last-check day, and if that was more
> #  than $INTERVAL days ago (where INTERVAL is set in the configuration
> #  file as well), or if the last-check day can't be determined, then
> #  the script will take an LVM snapshot of that LV and run fsck on the
> #  snapshot.  The snapshot will be set to use 1/500 the space of the
> #  source LV.  After fsck finishes, the snapshot is destroyed.
> #  (Snapshots are checked serially.)
> #
> #  Any LV that passes fsck should have its last-check time updated (in
> #  the real superblock, not the snapshot's superblock); any LV whose
> #  fsck fails will send an email notification to a configurable user
> #  ($EMAIL).  This $EMAIL setting is optional, but its use is highly
> #  recommended, since if any LV fails, it will need to be checked
> #  manually, offline.  Relevant messages are also sent to syslog.
> 
> # Set default values for configuration params.  Changes to these values
> #  will be overwritten on an upgrade!  To change these values, use
> #  /etc/lvcheck.conf.
> EMAIL='root'
> INTERVAL=30
> AC_UNKNOWN="CONTINUE"
> MINSNAP=256
> MINFREE=0
> 
> # send $2 to syslog, with severity $1
> # severities are emerg/alert/crit/err/warning/notice/info/debug
> function log() {
> 	local sev="$1"
> 	local msg="$2"
> 	local arg=
> 
> 	# log warning-or-higher messages to stderr as well
> 	case "$sev" in
> 	emerg|alert|crit|err|warning)
> 		arg=-s
> 		;;
> 	esac
> 
> 	logger -t lvcheck $arg -p user."$sev" -- "$msg"
> }
> 
> # determine whether the machine is on AC power
> function on_ac_power() {
> 	local any_known=no
> 
> 	# try sysfs power class first
> 	if [ -d /sys/class/power_supply ] ; then
> 		for psu in /sys/class/power_supply/* ; do
> 			if [ -r "${psu}/type" ] ; then
> 				type="`cat "${psu}/type"`"
> 
> 				# ignore batteries
> 				[ "${type}" = "Battery" ] && continue
> 
> 				online="`cat "${psu}/online"`"
> 
> 				[ "${online}" = 1 ] && return 0
> 				[ "${online}" = 0 ] && any_known=yes
> 			fi
> 		done
> 
> 		[ "${any_known}" = "yes" ] && return 1
> 	fi
> 
> 	# else fall back to AC adapters in /proc
> 	if [ -d /proc/acpi/ac_adapter ] ; then
> 		for ac in /proc/acpi/ac_adapter/* ; do
> 			if [ -r "${ac}/state" ] ; then
> 				grep -q on-line "${ac}/state" && return 0
> 				grep -q off-line "${ac}/state" && any_known=yes
> 			elif [ -r "${ac}/status" ] ; then
> 				grep -q on-line "${ac}/status" && return 0
> 				grep -q off-line "${ac}/status" && any_known=yes
> 			fi
> 		done
> 
> 		[ "${any_known}" = "yes" ] && return 1
> 	fi
> 
> 	if [ "$AC_UNKNOWN" == "CONTINUE" ] ; then
> 		return 0   # assume on AC power
> 	elif [ "$AC_UNKNOWN" == "ABORT" ] ; then
> 		return 1   # assume on battery
> 	else
> 		log "err" "Invalid value for AC_UNKNOWN in the config file"
> 		exit 1
> 	fi
> }
> 
> # attempt to force a check of $1 on the next reboot
> function try_force_check() {
> 	local dev="$1"
> 	local fstype="$2"
> 
> 	case "$fstype" in
> 	ext2|ext3)
> 		tune2fs -C 16000 "$dev"
> 		;;
> 	*)
> 		log "warning" "Don't know how to force a check on $fstype..."
> 		;;
> 	esac
> }
> 
> # attempt to set the last-check time on $1 to now, and the mount count to 0.
> function try_delay_checks() {
> 	local dev="$1"
> 	local fstype="$2"
> 
> 	case "$fstype" in
> 	ext2|ext3)
> 		tune2fs -C 0 -T now "$dev"
> 		;;
> 	*)
> 		log "warning" "Don't know how to delay checks on $fstype..."
> 		;;
> 	esac
> }
> 
> # print the date that $1 was last checked, in a format that date(1) will
> #  accept, or "Unknown" if we don't know how to find that date.
> function try_get_check_date() {
> 	local dev="$1"
> 	local fstype="$2"
> 
> 	case "$fstype" in
> 	ext2|ext3)
> 		dumpe2fs -h "$dev" 2>/dev/null | grep 'Last checked:' | \
> 				sed -e 's/Last checked:[[:space:]]*//'
> 		;;
> 	*)
> 		# TODO: add support for various FSes here
> 		echo "Unknown"
> 		;;
> 	esac
> }
> 
> # check the FS on $1 passively, saving output to $3.
> function perform_check() {
> 	local dev="$1"
> 	local fstype="$2"
> 	local tmpfile="$3"
> 
> 	case "$fstype" in
> 	ext2|ext3)
> 		nice logsave -as "${tmpfile}" e2fsck -fn "$dev"
> 		return $?
> 		;;
> 	reiserfs)
> 		echo Yes | nice logsave -as "${tmpfile}" fsck.reiserfs --check "$dev"
> 		# apparently can't fail?  let's hope not...
> 		return 0
> 		;;
> 	xfs)
> 		nice logsave -as "${tmpfile}" xfs_check "$dev"
> 		return $?
> 		;;
> 	jfs)
> 		nice logsave -as "${tmpfile}" fsck.jfs -fn "$dev"
> 		return $?
> 		;;
> 	*)
> 		log "warning" "Don't know how to check $fstype filesystems passively: assuming OK."
> 		;;
> 	esac
> }
> 
> # do everything needed to check and reset dates and counters on /dev/$1/$2.
> function check_fs() {
> 	local vg="$1"
> 	local lv="$2"
> 	local fstype="$3"
> 	local snapsize="$4"
> 
> 	local tmpfile=`mktemp -t lvcheck.log.XXXXXXXXXX`
> 	local errlog="/var/log/lvcheck-${vg}@${lv}-`date +'%Y%m%d'`"
> 	local snaplvbase="${lv}-lvcheck-temp"
> 	local snaplv="${snaplvbase}-`date +'%Y%m%d'`"
> 
> 	# clean up any left-over snapshot LVs
> 	for lvtemp in /dev/${vg}/${snaplvbase}* ; do
> 		if [ -e "$lvtemp" ] ; then
> 			# Assume the script won't run more than one instance at a time?
> 
> 			log "warning" "Found stale snapshot $lvtemp: attempting to remove."
> 
> 			# strip the leading /dev/ to get the vg/lv form
> 			if ! lvremove -f "${lvtemp#/dev/}" ; then
> 				log "err" "Could not delete stale snapshot $lvtemp"
> 				rm -f "$tmpfile"
> 				return 1
> 			fi
> 		fi
> 	done
> 
> 	# and create this one; snapsize is in megabytes, so use -L rather
> 	#  than -l (which counts logical extents)
> 	if ! lvcreate -s -L "${snapsize}M" -n "${snaplv}" "${vg}/${lv}" ; then
> 		log "err" "Could not create snapshot of /dev/${vg}/${lv}"
> 		rm -f "$tmpfile"
> 		return 1
> 	fi
> 
> 	if perform_check "/dev/${vg}/${snaplv}" "${fstype}" "${tmpfile}" ; then
> 		log "info" "Background scrubbing of /dev/${vg}/${lv} succeeded."
> 		try_delay_checks "/dev/${vg}/${lv}" "$fstype"
> 	else
> 		log "err" "Background scrubbing of /dev/${vg}/${lv} failed: run fsck offline soon!"
> 		try_force_check "/dev/${vg}/${lv}" "$fstype"
> 
> 		if test -n "$EMAIL"; then
> 			mail -s "Fsck of /dev/${vg}/${lv} failed!" $EMAIL < "$tmpfile"
> 		fi
> 
> 		# save the log file in /var/log in case mail is disabled
> 		mv "$tmpfile" "$errlog"
> 	fi
> 
> 	rm -f "$tmpfile"
> 	lvremove -f "${vg}/${snaplv}"
> }
> 
> # pull in configuration -- overwrite the defaults above if the file exists
> [ -r /etc/lvcheck.conf ] && . /etc/lvcheck.conf
> 
> # check whether the machine is on AC power: if not, skip fsck
> on_ac_power || exit 0
> 
> # parse up lvscan output
> lvscan 2>&1 | grep ACTIVE | awk '{print $2;}' | \
> while read DEV ; do
> 	# remove the single quotes around the device name
> 	DEV="`echo "$DEV" | tr -d \'`"
> 
> 	# get the FS type: blkid prints TYPE="blah"
> 	eval `blkid -s TYPE "$DEV" | cut -d' ' -f2`
> 
> 	# get the last-check time
> 	check_date=`try_get_check_date "$DEV" "$TYPE"`
> 
> 	# if the date is unknown, run fsck every time the script runs.  sigh.
> 	if [ "$check_date" != "Unknown" ] ; then
> 		# add $INTERVAL days, and throw away the time portion
> 		check_day=`date --date="$check_date $INTERVAL days" +'%Y%m%d'`
> 
> 		# get today's date, and skip the check if it's not within the interval
> 		today=`date +'%Y%m%d'`
> 		[ $check_day -gt $today ] && continue
> 	fi
> 
> 	# get the volume group and logical volume names (lvs pads its
> 	#  output with blanks, so strip them)
> 	VG="`lvs --noheadings -o vg_name "$DEV" | tr -d ' '`"
> 	LV="`lvs --noheadings -o lv_name "$DEV" | tr -d ' '`"
> 
> 	# get the free space and LV size (in megs), guess at the snapshot
> 	#  size, and see how much the admin will let us use (keeping MINFREE
> 	#  available).  lvs prints fractional sizes but expr only handles
> 	#  integers, so drop the blanks and the fractional part.
> 	SPACE="`lvs --noheadings --units M --nosuffix -o vg_free "$DEV" | tr -d ' ' | cut -d. -f1`"
> 	SIZE="`lvs --noheadings --units M --nosuffix -o lv_size "$DEV" | tr -d ' ' | cut -d. -f1`"
> 	SNAPSIZE="`expr "$SIZE" / 500`"
> 	AVAIL="`expr "$SPACE" - "$MINFREE"`"
> 
> 	# if we don't even have MINSNAP space available, skip the LV
> 	if [ "$MINSNAP" -gt "$AVAIL" -o "$AVAIL" -le 0 ] ; then
> 		log "warning" "Not enough free space on volume group for ${DEV}; skipping"
> 		continue
> 	fi
> 
> 	# make snapshot large enough to handle e.g. journal and other updates
> 	[ "$SNAPSIZE" -lt "$MINSNAP" ] && SNAPSIZE="$MINSNAP"
> 
> 	# limit snapshot to available space (VG space minus min-free)
> 	[ "$SNAPSIZE" -gt "$AVAIL" ] && SNAPSIZE="$AVAIL"
> 
> 	# don't need to check SNAPSIZE again: MINSNAP <= AVAIL, MINSNAP <= SNAPSIZE,
> 	#  and SNAPSIZE <= AVAIL, combined, means SNAPSIZE must be between MINSNAP
> 	#  and AVAIL, which is what we need -- assuming AVAIL > 0
> 
> 	# check it
> 	check_fs "$VG" "$LV" "$TYPE" "$SNAPSIZE"
> done
> 
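
The snapshot sizing rules at the end of the quoted script (1/500 of the
LV, at least MINSNAP, at most the VG free space minus MINFREE) can be
pulled into one small function, which makes them easy to try out with
plain numbers.  This is only a sketch: snap_size and its argument order
are my own invention, and all values are assumed to be whole megabytes.

```shell
#!/bin/sh
# snap_size LV_SIZE_MB VG_FREE_MB MINSNAP_MB MINFREE_MB
# Hypothetical helper: prints the snapshot size in megabytes, or prints
# nothing and returns 1 if the VG can't spare even MINSNAP.
snap_size() {
	size="$1"
	free="$2"
	minsnap="$3"
	minfree="$4"

	# space we are allowed to use, keeping MINFREE untouched
	avail=$((free - minfree))

	# if we don't even have MINSNAP available, give up on this LV
	if [ "$avail" -le 0 ] || [ "$minsnap" -gt "$avail" ] ; then
		return 1
	fi

	# start from 1/500 of the LV, then clamp to [MINSNAP, AVAIL]
	snap=$((size / 500))
	[ "$snap" -lt "$minsnap" ] && snap="$minsnap"
	[ "$snap" -gt "$avail" ] && snap="$avail"

	echo "$snap"
}
```

For example, a 100000 MB LV in a VG with 50000 MB free gets the 256 MB
minimum, since 100000 / 500 = 200 is below MINSNAP.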

> #!/bin/sh
> 
> # e2check configuration file

Minor note - this should read "lvcheck configuration file".

> # This file follows the pattern of sshd_config: default
> #  values are shown here, commented-out.
> 
> #  EMAIL
> #   Address to send failure notifications to.  If empty,
> #   failure notifications will not be sent.
> 
> #EMAIL='root'
> 
> #  INTERVAL
> #   Days to wait between checks.  All LVs use the same
> #   INTERVAL, but the "days since last check" value can
> #   be different per LV, since that value is stored in
> #   the filesystem superblock.
> 
> #INTERVAL=30
> 
> #  AC_UNKNOWN
> #   Whether to run the e2fsck checks if the script can't
> #   determine whether the machine is on AC power.  Laptop
> #   users will want to set this to ABORT, while server and
> #   desktop users will probably want to set this to
> #   CONTINUE.  Those are the only two valid values.
> 
> #AC_UNKNOWN="CONTINUE"
> 
> #  MINSNAP
> #   Minimum snapshot size to take, in megabytes.  The
> #   default snapshot size is 1/500 the size of the logical
> #   volume, but if that size is less than MINSNAP, the
> #   script will use MINSNAP instead.  This should be large
> #   enough to handle e.g. journal updates, and other disk
> #   changes that require (semi-)constant space.
> 
> #MINSNAP=256
> 
> #  MINFREE
> #   Minimum amount of space (in megabytes) to keep free in
> #   each volume group when creating snapshots.
> 
> #MINFREE=0
> 
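
Since the config file is simply sourced by the script, a typo in it only
shows up later (or not at all).  A quick sanity check after sourcing
might look like the sketch below; is_uint and validate_config are
hypothetical helpers, not part of the posted script, and the rules are
just my reading of the comments in the config file.

```shell
#!/bin/sh
# is_uint VALUE: succeeds if VALUE is a non-empty string of digits.
is_uint() {
	case "$1" in
	''|*[!0-9]*) return 1 ;;
	*) return 0 ;;
	esac
}

# validate_config: check AC_UNKNOWN, INTERVAL, MINSNAP and MINFREE.
# Prints each problem to stderr; returns non-zero if any were found.
validate_config() {
	rc=0

	case "$AC_UNKNOWN" in
	CONTINUE|ABORT) ;;
	*)
		echo "AC_UNKNOWN must be CONTINUE or ABORT" >&2
		rc=1
		;;
	esac

	for name in INTERVAL MINSNAP MINFREE ; do
		# look up the value of the variable whose name is in $name
		eval "value=\$$name"
		if ! is_uint "$value" ; then
			echo "$name must be a non-negative integer" >&2
			rc=1
		fi
	done

	return $rc
}
```

EMAIL is deliberately not checked, since an empty value is documented
as a valid way to turn notifications off.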


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



