forced fsck (again?)
Andreas Dilger
adilger at sun.com
Tue Jan 29 23:56:27 UTC 2008
On Jan 28, 2008 19:56 -0500, Bryan Kadzban wrote:
> >> # Assume the script won't run more than one instance at a time?
> >> lvremove -f "${lvtemp##/dev}"
> >
> > Should check the error return and bail out of script if there is an error.
>
> Will that catch the "more than one instance at a time" case (e.g. if
> another script run is still running e2fsck on this snapshot)? Assuming
> lvremove can fail (and it probably can), it's probably a good idea to
> check it in any case, but if running e2fsck makes lvremove fail (until
> e2fsck finishes), that's a decent way to get rid of the comment too.
>
> Also, I think it'd be better to skip just the current FS, rather than an
> "exit 1" type bail-out, right?
It's a hard call... In some sense if there is an error we may leave a
string of LVs around that are filling up the VG, but the presence of
the LV (and hopefully being unable to remove it while e2fsck is running)
also serves as a "locking" mechanism in case some e2fsck takes a very
long time to run.
I guess as long as we print something in the syslog, and the LV remains
in place with a suitably clear "this isn't very useful" name, then
eventually the user will notice it and delete it.
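
For illustration, the "skip just this FS, but say so in syslog" behaviour could look roughly like the sketch below. It is only a sketch: remove_stale_snapshots, LVREMOVE, LOGGER and the optional third argument are hypothetical names that are not in the posted script, and it assumes lvremove refuses to remove a snapshot that e2fsck still has open.

```shell
# Sketch only: remove_stale_snapshots is a hypothetical helper, not part of
# the posted script. LVREMOVE and LOGGER can be overridden for dry runs.
: "${LVREMOVE:=lvremove}"
: "${LOGGER:=logger}"

# Try to remove every leftover snapshot matching /dev/$1/$2*.
# Returns 0 if all removals succeeded (or none were needed), 1 otherwise.
remove_stale_snapshots() {
    rss_vg="$1"
    rss_base="$2"
    rss_devdir="${3:-/dev}"   # overridable so the sketch can run without LVM

    for rss_lv in "$rss_devdir/$rss_vg/$rss_base"* ; do
        [ -e "$rss_lv" ] || continue
        if ! "$LVREMOVE" -f "$rss_vg/${rss_lv##*/}" ; then
            # Leave the snapshot in place: it doubles as a crude lock.
            "$LOGGER" -t lvcheck -p user.err -- \
                "Could not remove stale snapshot $rss_lv; skipping this LV"
            return 1
        fi
    done
    return 0
}
```

In the per-LV loop this would be called as `remove_stale_snapshots "$VG" "${LV}-lvcheck-temp" || continue`, so one stuck snapshot does not stop the remaining filesystems from being checked.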
> -----
>
> Create a script to transparently run fsck in the background on any
> active LVM logical volumes, as long as the machine is on AC power, and
> that LV has been last checked more than a configurable number of days
> ago. Also create an optional configuration file to set various options
> in the script.
>
> Signed-Off-By: Bryan Kadzban <bryan at kadzban.is-a-geek.net>
You can add a Signed-Off-By: Andreas Dilger <adilger at sun.com> here,
as it does everything I think is needed at this point...
Probably good to put a version number in the script, along with
your name/email so it is clear what version a user is running.
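
Something along these lines would do. The variable and function names, the "0.1" version string and the -V/--version option are all placeholders for illustration, not anything in the posted script:

```shell
# Hypothetical names: LVCHECK_VERSION, LVCHECK_AUTHOR, print_version and
# handle_version_flag are illustrative and do not appear in the posted script.
LVCHECK_VERSION="0.1"    # placeholder value
LVCHECK_AUTHOR="Bryan Kadzban <bryan at kadzban.is-a-geek.net>"

# Print the version and author so a user can report what they are running.
print_version() {
    echo "lvcheck version $LVCHECK_VERSION"
    echo "Written by $LVCHECK_AUTHOR"
}

# Returns 0 (after printing) if $1 asks for the version, 1 otherwise.
handle_version_flag() {
    case "${1:-}" in
        -V|--version) print_version; return 0 ;;
        *) return 1 ;;
    esac
}
```

Near the top of the script, `handle_version_flag "$1" && exit 0` would then answer a version query before any LVM commands are run.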
> #!/bin/bash
> #
> # lvcheck
>
> # Released under the GNU General Public License, either version 2 or
> # (at your option) any later version.
>
> # Overview:
> #
> # Run this from cron periodically (e.g. once per week). If the
> # machine is on AC power, it will run the checks; otherwise they will
> # all be skipped. (If the script can't tell whether the machine is
> # on AC power, it will use a setting in the configuration file
> # (/etc/lvcheck.conf) to decide whether to continue with the checks,
> # or abort.)
> #
> # The script will then decide which logical volumes are active, and
> # can therefore be checked via an LVM snapshot. Each of these LVs
> # will be queried to find its last-check day, and if that was more
> # than $INTERVAL days ago (where INTERVAL is set in the configuration
> # file as well), or if the last-check day can't be determined, then
> # the script will take an LVM snapshot of that LV and run fsck on the
> # snapshot. The snapshot will be set to use 1/500 the space of the
> # source LV. After fsck finishes, the snapshot is destroyed.
> # (Snapshots are checked serially.)
> #
> # Any LV that passes fsck should have its last-check time updated (in
> # the real superblock, not the snapshot's superblock); any LV whose
> # fsck fails will send an email notification to a configurable user
> # ($EMAIL). This $EMAIL setting is optional, but its use is highly
> # recommended, since if any LV fails, it will need to be checked
> # manually, offline. Relevant messages are also sent to syslog.
>
> # Set default values for configuration params. Changes to these values
> # will be overwritten on an upgrade! To change these values, use
> # /etc/lvcheck.conf.
> EMAIL='root'
> INTERVAL=30
> AC_UNKNOWN="CONTINUE"
> MINSNAP=256
> MINFREE=0
>
> # send $2 to syslog, with severity $1
> # severities are emerg/alert/crit/err/warning/notice/info/debug
> function log() {
>     local sev="$1"
>     local msg="$2"
>     local arg=
>
>     # log warning-or-higher messages to stderr as well
>     case "$sev" in
>         emerg|alert|crit|err|warning) arg=-s ;;
>     esac
>
>     logger -t lvcheck $arg -p user."$sev" -- "$msg"
> }
>
> # determine whether the machine is on AC power
> function on_ac_power() {
>     local any_known=no
>
>     # try sysfs power class first
>     if [ -d /sys/class/power_supply ] ; then
>         for psu in /sys/class/power_supply/* ; do
>             if [ -r "${psu}/type" ] ; then
>                 type="`cat "${psu}/type"`"
>
>                 # ignore batteries
>                 [ "${type}" = "Battery" ] && continue
>
>                 online="`cat "${psu}/online"`"
>
>                 [ "${online}" = 1 ] && return 0
>                 [ "${online}" = 0 ] && any_known=yes
>             fi
>         done
>
>         [ "${any_known}" = "yes" ] && return 1
>     fi
>
>     # else fall back to AC adapters in /proc
>     if [ -d /proc/acpi/ac_adapter ] ; then
>         for ac in /proc/acpi/ac_adapter/* ; do
>             if [ -r "${ac}/state" ] ; then
>                 grep -q on-line "${ac}/state" && return 0
>                 grep -q off-line "${ac}/state" && any_known=yes
>             elif [ -r "${ac}/status" ] ; then
>                 grep -q on-line "${ac}/status" && return 0
>                 grep -q off-line "${ac}/status" && any_known=yes
>             fi
>         done
>
>         [ "${any_known}" = "yes" ] && return 1
>     fi
>
>     # could not tell either way: fall back to the configured policy
>     if [ "$AC_UNKNOWN" = "CONTINUE" ] ; then
>         return 0 # assume on AC power
>     elif [ "$AC_UNKNOWN" = "ABORT" ] ; then
>         return 1 # assume on battery
>     else
>         log "err" "Invalid value for AC_UNKNOWN in the config file"
>         exit 1
>     fi
> }
>
> # attempt to force a check of $1 on the next reboot
> function try_force_check() {
>     local dev="$1"
>     local fstype="$2"
>
>     case "$fstype" in
>         ext2|ext3)
>             tune2fs -C 16000 "$dev"
>             ;;
>         *)
>             log "warning" "Don't know how to force a check on $fstype..."
>             ;;
>     esac
> }
>
> # attempt to set the last-check time on $1 to now, and the mount count to 0.
> function try_delay_checks() {
>     local dev="$1"
>     local fstype="$2"
>
>     case "$fstype" in
>         ext2|ext3)
>             tune2fs -C 0 -T now "$dev"
>             ;;
>         *)
>             log "warning" "Don't know how to delay checks on $fstype..."
>             ;;
>     esac
> }
>
> # print the date that $1 was last checked, in a format that date(1) will
> # accept, or "Unknown" if we don't know how to find that date.
> function try_get_check_date() {
>     local dev="$1"
>     local fstype="$2"
>
>     case "$fstype" in
>         ext2|ext3)
>             dumpe2fs -h "$dev" 2>/dev/null | grep 'Last checked:' | \
>                 sed -e 's/Last checked:[[:space:]]*//'
>             ;;
>         *)
>             # TODO: add support for various FSes here
>             echo "Unknown"
>             ;;
>     esac
> }
>
> # check the FS on $1 passively, saving output to $3.
> function perform_check() {
>     local dev="$1"
>     local fstype="$2"
>     local tmpfile="$3"
>
>     case "$fstype" in
>         ext2|ext3)
>             nice logsave -as "${tmpfile}" e2fsck -fn "$dev"
>             return $?
>             ;;
>         reiserfs)
>             echo Yes | nice logsave -as "${tmpfile}" fsck.reiserfs --check "$dev"
>             # apparently can't fail? let's hope not...
>             return 0
>             ;;
>         xfs)
>             nice logsave -as "${tmpfile}" xfs_check "$dev"
>             return $?
>             ;;
>         jfs)
>             nice logsave -as "${tmpfile}" fsck.jfs -fn "$dev"
>             return $?
>             ;;
>         *)
>             log "warning" "Don't know how to check $fstype filesystems passively: assuming OK."
>             ;;
>     esac
> }
>
> # do everything needed to check and reset dates and counters on /dev/$1/$2.
> function check_fs() {
>     local vg="$1"
>     local lv="$2"
>     local fstype="$3"
>     local snapsize="$4"
>
>     local tmpfile=`mktemp -t lvcheck.log.XXXXXXXXXX`
>     local errlog="/var/log/lvcheck-${vg}@${lv}-`date +'%Y%m%d'`"
>     local snaplvbase="${lv}-lvcheck-temp"
>     local snaplv="${snaplvbase}-`date +'%Y%m%d'`"
>
>     # clean up any left-over snapshot LVs
>     for lvtemp in /dev/${vg}/${snaplvbase}* ; do
>         if [ -e "$lvtemp" ] ; then
>             # Assume the script won't run more than one instance at a time?
>
>             log "warning" "Found stale snapshot $lvtemp: attempting to remove."
>
>             # strip the leading "/dev/" so lvremove is given "vg/lv"
>             if ! lvremove -f "${lvtemp#/dev/}" ; then
>                 log "err" "Could not delete stale snapshot $lvtemp"
>                 return 1
>             fi
>         fi
>     done
>
>     # and create this one; snapsize is in megabytes, so pass it with -L
>     # (a size) rather than -l (a number of extents)
>     lvcreate -s -L "${snapsize}M" -n "${snaplv}" "${vg}/${lv}"
>
>     if perform_check "/dev/${vg}/${snaplv}" "${fstype}" "${tmpfile}" ; then
>         log "info" "Background scrubbing of /dev/${vg}/${lv} succeeded."
>         try_delay_checks "/dev/${vg}/${lv}" "$fstype"
>     else
>         log "err" "Background scrubbing of /dev/${vg}/${lv} failed: run fsck offline soon!"
>         try_force_check "/dev/${vg}/${lv}" "$fstype"
>
>         if test -n "$EMAIL"; then
>             mail -s "Fsck of /dev/${vg}/${lv} failed!" $EMAIL < "$tmpfile"
>         fi
>
>         # save the log file in /var/log in case mail is disabled
>         mv "$tmpfile" "$errlog"
>     fi
>
>     rm -f "$tmpfile"
>     lvremove -f "${vg}/${snaplv}"
> }
>
> # pull in configuration -- overwrite the defaults above if the file exists
> [ -r /etc/lvcheck.conf ] && . /etc/lvcheck.conf
>
> # check whether the machine is on AC power: if not, skip fsck
> on_ac_power || exit 0
>
> # parse up lvscan output
> lvscan 2>&1 | grep ACTIVE | awk '{print $2;}' | \
> while read DEV ; do
>     # remove the single quotes around the device name
>     DEV="`echo "$DEV" | tr -d \'`"
>
>     # get the FS type: blkid prints TYPE="blah"
>     eval `blkid -s TYPE "$DEV" | cut -d' ' -f2`
>
>     # get the last-check time
>     check_date=`try_get_check_date "$DEV" "$TYPE"`
>
>     # if the date is unknown, run fsck every time the script runs. sigh.
>     if [ "$check_date" != "Unknown" ] ; then
>         # add $INTERVAL days, and throw away the time portion
>         check_day=`date --date="$check_date $INTERVAL days" +'%Y%m%d'`
>
>         # get today's date, and skip the check if it's not within the interval
>         today=`date +'%Y%m%d'`
>         [ $check_day -gt $today ] && continue
>     fi
>
>     # get the volume group and logical volume names
>     VG="`lvs --noheadings -o vg_name "$DEV"`"
>     LV="`lvs --noheadings -o lv_name "$DEV"`"
>
>     # get the free space and LV size (in megs), guess at the snapshot
>     # size, and see how much the admin will let us use (keeping MINFREE
>     # available)
>     SPACE="`lvs --noheadings --units M --nosuffix -o vg_free "$DEV"`"
>     SIZE="`lvs --noheadings --units M --nosuffix -o lv_size "$DEV"`"
>     SNAPSIZE="`expr "$SIZE" / 500`"
>     AVAIL="`expr "$SPACE" - "$MINFREE"`"
>
>     # if we don't even have MINSNAP space available, skip the LV
>     if [ "$MINSNAP" -gt "$AVAIL" -o "$AVAIL" -le 0 ] ; then
>         log "warning" "Not enough free space on volume group for ${DEV}; skipping"
>         continue
>     fi
>
>     # make snapshot large enough to handle e.g. journal and other updates
>     [ "$SNAPSIZE" -lt "$MINSNAP" ] && SNAPSIZE="$MINSNAP"
>
>     # limit snapshot to available space (VG space minus min-free)
>     [ "$SNAPSIZE" -gt "$AVAIL" ] && SNAPSIZE="$AVAIL"
>
>     # don't need to check SNAPSIZE again: MINSNAP <= AVAIL, MINSNAP <= SNAPSIZE,
>     # and SNAPSIZE <= AVAIL, combined, means SNAPSIZE must be between MINSNAP
>     # and AVAIL, which is what we need -- assuming AVAIL > 0
>
>     # check it
>     check_fs "$VG" "$LV" "$TYPE" "$SNAPSIZE"
> done
>
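
The snapshot sizing at the end of the loop is easier to reason about (and to test) when pulled out into a function. A rough sketch follows; snapshot_size and to_whole_mb are hypothetical helpers, not part of the posted script. It also assumes that lvs may print sizes with leading whitespace and a fractional part (e.g. "  1024.00"), which expr would reject, so check that against the LVM version in use.

```shell
# Hypothetical helpers for illustration; neither is in the posted script.

# to_whole_mb VALUE
# Strip whitespace and any fractional part, e.g. "  1024.00" becomes 1024.
to_whole_mb() {
    twm_v=$(printf '%s' "$1" | tr -d '[:space:]')
    echo "${twm_v%%.*}"
}

# snapshot_size SIZE SPACE MINSNAP MINFREE   (all in whole megabytes)
# Prints the snapshot size to use: 1/500 of the LV, but at least MINSNAP
# and at most SPACE - MINFREE. Returns 1 if even MINSNAP does not fit.
snapshot_size() {
    ss_size="$1"; ss_space="$2"; ss_minsnap="$3"; ss_minfree="$4"

    ss_avail=$((ss_space - ss_minfree))
    [ "$ss_avail" -gt 0 ] && [ "$ss_avail" -ge "$ss_minsnap" ] || return 1

    ss_snap=$((ss_size / 500))
    [ "$ss_snap" -lt "$ss_minsnap" ] && ss_snap="$ss_minsnap"
    [ "$ss_snap" -gt "$ss_avail" ] && ss_snap="$ss_avail"
    echo "$ss_snap"
}
```

The loop body would then reduce to something like `SNAPSIZE=$(snapshot_size "$(to_whole_mb "$SIZE")" "$(to_whole_mb "$SPACE")" "$MINSNAP" "$MINFREE") || continue`, with the "not enough free space" warning logged before the continue.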
> #!/bin/sh
>
> # e2check configuration file
Minor note: this should read "lvcheck configuration file".
> # This file follows the pattern of sshd_config: default
> # values are shown here, commented-out.
>
> # EMAIL
> # Address to send failure notifications to. If empty,
> # failure notifications will not be sent.
>
> #EMAIL='root'
>
> # INTERVAL
> # Days to wait between checks. All LVs use the same
> # INTERVAL, but the "days since last check" value can
> # be different per LV, since that value is stored in
> # the filesystem superblock.
>
> #INTERVAL=30
>
> # AC_UNKNOWN
> # Whether to run the e2fsck checks if the script can't
> # determine whether the machine is on AC power. Laptop
> # users will want to set this to ABORT, while server and
> # desktop users will probably want to set this to
> # CONTINUE. Those are the only two valid values.
>
> #AC_UNKNOWN="CONTINUE"
>
> # MINSNAP
> # Minimum snapshot size to take, in megabytes. The
> # default snapshot size is 1/500 the size of the logical
> # volume, but if that size is less than MINSNAP, the
> # script will use MINSNAP instead. This should be large
> # enough to handle e.g. journal updates, and other disk
> # changes that require (semi-)constant space.
>
> #MINSNAP=256
>
> # MINFREE
> # Minimum amount of space (in megabytes) to keep free in
> # each volume group when creating snapshots.
>
> #MINFREE=0
>
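
Since the config file is simply sourced, a typo in it only surfaces later (or never). A small sanity check right after sourcing would catch that early. This is a sketch under assumptions: validate_config is a hypothetical helper, and treating INTERVAL, MINSNAP and MINFREE as non-negative whole numbers is inferred from how the script uses them.

```shell
# validate_config is a hypothetical helper; it is not in the posted script.
# Returns 0 if the settings look sane; otherwise prints the problem to
# stderr and returns 1.
validate_config() {
    case "$AC_UNKNOWN" in
        CONTINUE|ABORT) ;;
        *) echo "AC_UNKNOWN must be CONTINUE or ABORT" >&2; return 1 ;;
    esac

    # The numeric settings are compared with -gt/-lt and fed to expr, so
    # they need to be plain non-negative integers.
    for vc_name in INTERVAL MINSNAP MINFREE ; do
        eval "vc_value=\${$vc_name}"
        case "$vc_value" in
            ''|*[!0-9]*)
                echo "$vc_name must be a non-negative integer" >&2
                return 1
                ;;
        esac
    done
    return 0
}
```

It would be called once, right after the line that sources /etc/lvcheck.conf, e.g. `validate_config || exit 1`.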
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.