Is there room for improvement in rescue mode? (was Re: Goodbye, Fedora)

Keith G. Robertson-Turner fedora-gmane.00003 at genesis-x.nildram.co.uk
Thu Feb 22 21:34:24 UTC 2007


Verily I say unto thee, that Jeff Spaleta spake thusly:

> In an effort to chart a new course of constructive discussion... is
> it worth brainstorming a bit about how to make rescue mode better or
> more accessible?

The current rescue mode is certainly sufficient for experienced
admins; however, it would be a good idea to implement some helper
scripts, and possibly even a fluxbox minimal environment. The latter
would be especially useful to facilitate administering LVM via
system-config-lvm, as I must admit the lvm command syntax is still a
mystery to me.

The logical procedure should be: identify (as far as possible) what
*can* go wrong, think about how *you* would fix it, see if there's any
way to (semi)automate that process with helper scripts, and compare
that with what's currently available in the rescue environment.

Off the top of my head, I'd suggest:

1) Enable installing an immutable rescue partition, and add it as a
   GRUB entry.

2) Add a minimal graphical environment.

3) Add a "Rescue Install" to Anaconda.

4) Add the various system-config-* helpers.

5) Have a dedicated RPM rescue tool, since this is a special
   case. I.e. is rpm + all deps correctly installed, are there stale
   locks, sanity check on the database, etc.

6) Have Anaconda suggest a backup partition, or ask for a network
   backup location, and set up a cron job (SafeKeep?). I.e. push hard
   to make backup mandatory(ish). I'd also suggest that Disk Druid,
   etc., push the suggestion of LVM *and* a snapshot partition, which
   is IMHO essential.
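
For illustration, the stale-lock and sanity checks from item 5 might
start out something like the sketch below. It is Python, and it only
reports problems, it fixes nothing. The function name check_rpm_db is
made up for this example, and the default root of /mnt/sysimage is an
assumption about where the rescue environment mounted the installed
system. Treating leftover __db.* files as "possible stale locks" is a
heuristic, not a definitive test.

```python
import glob
import os


def check_rpm_db(root="/mnt/sysimage"):
    """Return a list of problem descriptions for the RPM database under root.

    Hypothetical helper: the name and the default root are assumptions.
    """
    dbdir = os.path.join(root, "var/lib/rpm")
    problems = []

    # The main package database must exist and must not be empty.
    packages = os.path.join(dbdir, "Packages")
    if not os.path.isfile(packages):
        problems.append("missing Packages database")
    elif os.path.getsize(packages) == 0:
        problems.append("empty Packages database")

    # Leftover Berkeley DB region files can indicate a stale lock.
    for path in sorted(glob.glob(os.path.join(dbdir, "__db.*"))):
        problems.append("possible stale lock file: " + os.path.basename(path))

    return problems
```

A real tool would go on to suggest a fix for each finding (for stale
locks, typically removing the __db.* files and running
"rpm --rebuilddb"), rather than just listing them.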

You could do some checks to see if the default root system is
bootable, etc., then automatically fall back to rescue mode if not
(GRUB patch?), rather than allow the init to proceed then fail. This
is essential on a headless server, where it's "stuck" and you can't
ssh in to see why.
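
As a rough idea of what one such check could look like, here is a
Python sketch that reads a legacy grub.conf and reports any kernel or
initrd file it names that is missing or empty. The function name
missing_boot_files is invented for this example, and it assumes the
paths in grub.conf are relative to the boot partition (i.e. a separate
/boot). Note that "the files exist" is a much weaker test than "the
system is bootable"; it only catches the crudest failures.

```python
import os


def missing_boot_files(grub_conf_text, boot_dir):
    """Return the kernel/initrd paths from grub.conf that are absent or empty.

    Hypothetical helper. Assumes paths in grub.conf are relative to
    boot_dir, as they are when /boot is a separate partition.
    """
    missing = []
    for line in grub_conf_text.splitlines():
        words = line.split()
        # Only "kernel <path> ..." and "initrd <path>" lines matter here;
        # commented-out lines start with "#" and are skipped by this test.
        if len(words) >= 2 and words[0] in ("kernel", "initrd"):
            path = os.path.join(boot_dir, words[1].lstrip("/"))
            if not os.path.isfile(path) or os.path.getsize(path) == 0:
                missing.append(words[1])
    return missing
```

The fallback itself (booting the rescue entry when this list is
non-empty) would still have to happen in the boot loader or very early
in init, which this sketch does not attempt.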

If the idea of a GUI doesn't appeal to you (and for network admins it
probably doesn't), I'd suggest the implementation of an ncurses
interface for some of the helper tools (long term).

As a side note, though not directly related to "rescue", I advocate
that yum should be patched to enable partial-failure, i.e. "update as
much as possible, notify root of failures". I understand it is not a
popular theory, but broken deps/repos break automatic updates
completely, rather than partially, which could be a problem, e.g. on a
large network (like mine) where an essential security update (and all
other updates) is not deployed, simply because of *one* broken, and
non-essential, package. This just doesn't make any logical sense, and
could be an issue for those relying on automated mass system updates.
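
The "update as much as possible, notify root of failures" idea can be
sketched in a few lines of Python. This is not yum's API; the names
update_as_much_as_possible, update_one and failure_report are all
hypothetical, and update_one stands in for whatever actually updates a
single package. It also assumes each package can be attempted
independently, which glosses over the hard part: real updates come in
interdependent sets.

```python
def update_as_much_as_possible(packages, update_one):
    """Try every package; collect failures instead of aborting on the first.

    update_one is a caller-supplied callable (an assumption of this
    sketch) that raises an exception if the update fails.
    """
    updated = []
    failed = {}
    for name in packages:
        try:
            update_one(name)
        except Exception as err:
            # Record the failure and carry on with the remaining packages.
            failed[name] = str(err)
        else:
            updated.append(name)
    return updated, failed


def failure_report(failed):
    """Format the failures as text suitable for mailing to root."""
    lines = ["%d update(s) failed:" % len(failed)]
    for name in sorted(failed):
        lines.append("  %s: %s" % (name, failed[name]))
    return "\n".join(lines)
```

With this shape, one broken package ends up in the report while the
security update next to it in the list still gets applied.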

Anyway, back on topic, let's say *I* ask *you*, my sysadmin, to fix
the following. What would you (i.e. the script) need to do to
(semi)automate this? Not all of these *have* solutions that can be
implemented in software, but even the *hardware* issues could be given
more verbose notification/suggestions:

1) swapon ... won't activate, because the swap drive is dead, but
   this is a low memory system set to automatically boot into X.

2) root filesystem mount failure.

3) Missing/corrupt initrd/bzimage.

4) Missing/non-functional SCSI/IDE drivers in an *updated* kernel, so
   cannot mount root filesystem (but previous kernel works).

5) service <foobar> segfaults and halts init.

6) service <foobar> has (missing files | other problem) and waits
   forever (does not detach to daemon).

   (hint for 5 and 6 - watchdog timer)

7) Initscripts are b0rked, typo, non-fatal error, etc. (I recently
   caught one, still unresolved, nfs mountd problem). Why is this
   needed for rescue mode? Because not all startup errors are noticed
   by the (unobservant | people who blink a lot). :) A way of running
   through ($chroot)/init.d in rescue mode looking for non-zero return
   codes, and suggesting updates/workarounds etc., would be handy. But
   maybe this is stretching "rescue mode" a little too far.

8) RPM is b0rked. How do I reinstall RPM ... without RPM??? Cyclic
   dependency error 101: Arrrrggghh!

9) Again, maybe stretching "rescue" too far, but how about fslint in
   rescue mode, to clean up all those "#PRELINK", "foobar~", and other
   junk? Especially on a monolithic install (all under /) where /tmp
   is full.

10) The only other thing I can think of is SMART disk health checks;
    however, according to Google's recent report (they did a massive
    test), SMART is next to useless at actually predicting failure.
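
To make the watchdog hint for 5 and 6, and the return-code scan from
7, a bit more concrete, here is a Python sketch that runs every
executable in a script directory under a timeout and records anything
that hangs or exits non-zero. The function name probe_scripts, the
"status" action and the 10 second timeout are all assumptions made for
this example; it also runs the scripts directly, so a rescue tool would
still need to chroot into the installed system first.

```python
import os
import subprocess


def probe_scripts(script_dir, action="status", timeout=10):
    """Run each executable in script_dir and report the ones that misbehave.

    Hypothetical helper. Returns a dict mapping script name to its
    non-zero return code, "timeout", or an "error: ..." string.
    Scripts that exit 0 in time are left out.
    """
    results = {}
    for name in sorted(os.listdir(script_dir)):
        path = os.path.join(script_dir, name)
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            continue
        try:
            proc = subprocess.run(
                [path, action],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,  # the "watchdog": give up on hung scripts
            )
        except subprocess.TimeoutExpired:
            results[name] = "timeout"
        except OSError as err:
            results[name] = "error: %s" % err
        else:
            # A negative code means the process died from a signal,
            # e.g. -11 for a segfault.
            if proc.returncode != 0:
                results[name] = proc.returncode
    return results
```

The output is only raw material: turning "nfs returned 1" into a
useful suggestion is where the real work would be.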

That's it.

I'm sure 99% of the above is useless, but hey ... that's why they call
it brainstorming :)

-- 
K.
http://slated.org - Slated, Rated & Blogged

.----
| "Future archaeologists will be able to identify a 'Vista Upgrade
| Layer' when they go through our landfill sites" - Sian Berry, the
| Green Party.
`----

Fedora Core release 5 (Bordeaux) on sky, running kernel 2.6.19-1.2288.fc5
 21:32:25 up 3 days,  8:57,  2 users,  load average: 0.26, 0.31, 0.27



