[Linux-cluster] Options other than reboot to stop DP processes that can't be killed -9

Alan Brown ajb2 at mssl.ucl.ac.uk
Mon Aug 22 12:52:11 UTC 2011


Colin Simpson wrote:
> Probably not a cluster issue just pure kernel question.  Sounds like the
> driver or device is locked up and the driver or device is confused, so
> the processes attached to it will be hung. 

A common problem in a fabric environment is that there are two or more
paths to the tapes (i.e., 2 HBAs on the server) and commands may take
either path, which confuses the drives. Sending an unlock/reset command
via the other path is usually sufficient to recover, but it's an
extremely poorly documented area.

The most common case of this is tapes which refuse to eject. Lock
commands are per source and ORed, so unlock commands have to come from
the same HBA(s) which issued the lock. I've added scripts to my Bacula
tape handling routines to ensure this happens on our setup.
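
The "per source and ORed" behaviour can be illustrated with a toy model.
This is only a sketch of the logic described above, not real SCSI or
Bacula code: the TapeDrive class, its method names and the "hba0"/"hba1"
labels are all hypothetical.

```python
class TapeDrive:
    """Toy model of a drive that tracks eject locks per initiator (HBA)."""

    def __init__(self):
        # Set of initiators that currently hold a lock on the medium.
        self._locked_by = set()

    def prevent_removal(self, initiator):
        # A lock is recorded against the path it arrived on.
        self._locked_by.add(initiator)

    def allow_removal(self, initiator):
        # An unlock only clears the lock held by that same initiator.
        self._locked_by.discard(initiator)

    def can_eject(self):
        # Locks are ORed: any single remaining lock blocks the eject.
        return not self._locked_by


drive = TapeDrive()
drive.prevent_removal("hba0")
drive.prevent_removal("hba1")

drive.allow_removal("hba0")
still_locked = not drive.can_eject()  # hba1 still holds its lock

drive.allow_removal("hba1")
ejectable = drive.can_eject()         # every locking path has unlocked
```

In this model the tape stays locked until an unlock has arrived from
every path that issued a lock, which is why the unlock has to be sent
via each HBA rather than just one.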

> To be honest I've had similar problems on pretty much all Unixes for
> many years. And I've never found a good way out of it. Maybe not an
> option with your case and application, but I guess why most people have
> their backup systems running on separate dedicated boxes so it can be
> rebooted without affecting production systems.

Strongly agree. There are a number of other good reasons for running
dedicated backup systems, not least of which is the double-barrel
difficulty of bootstrapping a restore of the backup system itself AND
the dead cluster box in a worst-case scenario. (It's a lot easier with
separate boxes, as in most cases only one gets trashed, and you can
reduce risk further by physically separating backups from operational
servers.)

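A second good reason is the amount of IO a good tape backup solution can
generate. LTO tapes easily outrun spinning media, so a spooling setup
is needed to avoid shoeshining (the drive repeatedly stopping and
repositioning the tape because data isn't arriving fast enough to keep
it streaming).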
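
As a rough sketch of the reasoning: a drive shoeshines when the data
source can't keep up with the slowest speed at which the drive can
stream. The function name and the figures below are made-up placeholders
for illustration, not measurements or drive specifications.

```python
def needs_spool(source_mb_s, drive_min_stream_mb_s):
    """Return True if the source is too slow to keep the drive streaming.

    source_mb_s           -- sustained read rate of the data being backed up
    drive_min_stream_mb_s -- lowest rate at which the drive can keep streaming
    Both values are assumptions supplied by the caller.
    """
    return source_mb_s < drive_min_stream_mb_s


# Hypothetical numbers, for illustration only.
slow_source = needs_spool(source_mb_s=30, drive_min_stream_mb_s=40)
fast_source = needs_spool(source_mb_s=120, drive_min_stream_mb_s=40)
```

When the check comes back true, the usual answer is to spool the job to
fast local disk first and then write it to tape in one continuous run.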

All this stuff is best discussed on a list dedicated to backups. 
Discussions of this kind show up regularly and there are a number of 
canned answers at hand.

AB




