[lvm-devel] [PATCH v2]: Mirror: Fix hangs and lock-ups caused by attempting label reads of mirrors

Zdenek Kabelac zkabelac at redhat.com
Wed Oct 23 08:50:08 UTC 2013


On 23.10.2013 01:39, Jonathan Brassow wrote:
> Changed some variable/function names and added more explanation to the
> config file.
>
> I will send a separate patch that contains a warning message if mirrors
> are activated and 'ignore_lvm_mirrors' is not set... We can talk about
> whether that is needed also.
>
>   brassow
>
> Mirror: Fix hangs and lock-ups caused by attempting label reads of mirrors
>
> There is a problem with the way mirrors have been designed to handle
> failures that is resulting in stuck LVM processes and hung I/O.  When
> mirrors encounter a write failure, they block I/O and notify userspace
> to reconfigure the mirror to remove failed devices.  This process is
> open to a couple races:
> 1) Any LVM process other than the one that is meant to deal with the
> mirror failure can attempt to read the mirror, fail, and block other
> LVM commands (including the repair command) from proceeding due to
> holding a lock on the volume group.
> 2) If there are multiple mirrors that suffer a failure in the same
> volume group, a repair can block while attempting to read the LVM
> label from one mirror while trying to repair the other.
>
> Mitigation of these races has been attempted by disallowing label reading
> of mirrors that are either suspended or are indicated as blocking by
> the kernel.  While this has closed the window of opportunity for hitting

Is a mirror read 'abortable' (e.g. via SIGALRM) when it's blocked?
Our 'scan' routine could be trying to read a mirror that suddenly
gets 'frozen' by a write error.
If we used alarm()/SIGALRM, we should be able to abort the read operation
(though I'm not sure where the read gets stuck; maybe it would need a change in
the kernel driver?). After the read failure we could detect the mirror error
condition through dm status and react to it.

Something very similar needs to be added for scanning of e.g. thinly
provisioned devices, which may get stuck when the pool is overfilled, so
some solution in this direction is unavoidable (IMHO we should not hide the
problem by disabling scanning).


> 2) Instrument a way to allow asynchronous label reading - allowing
> blocked label reads to be ignored while continuing to process the LVM
> command.  This action would allow LVM commands to continue even
> though they would have otherwise blocked trying to read a mirror.  They
> can then release their lock and allow a repair command to commence.  In
> the event of #2 above, the repair command already in progress can continue
> and repair the failed mirror.

Async read is not the only problem here - we have other issues:

E.g. we activate a mirror and wait for confirmation (dmsetup udevcomplete),
but this may also run the watch rule, and blkid may also get blocked (mirror error).

So now we get into fancy states, where our command is waiting for
semaphore completion (no timeout on the semaphore for now), which never happens
since the master udev kills its udev scan completely, without any 'finalization'
step.
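
For illustration, a bounded wait on the semaphore could look roughly like the
sketch below. It assumes the udev cookie is a System V semaphore that the
waiter expects to drop to zero; wait_sem_zero_timeout() is a hypothetical
helper, not libdevmapper API, and the 10 ms polling step is arbitrary. It polls
with IPC_NOWAIT instead of blocking in semop() forever, so a killed udev worker
that never decrements the semaphore ends in ETIMEDOUT rather than a hang.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <time.h>

/*
 * Hypothetical helper: wait until semaphore 0 of set 'semid' reaches
 * zero, or give up after roughly 'timeout_ms' milliseconds.
 * Returns 0 when the semaphore is zero, -1 on error or timeout
 * (errno == ETIMEDOUT in the timeout case).
 */
int wait_sem_zero_timeout(int semid, long timeout_ms)
{
    /* sem_op == 0 means "wait for zero"; IPC_NOWAIT makes it a poll. */
    struct sembuf sb = { .sem_num = 0, .sem_op = 0, .sem_flg = IPC_NOWAIT };
    const struct timespec step = { 0, 10 * 1000 * 1000 }; /* 10 ms */
    long waited_ms = 0;

    for (;;) {
        if (semop(semid, &sb, 1) == 0)
            return 0;                   /* semaphore reached zero */
        if (errno != EAGAIN && errno != EINTR)
            return -1;                  /* real error, e.g. set removed */
        if (waited_ms >= timeout_ms) {
            errno = ETIMEDOUT;
            return -1;
        }
        nanosleep(&step, NULL);         /* back off before the next poll */
        waited_ms += 10;
    }
}
```

What the command should do after such a timeout (retry, warn, or clean up the
semaphore) is a separate policy question that this sketch does not answer.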

So we would probably also need to make a mirror device 'unscannable'??
(which makes it unusable for filesystems??)

Anyway - more trouble ahead....

Zdenek




