[dm-devel] snapshots of mirror problems
Mikulas Patocka
mpatocka at redhat.com
Wed Aug 18 14:50:34 UTC 2010
The problem is this:
Creating the first snapshot:
----------------------------
- preloads the -cow and -real devices and the origin and snapshot targets
- suspends the underlying lv (a mirror in this case) without
DM_SUSPEND_NOFLUSH_FLAG and with DM_SUSPEND_LOCKFS_FLAG. This waits for
all bios to drain and calls the filesystem driver to bring the filesystem
to a consistent state.
- swaps in the table with the origin targets
- resumes the underlying lv, the snapshot target and the origin target
Handling a mirror failure:
--------------------------
- preload the new table: a linear volume, a mirror with a reduced number
of legs, or a mirror with new legs allocated according to the allocation
policy
- suspend the mirror with the "noflush" flag; "noflush" causes failing
bios to be queued in device mapper
- swap the table with the new one
- resume the mirror; the queued bios are dequeued and passed to the new
device
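
To make the noflush sequence above concrete, here is a toy model in Python.
It is only a sketch: ToyMappedDevice, failed_mirror and linear are made-up
names, submit() stands for both issuing a bio and seeing its completion,
and none of this is the kernel's actual code or data structures.

```python
REQUEUE = "requeue"   # stands in for DM_ENDIO_REQUEUE
EIO = "-EIO"


class ToyMappedDevice:
    """Toy model of one mapped device (hypothetical, not kernel code)."""

    def __init__(self, table):
        self.table = table               # live table: callable bio -> status
        self.inactive = None             # preloaded table, not yet live
        self.noflush_suspending = False
        self.deferred = []               # bios queued in device mapper
        self.completed = []              # (bio, status) returned to the caller

    def preload(self, table):
        self.inactive = table

    def suspend_noflush(self):
        self.noflush_suspending = True

    def submit(self, bio):
        status = self.table(bio)
        if status == REQUEUE:
            if self.noflush_suspending:
                self.deferred.append(bio)    # hold the bio for the next table
                return
            status = EIO                     # requeue is not allowed here
        self.completed.append((bio, status))

    def resume(self):
        if self.inactive is not None:
            self.table, self.inactive = self.inactive, None   # swap tables
        self.noflush_suspending = False
        queued, self.deferred = self.deferred, []
        for bio in queued:
            self.submit(bio)                 # replay against the new table


def failed_mirror(bio):
    return REQUEUE      # a leg failed: push the bio back


def linear(bio):
    return "ok"


dev = ToyMappedDevice(failed_mirror)
dev.preload(linear)          # preload the replacement table
dev.suspend_noflush()        # failing bios are now queued, not failed
dev.submit("write-1")
dev.resume()                 # swap tables and replay the queued bio
print(dev.completed)         # [('write-1', 'ok')]
```

The write that hit the failed mirror is never reported as an error; it is
held while the device is suspended and completes against the new table.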
Now, the problem:
-----------------
1. If you say that these two operations are independent, two processes
will race with suspend and resume on the same device. Bad.
2. If you put a lock around them, the race turns into a possible deadlock:
if dm-raid1 suffers a failure during bio draining or filesystem cleanup,
the failure can't be recovered, because the recovery would need the lock
that the suspending process is holding.
3. If you suspend without DM_SUSPEND_NOFLUSH_FLAG, DM_ENDIO_REQUEUE is not
allowed, and requests returned with DM_ENDIO_REQUEUE are completed with
-EIO (see function dec_pending). So if a mirror leg or log failure
happens, dm-raid1 returns DM_ENDIO_REQUEUE and the I/O is incorrectly
finished with -EIO. If you remove this DM_ENDIO_REQUEUE->-EIO logic from
dec_pending, you are back at case 2 above (deadlock).
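
The trade-off between cases 2 and 3 can be written down as a small Python
sketch. flushing_suspend and its parameters are hypothetical names for
illustration; the function only models the outcome described above and is
not the logic of dec_pending itself.

```python
import errno


def flushing_suspend(in_flight, mirror_failed, convert_requeue_to_eio):
    """Model a suspend without the noflush flag that drains in-flight bios.

    Returns (statuses, drained): the status reported for each finished bio,
    and whether the drain completed so the suspend can proceed.
    """
    statuses = {}
    still_pending = []
    for bio in in_flight:
        if not mirror_failed:
            statuses[bio] = 0                 # normal completion
        elif convert_requeue_to_eio:
            statuses[bio] = -errno.EIO        # case 3: error reported wrongly
        else:
            still_pending.append(bio)         # case 2: waits for a repair
                                              # that cannot run
    return statuses, not still_pending


# Current behaviour: the drain finishes, but the write fails with -EIO.
print(flushing_suspend(["journal-write"], True, True))
# With the conversion removed: nothing fails, but the drain never finishes.
print(flushing_suspend(["journal-write"], True, False))
```

Either the caller sees an error for I/O that a mirror repair could have
saved, or the suspend waits forever for bios that nobody can complete.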
As for the argument that "it is very improbable": I think there is one
case where the probability may be more than minimal. If the user has a
mounted filesystem and doesn't use it for a long time, a disk may have
failed (or been unplugged) without the system noticing, because the disk
isn't used. Now, if the user creates a snapshot of the mirror and the
suspend starts cleaning up the filesystem journal, that may be the point
where the disk error is detected. But it can't be repaired.
I think it isn't easy to fix (see the 3 points above); the only possible
ways to fix it would be:
- make the mirror self-sufficient (integrate md)
or
- attach a dummy dm-linear (or snapshot-origin) passthrough target on
top of each mirror. If we do it, snapshot creation could suspend this
dummy passthrough target while dmeventd simultaneously suspends the
underlying mirror, and there would be no race or deadlock.
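
A rough Python sketch of why the passthrough layer would help, under the
assumption that each device keeps its own suspend state. The classes
Mirror and Passthrough and their methods are invented for this
illustration and do not correspond to kernel or LVM interfaces.

```python
class Mirror:
    """Toy mirror: holds bios while a leg is failed (hypothetical model)."""

    def __init__(self):
        self.failed = False
        self.suspended = False
        self.held = []                  # (bio, completion callback) pairs

    def io(self, bio, done):
        if self.failed:
            self.held.append((bio, done))   # cannot complete until repaired
        else:
            done(bio)

    def repair(self):
        # What dmeventd would do: noflush suspend, swap table, resume.
        if self.suspended:
            raise RuntimeError("mirror is already suspended by someone else")
        self.suspended = True
        self.failed = False             # new table no longer has the bad leg
        self.suspended = False
        held, self.held = self.held, []
        for bio, done in held:
            done(bio)                   # replay the held bios


class Passthrough:
    """Toy dummy target stacked on top of the mirror."""

    def __init__(self, lower):
        self.lower = lower
        self.suspended = False
        self.in_flight = 0

    def submit(self, bio):
        self.in_flight += 1
        self.lower.io(bio, self._done)

    def _done(self, bio):
        self.in_flight -= 1

    def begin_flush_suspend(self):
        self.suspended = True           # snapshot creation suspends only this

    def drained(self):
        return self.in_flight == 0


mirror = Mirror()
top = Passthrough(mirror)
mirror.failed = True                # a disk died while the fs was idle
top.submit("journal-write")         # filesystem cleanup touches the disk
top.begin_flush_suspend()           # snapshot creation starts draining
print(top.drained())                # False
mirror.repair()                     # mirror itself is not suspended
print(top.drained())                # True
```

Snapshot creation only ever suspends the top device, so the mirror below
stays free for the repair, and the repair is what lets the drain finish.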
Mikulas