[linux-lvm] Re: Found: workaround for crash on snapshot removal, and hopefully a good clue to the underlying bug

James G. Sack (jim) jsack at tandbergdatacorp.com
Fri Dec 9 02:07:32 UTC 2005


More testing results..

A) The snapshot create/remove cycle with the suspend/resume calls around
the lvremove ran over 1500 passes before I stopped it -- all the while
with continuous i/o to the origin filesystem. Remember this is on a
patched 2.6.14-1.1637_FC4 (patches listed in previous message below).

B) I installed vanilla FC4 build 2.6.14-1.1644_FC4, and tried the same
test, but in this case, the suspend prevents lvremove from running -- I
guess an automatic suspend has been added to 1644 which was missing or
broken in 1637 (maybe?). Anyway, I took out the suspend/resume and the
crash came back! So maybe the patches had something to do with test A
succeeding?

C) I rebooted to vanilla 2.6.14-1.1637_FC4 and am now starting a test
with the suspend/resume calls around the lvremove. So far it looks like
it's passed a few dozen cycles. So maybe the patches are irrelevant. 
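
For reference, the cycle used in A) and C) can be sketched as a small
script. This is only a sketch under assumptions: the names VG, LVorigin
and Snap are taken from the quoted message below, while DRY_RUN, PASSES,
run() and snap_cycle() are hypothetical additions for illustration and
were not part of the original test. It defaults to printing the commands
rather than running them, since the real sequence can lock up the box.

```shell
#!/bin/sh
# Sketch of the snapshot create/remove cycle with the suspend/resume wrapper.
# VG, LVorigin and Snap follow the names in the quoted message;
# DRY_RUN, PASSES, run() and snap_cycle() are illustrative assumptions.
VG=${VG:-VG}
ORIGIN=${ORIGIN:-LVorigin}
SNAP=${SNAP:-Snap}
PASSES=${PASSES:-3}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = really run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

snap_cycle() {
    i=1
    while [ "$i" -le "$PASSES" ]; do
        run lvcreate -s -n "$SNAP" -L 10G "/dev/$VG/$ORIGIN" || return 1
        run dmsetup suspend "$VG-$ORIGIN" || return 1
        run lvremove -f "/dev/$VG/$SNAP"
        rc=$?
        # Always resume the origin, even if lvremove failed.
        run dmsetup resume "$VG-$ORIGIN"
        [ "$rc" -eq 0 ] || return "$rc"
        i=$((i + 1))
    done
}

snap_cycle
```

Run it as root with DRY_RUN=0 only on a machine that is allowed to
crash, and with the copy loop from the quoted message running against
the origin filesystem.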

Can anybody make any sense of this? 

I'm logging 'level = 6' to lvm2.log -- would anybody be able to suggest
what to look for in there? Hmmm, maybe tomorrow, I should create a
simple log with a single failure to see if there are any locking
asymmetries or something like that.
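
The logging mentioned above corresponds to a log section in lvm.conf
along these lines. The file path shown is an assumption (only the file
name lvm2.log and the level are given here), so adjust it to taste:

```
log {
    # assumed location; only the name lvm2.log is stated above
    file = "/var/log/lvm2.log"
    level = 6
}
```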

Another context reminder: I'm running
lvm version
  LVM version:     2.02.01-cvs (2005-11-10)
  Library version: 1.02.01-cvs (2005-11-10)
  Driver version:  4.4.0

Will let the test run overnight, and report tomorrow.

Regards,
..jim



On Thu, 2005-12-08 at 17:41 -0800, James G. Sack (jim) wrote:
> Hooray! 
> 
> I think I've found a definitive clue to a crash during lvremove of a
> snapshot. I have a reliably repeatable failure test and a workaround
> that seems to be passing.
> 
> Here's the regression test:
> --------------------------
> 
> 1. arrange to have some continuous i/o on an lvm volume
>  I do it with a simple shell loop that copies a 1GB file to another name
> and then back (essentially: 'while :;do cp abcd wxyz;cp wxyz abcd;done')
> 
> 2. while that's running, start a snapshot create/remove loop
>  Such as 'while :;do lvcreate -snSnap -L10G LVorigin;
>   lvremove -f /dev/VG/Snap;done'
> 
> My experience is that a system crash always occurs upon executing the
> lvremove call. The first one! 
> 
>   (On my most recent experiments, the system is locking hard, 
>    although earlier I was able to see a kcopyd oops and the 
>    keyboard scrollback worked.)
> 
> 
> Here's the workaround
> ---------------------
> 
> In the snap-cycle test surround the lvremove command with suspend/resume
>   dmsetup suspend VG-LVorigin
>   lvremove -f /dev/VG/Snap
>   dmsetup resume VG-LVorigin
> 
> I am currently testing this workaround on a patched 2.6.14-1.1637_FC4
> kernel 
>   (using 4 patches suggested by agk on Tue, 15 Nov 2005 22:33:58 +0000)
> 
> <excerpt from that prior message>
> ---------------------------------
> > > The kcopyd.c BUG at line 145 is triggered by the first lvremove
> > > following start of the i/o (copy loop).
> 
> Try some kernel patches.
> 
>   http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/
> 
> in particular these four:
> 
>   dm-snapshot-bio_list-fix.patch
>   dm-snapshot-metadata-reading-separation.patch
>   dm-snapshot-load-metadata-on-creation.patch
>   dm-ioctl-reduce-pf-memalloc-usage.patch
> </excerpt>
>   
> 
> ==> BUT I suspect the lvremove problem is independent of those patches,
> as I was getting the same symptom before putting in the suspend/resume.
> 
> 
> I thought I had tried suspend/resume previously and found that they were
> unnecessary because the create automatically performed a suspend/resume
> -- so my current workaround is the result of a desperation-experiment of
> applying the suspend/resume wrapper ONLY to the lvremove step. 
> 
> ==> SO MAYBE this current success points to a bug in the lvremove code,
> eh?
> 
> 
> I plan on repeating my test on a vanilla kernel. In the meantime, I hope
> someone can look at the lvremove code (agk?..).
> 
> Regards,
> ..jim
> 
> 



