[dm-devel] [PATCH 1/1] MD/DM RAID: Fix hang due to recent RAID5 locking changes
Brassow Jonathan
jbrassow at redhat.com
Mon Nov 25 19:08:56 UTC 2013
On Nov 25, 2013, at 8:20 AM, Brassow Jonathan wrote:
>
> On Nov 24, 2013, at 6:03 PM, NeilBrown wrote:
>
>> On Sun, 24 Nov 2013 17:30:43 -0600 Jonathan Brassow <jbrassow at redhat.com>
>> wrote:
>>
>>> When commit 773ca82 was made in v3.12-rc1, it caused RAID4/5/6 devices
>>> that were created via device-mapper (dm-raid.c) to hang on creation.
>>> This is not necessarily the fault of that commit, but perhaps the way
>>> dm-raid.c was setting-up and activating devices.
>>>
>>> Device-mapper allows I/O and memory allocations in the constructor
>>> (i.e. raid_ctr()), but nominal and recovery I/O should not be allowed
>>> until a 'resume' is issued (i.e. raid_resume()). It has been problematic
>>> (at least in the past) to call mddev_resume before mddev_suspend was
>>> called, but this is how DM behaves - CTR then resume. To solve the
>>> problem, raid_ctr() was setting up the structures, calling md_run(), and
>>> then also calling mddev_suspend(). The stage was then set for raid_resume()
>>> to call mddev_resume().
>>>
>>> Commit 773ca82 caused a change in behavior during raid5.c:run().
>>> 'setup_conf->grow_stripes->grow_one_stripe' is called which creates the
>>> stripe cache and increments 'active_stripes'.
>>> 'grow_one_stripe->release_stripe' doesn't actually decrement 'active_stripes'
>>> anymore. The side effect of this is that when raid_ctr calls mddev_suspend,
>>> it waits for 'active_stripes' to reduce to 0 - which never happens.
>>
>> Hi Jon,
>> this sounds like the same bug that is fixed by
>>
>> commit ad4068de49862b083ac2a15bc50689bb30ce3e44
>> Author: majianpeng <majianpeng at gmail.com>
>> Date: Thu Nov 14 15:16:15 2013 +1100
>>
>> raid5: Use slow_path to release stripe when mddev->thread is null
>>
>> which is already en-route to 3.12.x. Could you check if it fixes the bug for
>> you?
>
> Sure, I'll check. Just reading the subject of the patch, I have high hopes. The slow path decrements 'active_stripes', which was causing the above problem... I'll make sure though.

Yes, this patch fixes the issue in 3.12-rc1+.

However, there is still a problem I'm chasing that was introduced in commit 566c09c (at least that's where bisecting lands).

The problem only shows up when I have taken a snapshot of a RAID5 device, and only if I have cycled the device (deactivated and reactivated it) before adding the snapshot:
1> lvcreate --type raid5 -i 3 -L 20M -n lv vg
2> lvchange -an vg/lv
3> lvchange -ay vg/lv
4> lvcreate -s vg/lv -L 50M -n snap
5> lvchange -an vg/lv
6> lvchange -ay vg/lv -- BUG: line 292 of raid5.c
The current bug triggers on the 'BUG_ON(atomic_read(&conf->active_stripes)==0)' in do_release_stripe(). I'm not sure why yet.
brassow
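
The six reproduction steps above can be collected into a small script, which may help when bisecting. This is only a sketch: the `VG`, `LV`, `SNAP` and `DRY_RUN` variables and the `run`/`repro` helpers are names invented here, not part of the original report. It assumes a volume group named `vg` already exists with enough physical volumes for three stripes plus parity. By default it only prints the commands; set `DRY_RUN=0` (as root, on a disposable test machine) to actually execute them.

```shell
#!/bin/sh
# Sketch of the reproduction sequence from the report above.
# VG, LV, SNAP and DRY_RUN are hypothetical knobs added for this example.
VG=${VG:-vg}
LV=${LV:-lv}
SNAP=${SNAP:-snap}
DRY_RUN=${DRY_RUN:-1}   # 1 = print commands only, 0 = really run them

# Print a command, and execute it only when DRY_RUN is 0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@" || exit 1
    fi
}

repro() {
    run lvcreate --type raid5 -i 3 -L 20M -n "$LV" "$VG"   # step 1: create RAID5 LV
    run lvchange -an "$VG/$LV"                             # step 2: deactivate
    run lvchange -ay "$VG/$LV"                             # step 3: reactivate
    run lvcreate -s "$VG/$LV" -L 50M -n "$SNAP"            # step 4: add snapshot
    run lvchange -an "$VG/$LV"                             # step 5: deactivate
    run lvchange -ay "$VG/$LV"                             # step 6: BUG reported here
}

repro
```

Skipping steps 2 and 3 (the cycle before the snapshot) should, per the report, avoid the BUG, which makes that pair a natural thing to toggle when narrowing the cause.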