[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]

Re: [libvirt] storage lock issues




On 11/02/2017 09:10 AM, Cedric Bosdonnat wrote:
> Hi John,
> 
> On Thu, 2017-11-02 at 07:03 -0400, John Ferlan wrote:
>>
>> On 11/02/2017 06:20 AM, Cedric Bosdonnat wrote:
>>> Hi all,
>>>
>>> I'm investigating a crash in the storage driver, and it seems the solution
>>> is not as obvious as I wanted it.
>>>
>>> Basically what happens is that libvirtd crashes due to the pool refresh thread
>>> clearing the list of pool volumes while someone is adding a new volume. Here
>>> is the valgrind log I have for that:
>>>
>>> http://paste.opensuse.org/58733866
>>>
>>
>> What version of libvirt?  Your line numbers don't match up with current
>> HEAD.
> 
> It's a 3.8.0.
> 
>>> I tried adding a call to virStoragePoolObjLock() in the
>>> virStorageVolPoolRefreshThread(), since the storageVolCreateXML()
>>> appears to have it, but I'm getting a dead lock. Here are the waiting
>>> threads I got from gdb:
>>
>> virStoragePoolObjFindByName returns a locked pool obj, so not sure
>> where/why you added this...
> 
> Oh I overlooked that... then there is already a lock on it, the solution
> is elsewhere
> 
>>>
>>> http://paste.opensuse.org/45961070
>>>
>>> Any idea what would be a proper fix for that? My vision of the storage
>>> drivers locks is too limited it seems ;)
>>
>> From just a quick look...
>>
>> The virStorageVolPoolRefreshThread will take storageDriverLock, get a
>> locked pool object (virStoragePoolObjFindByName), clear out the pool,
>> add back volumes that it knows about, unlock the pool object, and call
>> storageDriverUnlock.
>>
>> If you were adding a new pool... You'd take storageDriverLock, then
>> eventually virStoragePoolObjAssignDef would be called and the pool's
>> object lock taken and unlocked, and returns a locked pool object which
>> gets later unlocked in cleanup, followd by a storageDriverUnlock.
>>
>> If you're adding a new volume object, you'd get a locked pool object via
>> virStoragePoolObjFromStoragePool, then if building, that lock gets
>> dropped after increasing the async job count and setting the building
>> flag, the volume is built, then the object lock retaken while
>> temporarily holding the storageDriverLock, the async job count gets
>> decremented and the building flag cleared, eventually we fall into
>> cleanup with unlocks the pool again.
> 
> Thanks for the details: this is really useful to globally understand how
> locks are working in the storage driver. Maybe worth turning it into a doc
> somewhere?
> 

You mean code isn't self documenting, shocking ;-) [tongue firmly
planted in cheek].

Perhaps when I complete the conversion for Storage Pool/Volume and do
more Domain fix-ups I can add something somewhere that "generally"
describes internal flow.  The goal I've had is to be fairly similar for
all drivers w/r/t object usage and code flow.  It's taking a long time
to get there though because it's a lot of "roto-tilling" of code.


John

FWIW: I saw your query before we hopped in the car for a 7 hour drive -
figured at least providing something before I left would be helpful ;-).

>> So how to fix - well seems to me the storageDriverLock in VolCreateXML
>> may be the way since the refresh thread takes the driver lock first,
>> then the pool object second.  Perhaps even like the build code where it
>> takes it temporarily while getting the pool object. I'd have to think a
>> bit more about though. Still might be something to try since the Vol
>> Refresh thread takes it while running...
> 
> Ok, I'll look into that direction.
> 
>> John
>>
>> Not related to this problem per se, but what may help even more is if I
>> can get the storage driver usage of a common object model patches
>> completely reviewed, but that's a different problem ;-)... I'll have to
>> go look and see if I may have fixed this there. The newer model uses
>> hash tables, RW locks, and reduces the storage driver hammer lock, but
>> this one condition may not have been addressed.
> 
> I can have a go at it and see if I can reproduce the crash with it. Not sure
> I'll be the one skilled enough for the full review though ;)
> 
> --
> Cedric
> 


[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]