[dm-devel] more multipath deadlocks -- this time involving memory

goggin, edward egoggin at emc.com
Tue Mar 22 02:34:20 UTC 2005


Looks like some troublesome deadlock issues involving multipath, memory,
and all-paths-down use cases.  While one might typically expect such a use
case to result in errors, deadlock is not to be expected.  Furthermore, for
non-destructive ucode upgrades of an EMC CLARiion storage system, it is
expected that for a short period of time, all paths to the storage system
in question will appear to a host to be failed.  It is expected that any
multipathing solution will ride through this NDU scenario without a problem.

While I see three separate instances of the problem being plausible, I have
only seen the first problem instance described below.  The second and third
scenarios require high levels of memory contention which I have not spent
significant time creating.

The first problem scenario involves a deadlock between multipathd and
syslogd.  The second scenario involves the potential for multipathd,
multipath, or any of the executables invoked by multipath to be deadlocked
trying to do synchronous page reclamation while allocating memory pages for
user or kernel heap memory in a system with a high degree of memory
contention and several multipath mapped devices in an all-paths-down failure
state.  The third scenario is extremely similar to the second but involves
the need to allocate pages not for heap memory but to swap in working set
pages for multipathd, multipath, or any of the executables invoked by
multipath.

First, it seems like __every__ time I try an NDU of the EMC CLARiion ucode,
one of the two (checkerloop or waiterloop) multipathd sub-threads gets
blocked in unix_wait_for_peer waiting to send a syslog message through
a UNIX domain socket to syslogd.  Unfortunately, syslogd is itself blocked
in blk_congestion_wait, while trying to write log info to its
/var/log/messages log file, waiting for the number of dirty pages in the
page cache to drop below a pre-defined threshold.  Unfortunately, getting
the dirty page count to drop depends on the multipathd checkerloop thread
periodically checking path connectivity and invoking multipath in order to
reconfigure multipath maps and/or re-enable some now valid paths.  Since the
multipathd waiterloop event thread will deadlock on the multipathd allpaths
mutex currently owned by the checkerloop thread, starting i/o on a failed
path will not free up the log jam.  Assuming enough free memory is available
to do so, manually running multipath often resolves the problem.  Yet,
this is hardly a workaround to be recommended to an enterprise customer.

I have only been able to avoid this deadly embrace by killing syslogd
before starting the test.  Without syslogd running, I made it through
this test 3 consecutive times.  It seems that I cannot get through the test
at all with syslogd running.  I think simply changing syslogd to do direct
i/o instead of page cache buffered i/o to its log file(s) will avoid this
problem.  I am running with 2.6.11-rc3-upd2 kernel and 0.4.3-pre9 multipath
tools by the way.
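
To make the direct i/o suggestion concrete, here is a minimal sketch of
appending a log record with O_DIRECT on Linux.  It is not taken from syslogd;
the names fill_log_block and direct_log_write, the 4096-byte record size, and
the space padding are all assumptions made for illustration.  O_DIRECT needs
the buffer and transfer size to be suitably aligned, which is why each message
is padded to a full block, and some filesystems reject O_DIRECT entirely.

```c
#define _GNU_SOURCE             /* exposes O_DIRECT on Linux */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Assumed record size: must be a multiple of the device's logical block size. */
#define LOG_BLOCK 4096

/* Copy msg into a LOG_BLOCK-sized buffer, space padded, newline terminated.
 * Returns the number of message bytes kept (long messages are truncated). */
size_t fill_log_block(char *block, const char *msg)
{
    size_t n = strlen(msg);

    if (n > LOG_BLOCK - 1)
        n = LOG_BLOCK - 1;
    memset(block, ' ', LOG_BLOCK);
    memcpy(block, msg, n);
    block[LOG_BLOCK - 1] = '\n';
    return n;
}

/* Append one padded record to path, bypassing the page cache.
 * Returns 0 on success, -1 on any failure. */
int direct_log_write(const char *path, const char *msg)
{
    void *block = NULL;
    int fd, rc = -1;

    /* O_DIRECT requires an aligned buffer. */
    if (posix_memalign(&block, LOG_BLOCK, LOG_BLOCK) != 0)
        return -1;
    fill_log_block(block, msg);

    fd = open(path, O_WRONLY | O_CREAT | O_APPEND | O_DIRECT, 0640);
    if (fd >= 0) {
        if (write(fd, block, LOG_BLOCK) == LOG_BLOCK)
            rc = 0;
        close(fd);
    }
    free(block);
    return rc;
}
```

A real daemon would presumably allocate the aligned buffer once at startup
rather than per message, so that logging itself never has to allocate memory.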

The 2nd scenario involves blockable user or kernel memory allocation
requests requiring page write-out of dirty pages on multipath mapped devices
in the synchronous page reclaim algorithm of __alloc_pages.  Seems to me that
while mlockall can pin all current and future pages of a process's working
set, it does not prevent synchronous page reclamation by the process as part
of a blockable page allocation request.  If many of the mapped devices are in
queue congestion mode due to failed paths on a storage system which is
queuing failed bios (as the EMC CLARiion must), multipathd, multipath, or any
executable invoked by multipath could block trying to page out dirty pages to
these mapped devices while trying to allocate memory before being able to
inform the kernel resident multipath components of the existence of valid
paths for these mapped devices.

The 3rd scenario involves the need to mlockall for all executables which are
invoked by multipathd (multipath) and the executables invoked by these
executables (scsi_id, /bin/false, ...).  Otherwise, any of these executables
can block during page reclamation while trying to allocate free pages.  Also,
does the effect of mlockall survive in the parent beyond fork/clone call or
does it need to be renewed afterwards?
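
For reference, the Linux mlock(2) man page states that memory locks are not
inherited by a child created via fork and are removed on execve, so a child
that needs pinned pages has to call mlockall itself, and an exec'ed helper
would need its own call in main.  A minimal sketch of renewing the lock in a
forked child follows; lock_all_memory, run_locked_child and dummy_path_check
are hypothetical names, not multipath-tools functions.

```c
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Pin current and future pages of the calling process.
 * Returns 0 on success or a negative errno value (e.g. -EPERM, -ENOMEM). */
int lock_all_memory(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        return -errno;
    return 0;
}

/* Hypothetical stand-in for the work a helper process would do. */
int dummy_path_check(void)
{
    return 0;
}

/* Fork, renew the memory lock in the child, run fn there, and return the
 * child's exit status (or -1 if the fork or the wait fails). */
int run_locked_child(int (*fn)(void))
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Locks held by the parent do not carry over to this process.
         * On failure the child simply continues unpinned in this sketch. */
        (void)lock_all_memory();
        _exit(fn() & 0xff);
    }
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Calling run_locked_child(dummy_path_check) returns the child's exit status.
Note that this only pins pages; as described for the 2nd scenario, it does
not stop a later blockable allocation from entering synchronous reclaim.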

Overall, it seems like the code path to test and restore a target path of a
multipath mapped device should not require any blockable memory allocations.
This would of course rule out fork/clone/exec.  A possible alternative design
to the current one is to pre-allocate or reserve the memory requirements for
these tasks -- possibly enough memory for testing and restoring a single path
to a single LU at a time.  While this design would be tuned specifically to
this job, I think it would not need to be kernel resident.
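
A rough user-space sketch of the pre-allocation idea: reserve a fixed buffer
at startup, fault it in and lock it, then serve all memory needed for one
path check from that buffer so the check never calls malloc.  The names
(reserve_init, reserve_alloc, reserve_reset) and the 64 KiB budget are
assumptions for illustration, not existing multipath-tools code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Assumed memory budget for testing and restoring one path to one LU. */
#define RESERVE_BYTES (64 * 1024)

static _Alignas(16) unsigned char reserve[RESERVE_BYTES];
static size_t reserve_used;

/* Call once at startup, before any path has failed.
 * Returns the result of mlock: 0 on success, -1 if the pages could not be
 * locked (for example when the process lacks the privilege). */
int reserve_init(void)
{
    memset(reserve, 0, sizeof(reserve));    /* write to every page */
    reserve_used = 0;
    return mlock(reserve, sizeof(reserve));
}

/* Hand out n bytes (rounded up to 16) from the reserve.
 * Makes no system call and no heap allocation; NULL when exhausted. */
void *reserve_alloc(size_t n)
{
    size_t aligned = (n + 15) & ~(size_t)15;
    void *p;

    if (aligned < n || aligned > RESERVE_BYTES - reserve_used)
        return NULL;
    p = reserve + reserve_used;
    reserve_used += aligned;
    return p;
}

/* Release everything at once when the path check is finished. */
void reserve_reset(void)
{
    reserve_used = 0;
}
```

The intended usage is reserve_init once at daemon start, then reserve_alloc
during a path check and reserve_reset when it completes.  A single-threaded
bump allocator is used here only to keep the sketch short; checking more than
one path at a time would need a reserve per checker or locking.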



