[dm-devel] [QUESTION]: multipath device with wrong path lead to metadata err

lixiaokeng lixiaokeng at huawei.com
Wed Jan 20 02:30:58 UTC 2021


Hi Martin:
    Thanks for your reply.


> verify_paths() would detect this. We do call verify_paths() in
> coalesce_paths() before calling domap(), but not immediately before.
> Perhaps we should move the verify_paths() call down to immediately
> before the domap() call. That would at least minimize the time window
> for this race. It's hard to avoid it entirely. The way multipathd is
> written, the vecs lock is held all the time during coalesce_paths(), 
> and thus no uevents can be processed. We could also consider calling
> verify_paths() before *and* after domap().

Would calling verify_paths() both before *and* after domap() close this
race entirely?
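
To make the question concrete, here is a toy sketch of the "check before
and after" idea. It is not multipathd code: toy_path, toy_verify(),
toy_load_map(), the race callback and the "valid" flag are all
hypothetical stand-ins, and they make no attempt to mirror the real
verify_paths() or domap() signatures.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are multipath-tools APIs. */
struct toy_path {
	const char *dev;	/* device name, e.g. "sdi" */
	bool valid;		/* assumed: node still refers to the expected device */
};

enum toy_result { TOY_OK, TOY_SKIPPED, TOY_ROLLED_BACK };

/* Simulates the outside world changing while the map is being set up. */
typedef void (*toy_race_fn)(struct toy_path *paths, size_t n);

/* Toy verify step: true only if every path is still valid. */
static bool toy_verify(const struct toy_path *paths, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (!paths[i].valid)
			return false;
	return true;
}

/* Verify, "load the map", then verify again. */
static enum toy_result toy_load_map(struct toy_path *paths, size_t n,
				    toy_race_fn race)
{
	if (!toy_verify(paths, n))
		return TOY_SKIPPED;	/* caught before touching the map */

	if (race)
		race(paths, n);		/* a path changes inside the window */

	/* ... the map would be loaded here ... */

	if (!toy_verify(paths, n))
		return TOY_ROLLED_BACK;	/* caught only after the load */

	return TOY_OK;
}

/* Example race: the first path becomes invalid mid-operation. */
static void toy_drop_first(struct toy_path *paths, size_t n)
{
	if (n > 0)
		paths[0].valid = false;
}
```

In this toy model, passing toy_drop_first as the callback makes
toy_load_map() return TOY_ROLLED_BACK instead of TOY_OK, so the second
check does notice the change. A change that lands after the second check
would still go unnoticed, so the sketch only narrows the window; it does
not close it.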

> Was this a map creation or a map reload? Was the map removed after the
> failure? Do you observe the message "ignoring map" or "removing map"?
>
> Do you observe a "remove" uevent for sdi? 

This was a map reload, but sdi was not in the old map. The "removing map"
message was observed. No "remove" uevent for sdi was observed.

> I wonder if you'd see the issue also if you run the same test without
> the "multipath -F; multipath -r" loop, or with just one. Ok, one
> multipath_query() loop simulates an admin working on the system, but 2
> parallel loops - 2 admins working in parallel, plus the intensive
> sequence of actions done in multipathd_query at the same time? The
> repeated "multipath -r" calls and multipathd commands will cause
> multipathd to spend a lot of time in reconfigure() and in cli_* calls
> holding the vecs lock, which makes it likely that uevents are missed or
> processed late.

As you said, there were lots of cli_* calls but no uevents at the time
the error occurred. After those calls finished, hundreds of uevents
showed up (for example, "Forwarding 201 uevents" in the log).

> Don't get me wrong, I don't argue against tough testing. But we should
> be aware that there are always time intervals during which multipathd's
> picture of the present devices is different from what the kernel sees.

What you said is very reasonable. When this problem was found, I thought
it would be difficult to solve entirely, although it rarely happens. I
will discuss with the testers whether the test scripts are reasonable.

> There's definitely room for improvement in multipathd wrt locking and
> event processing in general, but that's a BIG piece of work.

Thanks again!
Regards
Lixiaokeng




