[dm-devel] [mdadm PATCH 4/4] Create: tell udev device is not ready when first created.

Mon May 1 04:35:17 UTC 2017

On Fri, Apr 28 2017, Peter Rajnoha wrote:

> On 04/28/2017 05:55 AM, NeilBrown wrote:
>> On Wed, Apr 26 2017, Peter Rajnoha wrote:
>> 
>>> On 04/20/2017 11:35 PM, NeilBrown wrote:
>>>> If we wanted an more permanent udev rule, we would need to record the
>>>> devices that should be ignored in the filesystem somewhere else.
>>>> Maybe in /run/mdadm.
>>>> e.g.
>>>>
>>>>  KERNEL=="md*", TEST="/run/mdadm/creating-$kernel", ENV{SYSTEMD_READY}="0"
>>>>
>>>> Then we could have something like the following (untested) in mdadm.
>>>> Does that seem more suitable?
>>>>
>>>
>>> Yes, please, if possible, go for a permanent udev rule surely - this
>>> will make it much easier for other foreign tools to hook in properly if
>>> needed and it would also be much easier to debug.
>> 
>> I'm leaning towards the files-in-/run/mdadm approach too.  I'll make a
>> proper patch.
>> 
>>>
>>> But, wouldn't it be better if we could just pass this information ("not
>>> initialized yet") as RUN_ARRAY md ioctl parameter? In that case the md
>>> driver in kernel could add the variable to the uevent it generates which
>>> userspace udev rules could check for easily. This way, we don't need to
>>> hassle with creating files in filesystem and the information would be
>>> directly a part of the uevent the md kernel driver generates (so
>>> directly accessible in udev rules too). Also, possibly adding more
>>> variables for other future scenarios if needed.
>> 
>> When would we clear the "not initialised yet" flag in the kernel, and
>> how?  And would that be enough.
>
> The flag wouldn't be stored in kernel, md kernel driver would just pass
> the flag with the uevent as it received in with ioctl/sysfs request to
> create a new dev. The udev in userspace would handle the state
> transition then from "flagged as not-initialized" state to "initilized"
> by checking the sequence of events as they come.
>
> We should have this sequence I assume:
>
>   1) "add" (creating dev, not usable yet)
>   2) "change" (activated dev, but not initialized yet)
>   3) "synthetic change" (after wiping the dev and closing it, the WATCH
> rule fires)
>

"Should" is arguable, but there are no guarantees of this sequence.

A particular cause of irregular sequencing is when an array is assembled
by the initrd, then after the real root is mounted, something runs
  udevadm trigger --action=add
(e.g. systemd-udev-triggger.service).

In that case, 'add' is the first and only message that the
full-root-available udev sees, so it is not safe to ignore the array as
"not usable yet".

>> 
>> When mdadm creates an md array, at least 3 uevents get generated.
>> The first is generated when the md device object is created, either by
>> opening the /dev/mdXX file, or by writing magic to somewhere in sysfs.
>> The second is generated when the array is started and the content is
>> visible.
>> The third is generated when mdadm closes the file descriptor.  It opens
>> /dev/mdXX for O_RDWR and performs ioctls on this, and then closes it.
>> Because udev uses inotify to watch for a 'close for a writable file
>> descriptor', this generates another event.
>> 
>> We need to ensure that none of these cause udev to run anything that
>> inspects the contents of the array.
>> Of the three, only the second one is directly under the control of the
>> md module, so only that one can add an environment variable.
>> 
>> It might be possible to avoid the last one (by not opening for write),
>> but I cannot see a way to avoid the first one.
>
> So the first event is the "add" event ("md device object created") - in
> this case, the device is not yet usable anyway I suppose, so we can skip
> udev scans for this event right away (unless it's the synthetic "add"
> event which is generated by writing "add" to /sys/block/.../uevent file
> or alternatively using udevadm trigger - but we should be able to
> recognize this event "synthetic add event" because it always comes after
> the activating "change" event, otherwise we can skip scans).

"not yet usable anyway" is not reliable.  It is very easy for mdadm to
finish making the array usable before udev gets around to processing the
initial "add" event.

You seem to be suggesting that udev rules should try to reverse-engineer
the sequence of events and deduce what is meant from that sequence.  I
doubt that would be very robust.

>
> The second event, which is the "change" event, would be marked with the
> new "not initialized" flag. And so we skip the scans in udev too.

(one obvious problem with this approach is that it cannot work with old
kernels, while my approach only requires an update to mdadm)

>
> Then mdadm opens the devive, clears any old content/signatures the data
> area may contain, then closes it - this generates the third event -
> which is the "synthetic change" event (as a result of the inotify WATCH
> rule). And this one would drop the "not initialized" flag in udev db and
> the scans in udev are now enabled.

Nope, it would be incorrect for mdadm to clear any old content.
Sometimes people want to ignore old content.  Sometimes they very
definitely want to use it.  It would be wrong for any code to try to
guess what is wanted.

>
> So we should be able to handle all three kinds of events I think.
>
> Now, as for even better synthetic event recognition, I've proposed a
> patch recently where udev as well as any other tool generating these
> synthetic events can add variables for more detailed event
> identification, so this one should help us in the future even more in
> these situations: https://lkml.org/lkml/2017/3/15/461. With this, we can
> even disable the WATCH rule till the device is properly initialized and
> the tool can generate the event itself by writing the
> /sys/block/.../uevent file with a variable that the tool can then wait
> for even (so the tool doesn't exit till the device is not properly
> initialized). Once this initialization is all done, the WATCH rule can
> be enabled for the dev. Also, with this, we don't need to be afraid that
> some other tool fired the WATCH rule by chance if it opened the dev for
> RW and closed it by mistake before we had a chance to initialize it
> (which would fire the synthetic change event before the
> wiping/initialization).

It sounds to me like you are adding a lot of complexity to an
already-fragile system.  I'm not filled with confidence - sorry.

Thanks,
NeilBrown

>
>> 
>> I don't think that making a file appear in /run is really very different
>> from making a variable appear in a uevent.   If the variable were
>> describing the event itself there would be a different, but it would be
>> describing the state of the device.  So the only important difference is
>> "which is easier to work with".  I think creating an deleting a file is
>> easier to setting and clearing a variable.
>> 
>
> Yes, I agree that it's an alternative solution - it definitely and
> surely improves current situation, either if we choose to write the file
> or pass the flag in the uevent directly. It's just that with the
> information written in filesystem, we have something external to check
> for in addition to processing the uevent variables while we don't need
> to, I think. As I described the sequence of events above, I think we
> should be able to recognize the events properly and we should be able to
> drop the flag automatically.
>
> -- 
> Peter
>
> --
> dm-devel mailing list
> dm-devel at redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 832 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/dm-devel/attachments/20170501/ec73fc1d/attachment.sig>