[Pulp-dev] Proposal: Changing in 3.1 that Artifact.save() will hard-link/copy, not move

Brian Bouterse bmbouter at redhat.com
Fri Jan 10 21:40:34 UTC 2020


I was able to get this working, and with a double ack I merged it. Further
refinements are welcome.

https://github.com/pulp/pulpcore/pull/486

On Fri, Jan 10, 2020 at 9:56 AM Brian Bouterse <bmbouter at redhat.com> wrote:

> Here's a summary of what I've learned about our options to resolve the two
> issues linked in my original email. tl;dr we should maintain a "move"
> behavior like we have now whenever that file isn't file:///.
>
> It's inefficient to duplicate data that was uploaded from the user, or
> downloaded with the http downloader (this is a speed issue). It's also
> undesirable in size; if you're syncing a large RHEL you would spend maybe
> 20ish gigs in the working directory, and another 20 in the storage backend.
> This is undesirable, so we should stick with moving data whenever we can.
>
> So for the case of syncing from file:///, that is the one where we need to
> copy data. So how should we efficiently copy data?
>
> reflinks would be sweet for copying and I prototyped it, but Pulp would
> pick up a dependency on 'reflink'
> https://reflink.readthedocs.io/en/latest/ which requires gcc client build
> at install time. That causes it to fail on Travis. I don't think reflink is
> a significant enough benefit to cause our pip installs of pulpcore to
> require a non-pip dependency. What do you think about this tradeoff?
>
> Also, reflinks will matter less over time. We would use shutil.copyfile()
> to perform the copying and in Python 3.8 they already improved the
> implementation when the source and destination are able to share inodes (
> https://bugs.python.org/issue33671 ). Also they are adding new calls with
> a full reflink implementation in Python 3.9 AIUI (
> https://bugs.python.org/issue37157 ).
>
> Feedback and ideas are welcome. On the current course I'm going to focus
> on having file:/// copy, leaving the existing "move" as is. Also I'm going
> to update the Storage backend to use the _save() method instead of save()
> which is the recommended way django says to customize our storage system.
>
>
>
> On Wed, Jan 8, 2020 at 6:10 PM Brian Bouterse <bmbouter at redhat.com> wrote:
>
>> It's not finished, but I have a POC with tests passing locally for me
>> here: https://github.com/pulp/pulpcore/pull/486 It does use the reflink
>> library and falls back to copy.
>>
>> What still needs to be done is:
>> * To ensure files are being cleaned up in the case of uploads.
>> * Also I think we have a bug where some cases may not cleanup the worker
>> directory which would now leave those files around forever.
>>
>> I will resume on those things tomorrow.
>>
>> On Wed, Jan 8, 2020 at 12:32 PM Brian Bouterse <bmbouter at redhat.com>
>> wrote:
>>
>>>
>>>
>>> On Wed, Jan 8, 2020 at 12:02 PM Simon Baatz <gmbnomis at gmail.com> wrote:
>>>
>>>> On Tue, Jan 07, 2020 at 04:45:54PM -0500, Brian Bouterse wrote:
>>>> >    We had two bugs filed recently [0][1] which suggest that when
>>>> using the
>>>> >    default backend for Pulp, i.e.
>>>> pulpcore.app.models.storage.FileSystem
>>>> >    Pulp should not be "moving" files. This is the default behavior
>>>> Django
>>>> >    gives us, and it destroys data when sync'ed from file:/// for
>>>> example
>>>> >    [1].
>>>>
>>>> I am surprised that file:/// should be supported at all. This means
>>>> that a worker process must be able to access basically any file on the
>>>> system. Is this covered by the (upcoming?) SELinux policy? I would
>>>> expect workers to be more constrained than that. Or do we expect
>>>> the user to label files before importing?
>>>>
>>> The user would need to label the files ahead of time. The use case is
>>> that a large amount of content is stored on a hard drive which is mounted
>>> on the worker. A lot of setups of Pulp use this as a mass-import method
>>> before switching their sync's back to the CDN.
>>>
>>>
>>>>
>>>> >    I propose that with 3.1 we fix this bug by switching
>>>> >    pulpcore.app.models.storage.FileSystem to leave files in place and
>>>> >    either hard-link (same filesystem) or copy (different filesystem).
>>>>
>>>> We should not use hard links:
>>>>
>>>> - On systems with sysctl "fs.protected_hardlinks" enabled (Ubuntu,
>>>> Fedora),
>>>>   the files would have to be owned by the pulp user to be able to
>>>>   create hard-links at all.
>>>> - On systems with SELinux enabled, these hard links will share
>>>>   the labels.
>>>> - Imported hard links are not protected against modification of the
>>>>   original file.
>>>>
>>>> We could try to reflink the file and fall-back to copy (like "cp
>>>> --reflink=auto" does)
>>>>
>>> Great idea; let me try doing it this way. I'll link to my POC pr when
>>> it's up. Feel free also to share one.
>>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/pulp-dev/attachments/20200110/bb659471/attachment.htm>


More information about the Pulp-dev mailing list