[Pulp-list] Download analytics from CDN?

Wed Apr 7 01:37:13 UTC 2021

Thanks for following up. Yes, the query string *should* be there. I found
this bug last week when I was looking into it, though (basically, telling
django-storages to use CloudFront breaks the query string appending code).
I'm back from away-from-keyboard vacation tomorrow, and should be able to
get some patches sent upstream. :)

https://github.com/jschneier/django-storages/issues/997
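
For anyone following along, here is a rough sketch of the idea in plain
Python. It is not the django-storages or Pulp code; the helper names and
the example URL are made up, and the "attachment;filename=..." value is an
assumption about the header format. It only shows how a
response-content-disposition parameter could ride along on the redirect URL
and be pulled back out of a logged query string afterwards.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode, urlsplit, urlunsplit

# Real S3 request parameter name; everything else here is illustrative.
PARAM = "response-content-disposition"


def add_disposition(url, filename):
    """Return url with a response-content-disposition parameter appended.

    Hypothetical helper: keeps any existing query parameters intact.
    """
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    # Assumed value format; the exact string Pulp sends may differ.
    query.append((PARAM, f"attachment;filename={filename}"))
    return urlunsplit(parts._replace(query=urlencode(query)))


def filename_from_query(query):
    """Recover the filename from a raw query string, or None if absent.

    Hypothetical helper: 'query' is what a log processor would read from
    the query-string field of an access log entry.
    """
    values = parse_qs(query).get(PARAM)
    if not values:
        return None
    for part in values[0].split(";"):
        part = part.strip()
        if part.startswith("filename="):
            return part[len("filename="):].strip('"')
    return None


if __name__ == "__main__":
    # Example artifact URL is invented for the demonstration.
    tagged = add_disposition("https://cdn.example.com/artifact/ab/cdef", "foo-1.0.rpm")
    print(tagged)
    print(filename_from_query(urlsplit(tagged).query))
```

Whether the analytics tools can be pointed at that parameter is a separate
question, but it would at least put the pretty name into the logs.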

--Danny

On Tue, Apr 6, 2021, 2:07 PM David Davis <daviddavis at redhat.com> wrote:

> Hi Danny,
>
> I don't know much about AWS logging but Pulp does set the filename in the
> response-content-disposition[0]. Could that be used to determine the
> filename for each request?
>
> If not, I'm looking at the boto3 docs for get_object[1] to see if there's
> another parameter we could set to help you track the filename in requests
> but I'm not seeing anything useful. My knowledge of s3 is a bit limited so if
> you have a suggestion how we can construct a request to S3 that would help
> you to track the filenames of requests to s3, I could probably look at how
> we could support it in Pulp 3.
>
> [0]
> https://github.com/pulp/pulpcore/blob/f38f955425b185749b3c8d4d878a7e166cfc05b9/pulpcore/content/handler.py#L613-L614
> [1]
> https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.get_object
>
> David
>
>
> On Tue, Mar 30, 2021 at 10:43 AM Danny Sauer <danny.sauer at konghq.com>
> wrote:
>
>> I've got Pulp set up to serve all the content from S3 behind CloudFront.
>> This works really well, except for a minor issue: the content URLs are all
>> the UUIDs for artifacts, not, for example, the pretty name of the RPM being
>> downloaded.  That's an issue in my situation because we'd really like to
>> generate download analytics using off-the-shelf tools which consume the AWS
>> CDN standard log format.
>>
>> My initial thought was that it might be easy to have the redirects
>> include a query string in the generated URL which notes the original
>> filename or relative path requested.  But I don't have sufficiently
>> developed Django skills to know the easiest way to do that (or if it's even
>> reasonable to think that's easy).  Using the content server's logs is
>> another option, but I have some other content on the same S3 bucket which
>> may not necessarily be reached solely through Pulp's content server, so
>> that means two log locations, etc.  If it was easy to make Django /
>> Gunicorn log to an S3 bucket in a manner similar to CloudFront, that might
>> also be ok.  Post-processing logs with a series of API calls to work out
>> what artifact maps to what repository content would ideally be a last
>> resort.
>>
>> Anyone have some great insights which might help me out here? :)  If it
>> helps, I'm building my own Docker images which ultimately run in EKS.  So
>> patches / extra modules are an option, but I'd prefer to stay as close to
>> vanilla upstream as possible with environment variable-based config
>> adjustments.
>>
>> Thanks.
>> --Danny
>> _______________________________________________
>> Pulp-list mailing list
>> Pulp-list at redhat.com
>> https://listman.redhat.com/mailman/listinfo/pulp-list
>
>