[Libguestfs] [PATCH nbdkit v2] curl: Implement header and cookie scripts.

Mon Jul 20 15:20:48 UTC 2020

On 7/15/20 3:53 PM, Richard W.M. Jones wrote:
> This rather complex feature solves a problem for certain web services
> that require a cookie or token for access, especially one which must
> be periodically renewed.
> 
> For motivation for this feature see the included documentation, and
> item (1)(b) here:
> 
> https://www.redhat.com/archives/libguestfs/2020-July/msg00069.html
> ---

I see you pushed this in the meantime, but a late review is better than 
none ;)  As usual, a stream-of-consciousness single pass through:

>   
> +=head1 HEADER AND COOKIE SCRIPTS
> +
> +While the C<header> and C<cookie> parameters can be used to specify
> +static headers and cookies which are used in every HTTP/HTTPS request,
> +the alternate C<header-script> and C<cookie-script> parameters can be
> +used to run an external script or program to generate headers and/or
> +cookies.  This is particularly useful to access services which require
> +an authorization token.  In addition the C<header-script-renew> and
> +C<cookie-script-renew> parameters allow you to renew the authorization
> +token by rerunning the script periodically.

What happens if you have header-script-renew=5, but a very slow client 
that sits idle more more than 5 seconds at a time?  Are you burning CPU 
time every five seconds just to throw it away, or are you letting it 
idle until the start of each client command where you check whether the 
time since last renew is larger than the limit.  (I guess that's 
implementation, not worth documenting, so I'll see if I spot it below)

> +
> +C<header-script> is incompatible with C<header>, and C<cookie-script>
> +is incompatible with C<cookie>.
> +
> +=head2 Header script
> +
> +The header script should print zero or more HTTP headers, each line of
> +output in the same format as the C<header> parameter.  The headers
> +printed by the script are passed to L<CURLOPT_HTTPHEADER(3)>.
> +
> +In the following example, an imaginary web service requires
> +authentication using a token fetched from a separate login server.
> +The token expires after 60 seconds, so we also tell the plugin that it
> +must renew the token (by re-running the script) if more than 45
> +seconds have elapsed since the last request:
> +
> + nbdkit curl https://service.example.com/disk.img \
> +        header-script='
> +          echo -n "Authorization: Bearer "

'echo -n' is not 100% portable; better might be using 'printf'

> +          curl -s -X POST https://auth.example.com/login |
> +               jq -r .token
> +        ' \
> +        header-script-renew=50

Example is inconsistent with text (45 vs. 50), but it looks like you 
answered my question above, and DO let the script idle when the client 
is idle.

> +=head2 Example: VMware ESXi cookies
> +
> +VMware ESXi’s web server can expose both VMDK and raw format disk
> +images, but requires you to log in using HTTP Basic Authentication.
> +While you can use the C<user> and C<password> parameters to send HTTP
> +Basic Authentication headers in every request, tests have shown that
> +it is faster to accept the cookie which the server returns and send
> +that instead.  (It is not clear why it is faster, but one theory is
> +that VMware has to do a more expensive username and password check
> +each time.)

The perils of having to reverse-engineer closed-source software behavior ;)

> +
> +The web server can be accessed as below.  Since the cookie expires
> +after a certain period of time, we use C<cookie-script-renew>, and
> +because the server uses a self-signed certificate we must use
> +I<--insecure> and C<sslverify=false>.
> +
> + SERVER=esx.example.com
> + DCPATH=data
> + DS=datastore1
> + GUEST=guest-name
> + URL="https://$SERVER/folder/$GUEST/$GUEST-flat.vmdk?dcPath=$DCPATH&dsName=$DS"
> +
> + nbdkit curl "$URL" \
> +        cookie-script='
> +            curl --head -s --insecure -u root:password "$url" |
> +                 sed -ne '{ s/^Set-Cookie: \([^;]*\);.*/\1/ip }'
> +        ' \

Umm, this is nesting '' inside ''.   You probably need '\'' in the sed line.

> +        cookie-script-renew=500 \
> +        sslverify=false
> +
> +=head2 Example: Docker Hub authorization tokens
> +
> +Accessing objects like container layers from Docker Hub requires that
> +you first fetch an authorization token, even for anonymous access.
> +These tokens expire after about 5 minutes (300 seconds) so must be
> +periodically renewed.
> +
> +You will need this authorization script (F</tmp/auth.sh>):
> +
> + #!/bin/sh -
> + IMAGE=library/fedora
> + curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:$IMAGE:pull" |
> +      jq -r .token
> +
> +You will also need this script to get the blobSum of the layer
> +(F</tmp/blobsum.sh>):
> +
> + #!/bin/sh -
> + TOKEN=`/tmp/auth.sh`
> + IMAGE=library/fedora
> + curl -s -X GET -H "Authorization: Bearer $TOKEN" \
> +      "https://registry-1.docker.io/v2/$IMAGE/manifests/latest" |
> +      jq -r '.fsLayers[0].blobSum'
> +
> +Both scripts must be executable, and both can be run on their own to
> +check they are working.  To run nbdkit:
> +
> + IMAGE=library/fedora
> + BLOBSUM=`/tmp/blobsum.sh`
> + URL="https://registry-1.docker.io/v2/$IMAGE/blobs/$BLOBSUM"
> +
> + nbdkit curl "$URL" \
> +        header-script=' echo -n "Authorization: Bearer "; /tmp/auth.sh ' \

Another s/echo -n/printf/

> +        header-script-renew=200 \
> +        --filter=gzip
> +
> +Note that this exposes a tar file over NBD.  See also
> +L<nbdkit-tar-filter(1)>.
> +
>   =head1 DEBUG FLAG
>   

> +++ b/plugins/curl/curl.c
> @@ -48,9 +48,11 @@
>   
>   #include <nbdkit-plugin.h>
>   
> -#include "cleanup.h"
>   #include "ascii-ctype.h"
>   #include "ascii-string.h"
> +#include "cleanup.h"
> +
> +#include "curldefs.h"
>   
>   /* Macro CURL_AT_LEAST_VERSION was added in 2015 (Curl 7.43) so if the
>    * macro isn't present then Curl is very old.
> @@ -61,24 +63,29 @@
>   #endif
>   #endif
>   
> -static const char *url = NULL;  /* required */
> +/* Plugin configuration. */
> +const char *url = NULL;         /* required */
>   
> -static const char *cainfo = NULL;
> -static const char *capath = NULL;
> -static char *cookie = NULL;
> -static struct curl_slist *headers = NULL;
> -static char *password = NULL;
> -static long protocols = CURLPROTO_ALL;
> -static const char *proxy = NULL;
> -static char *proxy_password = NULL;
> -static const char *proxy_user = NULL;
> -static bool sslverify = true;
> -static bool tcp_keepalive = false;
> -static bool tcp_nodelay = true;
> -static uint32_t timeout = 0;
> -static const char *unix_socket_path = NULL;
> -static const char *user = NULL;
> -static const char *user_agent = NULL;
> +const char *cainfo = NULL;
> +const char *capath = NULL;
> +char *cookie = NULL;
> +const char *cookie_script = NULL;
> +unsigned cookie_script_renew = 0;
> +struct curl_slist *headers = NULL;
> +const char *header_script = NULL;
> +unsigned header_script_renew = 0;
> +char *password = NULL;

All of these variables are already guaranteed zero-initialized without 
being explicit,

> +long protocols = CURLPROTO_ALL;

although consistency with this initialization (which is not necessarily 
0) makes sense.

> @@ -450,7 +490,11 @@ curl_open (int readonly)
>   
>     /* Get the file size and also whether the remote HTTP server
>      * supports byte ranges.
> +   *
> +   * We must run the scripts if necessary and set headers in the
> +   * handle.
>      */
> +  if (do_scripts (h) == -1) goto err;

Okay, code matches my guess above - you DO check on each client 
interaction, and run the script as-needed, rather than kicking off a 
background thread that runs the script on a blind timeout cycle.

> +++ b/plugins/curl/scripts.c

> +
> +  /* Set headers and cookies in the handle.
> +   *
> +   * When calling CURLOPT_HTTPHEADER we have to keep the list around
> +   * because unfortunately curl doesn't take a copy.  Since we don't
> +   * know which other threads might be using it, we must make a copy
> +   * of the global list (headers_from_script) per handle
> +   * (h->headers_copy).  For CURLOPT_COOKIE, curl internally takes a
> +   * copy so we don't need to do this.

The comment on memory life cycles is very helpful (and I suspect you 
went through several iterations before figuring out an arrangement that 
works)

> +
> +/* This is called with the lock held when we must run or re-run the
> + * header-script.
> + */
> +static int
> +run_header_script (struct curl_handle *h)
> +{

> +
> +/* This is called with the lock held when we must run or re-run the
> + * cookie-script.
> + */
> +static int
> +run_cookie_script (struct curl_handle *h)
> +{
> +  int fd;

These two look similar, is it worth refactoring into a common helper 
routine?

> +
> +  nbdkit_debug ("cookie-script returned %scookies",
> +                cookies_from_script ? "" : "no ");

This would fail ease-of-translation if we used gettext, but we don't ;)

Overall a pretty slick solution!

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org