OT: autosave of google alert sites?

James Wilkinson james at westexe.demon.co.uk
Wed Oct 13 22:24:07 UTC 2004


Dave Stevens wrote:
> I get google news alerts 
> In an ideal world, I would be able to have a daily program (script?) run 
> that would examine that day's alerts, resolve the URLs and save the pages. 

Alan Peery suggested:
> 1. Use wget to retrieve the google news alert page to a file
> 2. parse the file with Perl, extracting the URLs
> 3. wget those URLs, putting them into subdirectories based on the day 
> your script is running
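Spelled out literally, those three steps might look something like
this (untested; the alert address and file names are placeholders,
and plain grep stands in for the Perl step):

    #!/bin/sh
    # Placeholder address -- substitute your real alert page
    ALERT_URL='http://example.com/your-alert-page'
    DIR=$(date +%Y-%m-%d)        # one subdirectory per day
    mkdir -p "$DIR"

    # 1. fetch the alert page
    wget -O alert.html "$ALERT_URL"

    # 2. extract the URLs (grep standing in for Perl)
    grep -o 'http://[^"<> ]*' alert.html > urls.txt

    # 3. fetch each URL into today's directory
    wget -P "$DIR" -i urls.txt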

I don't think stage 2 is necessary: man wget suggests:
       -i file
       --input-file=file
           Read URLs from file, in which case no URLs need to be on the
           command line.  If there are URLs both on the command line and
           in an input file, those on the command line will be the first
           ones to be retrieved.  The file need not be an HTML document
           (but no harm if it is)---it is enough if the URLs are just
           listed sequentially.
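So stages 1 and 2 collapse into two wget calls (untested; note -F,
i.e. --force-html, which tells wget to treat the input file as HTML
and pull the links out itself):

    # fetch the alert page, then feed it straight back to wget
    wget -O alert.html 'http://example.com/your-alert-page'
    wget -F -i alert.html
    # (relative links in the file may also need --base=URL)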

But then I'm not sure stage 3 is necessary, either: wget supports
recursive retrieval:
       -r
       --recursive
           Turn on recursive retrieving.
 
       -l depth
       --level=depth
           Specify recursion maximum depth level depth.  The default maximum
           depth is 5.
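So if the alert digest itself is reachable at a URL, one recursive
call might do the whole job, per-day directory included (the address
is hypothetical, and -l 1 stops wget wandering past the linked
articles):

    # fetch the alert page plus everything it links to, one level deep
    wget -r -l 1 -P "$(date +%Y-%m-%d)" 'http://example.com/todays-alert'

Note that by default -r won't follow links onto other hosts, though.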

Take a good look at the options in the wget man page, especially the
examples under --page-requisites. You may need --span-hosts.
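Untested, but putting those together might look something like this
(alert.html is the placeholder from the sketch above):

    # -p pulls in each article's images and stylesheets; -H lets those
    # requisites come from other hosts; -k rewrites the links so the
    # saved pages work locally
    wget -F -i alert.html -p -H -k -P "$(date +%Y-%m-%d)"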

(I must admit that I've never really tried using these options, so
you'll need to experiment.)

James.

-- 
E-mail address: james | A: Because people don't normally read bottom to top.
@westexe.demon.co.uk  | Q: Why is top-posting such a bad thing?
                      | A: Top-posting.
                      | Q: What is the most annoying thing in e-mail?



