OT: autosave of google alert sites?

Nifty Hat Mitch mitch48 at sbcglobal.net
Thu Oct 14 17:07:29 UTC 2004


On Wed, Oct 13, 2004 at 11:24:07PM +0100, James Wilkinson wrote:
> Dave Stevens wrote:
> > I get google news alerts 
> > In an ideal world, I would be able to have a daily program (script?) run 
> > that would examine that day's alerts, resolve the URLs and save the pages. 
> 
> Alan Peery suggested:
> > 1. Use wget to retrieve the google news alert page to a file
> > 2. parse the file with Perl, extracting URLs
> > 3. wget those URLs, putting them into subdirectories based on the day 
> > your script is running
> 
> I don't think stage 2 is necessary: man wget suggests:
>        -i file
> 
> Take a good look at the options in the wget man page, especially the
> examples under --page-requisites. You may need --span-hosts.
> 
> (I must admit that I've never really tried using these options, so
> you'll need to experiment.)

Note that many of the large, interesting sites (Google, Yahoo, eBay,
etc.) have tricks to foil automated slurping of data.

If you are not greedy, most wget tricks work, but if you trigger their
'abuse' meter, things go sideways quickly.  In addition to the
bandwidth abuse, there are issues with copyright. Perhaps not a problem
for your personal use, but do not inadvertently 're-publish'.
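
As a rough illustration of "not greedy", here is a minimal sketch that
assumes the alert URLs have already been saved one per line in a file.
The function name fetch_alerts, the ~/alerts default, and the delay and
rate values are placeholders of mine, not anything tested against these
sites; the wget options themselves are in the man page.

```shell
#!/bin/sh
# Sketch only: fetch a list of URLs slowly into a per-day directory.
# fetch_alerts, the default directory, and the numeric limits are
# illustrative assumptions; tune them for your own use.
fetch_alerts() {
    url_list=$1                          # file with one URL per line
    dest_root=${2:-$HOME/alerts}         # hypothetical default location
    dest="$dest_root/$(date +%Y-%m-%d)"  # subdirectory named for today

    mkdir -p "$dest" || return 1

    # --wait/--random-wait space out the requests, --limit-rate caps
    # bandwidth, and --tries keeps a failing URL from being hammered.
    wget --input-file="$url_list" \
         --directory-prefix="$dest" \
         --page-requisites --span-hosts --convert-links \
         --wait=5 --random-wait --limit-rate=50k \
         --tries=2
}
```

Something like "fetch_alerts urls.txt" from a daily cron job would then
put each day's pages under ~/alerts/YYYY-MM-DD.  Whether a given site
tolerates even this pace is something you will have to find out by
experiment.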

-- 
	T o m  M i t c h e l l 
	May your cup runneth over with goodness and mercy
	and may your buffers never overflow.



