Finding Duplicate Files

Les Mikesell lesmikesell at gmail.com
Fri Mar 14 22:10:09 UTC 2008


Alan wrote:
>>
>> (if that makes sense). rsync --compare-dest and --link-dest : fantastic.
> 
> I wrote a program MANY years back that searches for duplicate files. (I
> had a huge number of files from back in the BBS days that had the same
> file but different names.)
> 
> Here is how I did it. (This was done using Perl 4.0 originally.)
> 
> Recurse through all the directories and build a hash table keyed by
> file size.  Go through the hash table and look for collisions.  (This
> prevents you from doing an MD5 sum on very large files that occur only
> once.)  For each set of collisions, build a hash table of MD5 sums
> (the program now uses SHA-512).  Take any hash collisions and add them
> to a stack, then prompt the user what to do with those entries.
> 
> There is another optimization to the above.  The first hash should
> only take the first 32k or so.  If there are collisions, then hash the
> whole file and check for collisions on those.  This two-pass check
> speeds things up a great deal if you have many large files of the same
> size.  (Multi-part archives, for example.)  Using this method I have
> removed all the duplicate files on a terabyte drive in about 3 hours
> or so.  (Without the above optimization.)
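
For anyone who wants to try the approach Alan describes, here is a rough
sketch in modern Perl (the original was Perl 4).  The 32k prefix and the
size / partial-hash / full-hash passes come from the description above;
the module choices and everything else are just my illustration, not
Alan's actual program:

#!/usr/bin/perl
# Sketch only: group by size, then by a hash of the first 32k,
# then by a full-file SHA-512, and report whatever still collides.
use strict;
use warnings;
use File::Find;
use Digest::SHA;

my $PREFIX = 32 * 1024;    # first pass hashes only the first 32k

# Pass 0: group every file by size; a size that occurs once
# cannot have duplicates, so those files are never read at all.
my %by_size;
find(sub {
    return unless -f $_;
    push @{ $by_size{ (-s _) // 0 } }, $File::Find::name;
}, @ARGV ? @ARGV : '.');

# SHA-512 of the first $len bytes, or of the whole file if $len is 0.
sub digest_of {
    my ($file, $len) = @_;
    my $sha = Digest::SHA->new(512);
    if ($len) {
        open my $fh, '<:raw', $file or return;
        read $fh, my $buf, $len;
        close $fh;
        return $sha->add($buf // '')->hexdigest;
    }
    return eval { $sha->addfile($file, 'b')->hexdigest };
}

for my $size (keys %by_size) {
    my @candidates = @{ $by_size{$size} };
    next if @candidates < 2;

    # Pass 1: cheap partial hash, only for sizes that collide.
    my %by_prefix;
    for my $f (@candidates) {
        my $d = digest_of($f, $PREFIX);
        push @{ $by_prefix{$d} }, $f if defined $d;
    }

    for my $group (grep { @$_ > 1 } values %by_prefix) {
        # Pass 2: full-file hash, only for files that still collide.
        my %by_full;
        for my $f (@$group) {
            my $d = digest_of($f, 0);
            push @{ $by_full{$d} }, $f if defined $d;
        }
        for my $dups (grep { @$_ > 1 } values %by_full) {
            # The original prompts the user here; this just reports.
            print join("\n  ", "Duplicates:", @$dups), "\n";
        }
    }
}

The point of the two passes is that files with a unique size are never
opened, and the expensive full hash only runs on files that still
collide after the cheap 32k check.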

I suppose it is a little late to mention this now, but backuppc
(http://backuppc.sourceforge.net/) does this automatically as it copies
files in, compressing them as well as eliminating the duplication.
If you had used it instead of an ad-hoc set of copies as backups in the
first place, you'd have a web browser view of everything in its original
locations at each backup interval, while taking up less space than one
original copy (depending on the amount of change...).

-- 
   Les Mikesell
    lesmikesell at gmail.com



