Finding Duplicate Files
Les Mikesell
lesmikesell at gmail.com
Fri Mar 14 22:10:09 UTC 2008
Alan wrote:
>>
>> (if that makes sense). rsync --compare-dest and --link-dest : fantastic.
>
> I wrote a program MANY years back that searches for duplicate files. (I
> had a huge number of files from back in the BBS days that had the same
> file but different names.)
>
> Here is how I did it. (This was done using Perl 4.0 originally.)
>
> Recurse through all the directories and build a hash of the file sizes.
> Go through the hash table and look for collisions. (This prevents you
> from doing an MD5SUM on very large files that occur once.) For each set
> of collisions, build a hash table of MD5SUMS (the program now uses
> SHA512). Take any hash collisions and add them to a stack. Prompt the
> user what to do with those entries.
>
> There is also another optimization to the above. The first hash should
> only take the first 32k or so. If there are collisions, then hash the
> whole file and check for collisions on those. This two pass check speeds
> things up by a great deal if you have many large files of the same size.
> (Multi-part archives, for example.) Using this method I have removed all
> the duplicate files on a terabyte drive in about 3 hours or so. (Without
> the above optimization.)
I suppose it is a little late to mention this now, but backuppc
(http://backuppc.sourceforge.net/) does this automatically: it
eliminates duplicates as it copies files in, and compresses them as
well. If you had used it instead of an ad-hoc set of copies as your
backups in the first place, you'd have a web-browser view of everything
in its original locations at each backup interval, while taking up less
space than one original copy (depending on the amount of change...).
--
Les Mikesell
lesmikesell at gmail.com
More information about the fedora-list mailing list