Duplicated files in the pristine FC4t2 installation

Jindrich Novy jnovy at redhat.com
Tue May 3 15:28:08 UTC 2005


On Mon, 2005-05-02 at 16:25 -0400, Peter Jones wrote:
> On Mon, 2005-05-02 at 12:35 -0700, Roland McGrath wrote:
> > > Roland McGrath wrote:
> > > > I think what one clearly wants is for rpm to maintain an installed file
> > > > index keyed by md5sum.  Then you can have a tool that just uses this
> > > > database to identify duplicates (and doesn't take forever), or have rpm do
> > > > so itself when installing new files.
> > > > 
> > > 
> > > Hmm, what about hash collisions, that would be really really BAD
> > 
> > If you are concerned about them you can still compare contents before
> > declaring two files identical.  But using the hashes as the main detector
> > makes it fast, since you only examine the data of files that are 99.999%
> > likely to be identical.
> 
> And in the vast majority of cases, there's a simpler heuristic you can
> use first: is the basename the same?

The easiest way seems to be to just stat all the files to be compared,
put the info into an array of pointers to the info structures, and sort
the array by size [this automagically groups all the zero-sized files,
which won't be linked and are skipped]. Then just go from top to bottom
in the array and check in depth all the files with equal size, i.e.
compare them byte by byte while the md5sum is being calculated. Since
equality is decided by the contents and not by the hash, md5sum
collisions can't cause a false match. This is how it's done in the slink
utility; the md5sums are printed in the log just FYI and aren't used as
a measure of file equality. The basename heuristic seems less reliable
to me, and more expensive in both calculation time and design.
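
To make the idea concrete, here is a rough sketch in Python. It is not
the slink code; the names find_duplicates, same_content and CHUNK are
made up for illustration, and it leaves out the md5sum logging and the
actual linking. It only shows the grouping by size followed by the
byte-by-byte check.

```python
import os
from collections import defaultdict

CHUNK = 64 * 1024  # read size per step; an arbitrary choice for this sketch


def same_content(path_a, path_b):
    """Return True if both files hold exactly the same bytes."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(CHUNK)
            b = fb.read(CHUNK)
            if a != b:
                return False
            if not a:  # both reads hit EOF at the same time
                return True


def find_duplicates(paths):
    """Group files by size, then confirm equality by comparing contents.

    Returns a list of lists; each inner list holds paths whose contents
    are identical. Zero-sized files are skipped.
    """
    by_size = defaultdict(list)
    for p in paths:
        size = os.stat(p).st_size
        if size == 0:
            continue  # empty files are not candidates for linking
        by_size[size].append(p)

    groups = []
    for size in sorted(by_size):
        # Split the same-size candidates into classes of identical content
        # by comparing each file with one representative per class.
        classes = []
        for p in by_size[size]:
            for cls in classes:
                if same_content(cls[0], p):
                    cls.append(p)
                    break
            else:
                classes.append([p])
        groups.extend(c for c in classes if len(c) > 1)
    return groups
```

Calling find_duplicates() on a list of paths returns only the groups
with two or more identical files. A file whose size is unique is never
opened at all, which is where most of the time is saved.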


Jindrich

-- 
Jindrich Novy <jnovy at redhat.com>, http://people.redhat.com/jnovy/

The worst evil in the world is refusal to think.




More information about the fedora-devel-list mailing list