rpm hashes

Panu Matilainen pmatilai at laiskiainen.org
Wed May 20 08:19:08 UTC 2009


On Thu, 14 May 2009, Adam Jackson wrote:

> On Thu, 2009-05-14 at 10:46 +0300, Panu Matilainen wrote:
>> On Wed, 13 May 2009, Adam Jackson wrote:
>>> It would have been really, _really_ nice if sha256 was merely another
>>> hash that could be in the payload, instead of forcing you to pick one or
>>> the other.  For that matter, it would still be really really nice.
>>
>> Could it have been done that way? Yes, and if it were just per-package
>> hash then certainly it would've been done that way. But remember this is
>> per-file data, storing two (and when the day comes when sha256 is
>> considered insufficient, three etc) hashes per file adds a non-trivial
>> amount of header bloat.
>
> 32 bytes per file, plus another four for the header tag, unless I have
> my math wildly wrong and/or I'm misremembering how hashes are stored.
> My F11 machine has 430910 files over 2167 packages, so that extra
> metadata comes to a massive 14.8M, compared to 11.6G of actual payload.
> I have trouble getting worked up over this.

People scream BLOAT! for lesser issues. It's data that gets transferred
over the wire(less) over and over again and stored on disk in rpmdb (for
the average desktop/server it's completely irrelevant, but not so for
smaller devices)... and the header data size is (artificially) limited to
16MB. Increasing that limit is possible and will sooner or later be
necessary (people are occasionally hitting it already), but it's another
incompatibility: all the widely deployed versions of rpm will treat a
package with a > 16MB header as corrupted, refusing to read it at all.
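
For scale, the figures quoted above can be rechecked with a few lines of
arithmetic. This is only a sketch: it takes the quoted assumptions at face
value (32 bytes per sha256 digest plus 4 bytes of per-file overhead, 430910
files, 11.6G of payload), treats M and G as binary units, and says nothing
about how rpm actually lays out digests in the header. The function names
are made up for the illustration.

```python
# Figures taken from the quoted message; the storage layout is an assumption.
FILES = 430_910           # files on the quoted F11 machine
PAYLOAD_GIB = 11.6        # quoted payload size, read as GiB
PER_FILE_OVERHEAD = 4     # quoted "another four" bytes per file


def extra_metadata_mib(digest_bytes, files=FILES):
    """Extra header data, in MiB, for one more digest per file."""
    return files * (digest_bytes + PER_FILE_OVERHEAD) / 2**20


def overhead_percent(digest_bytes, files=FILES, payload_gib=PAYLOAD_GIB):
    """Extra header data as a percentage of the payload size."""
    return 100 * extra_metadata_mib(digest_bytes, files) / (payload_gib * 1024)


sha256_mib = extra_metadata_mib(32)   # sha256 digest is 32 bytes
sha512_mib = extra_metadata_mib(64)   # sha512 digest is 64 bytes

print(f"sha256: {sha256_mib:.1f} MiB, {overhead_percent(32):.2f}% of payload")
print(f"sha512: {sha512_mib:.1f} MiB, {overhead_percent(64):.2f}% of payload")
```

Note that these totals are summed over all 2167 packages, whereas the 16MB
limit applies to the header of a single package.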

> The point about having to store arbitrarily many hashes is certainly
> fair, but a) sha512 is only twice as large as sha256, and 0.2% overhead
> is still not a lot, b) that seems like a distro policy question.
>
>> Having the md5 hashes too would've been nice for backwards compatibility
>> but actually using them for file conflict calculations would mean (in
>> addition to the header bloat):
>> - considerable increase in memory use
>
> I just don't buy this at all.  The checksums are computed as part of the
> stdio stream, and any competent implementation of a SHA-like algorithm
> requires storage that's O(n) on the size of the hash, not on the size of
> the file.  So you'd need whatever the overhead is for the additional
> metadata on the package you're currently inspecting, plus no more than a
> page for the additional work area for the second hash.  (I assume here
> that fileconflict checks are done one package at a time, not by loading
> all packages into memory and then checking them for conflicts, since the
> latter would be unusable.)

Well, the assumption is wrong: during file conflict checking, all
file-related data of non-installed packages is kept in memory. The full
headers that are fed into the transaction are discarded to - guess what -
save memory; only the absolutely necessary file data is kept. For
installed packages, rpm can and does fetch them one at a time from rpmdb
as necessary, but for to-be-installed packages rpm no longer has the
header, so it can't go back to it as needed.
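
To make the memory argument concrete, here is a toy model of that situation.
It is not rpm code: FileInfo and find_conflicts are hypothetical names, and
the digests are placeholder strings. The only point is that the per-file
records of every to-be-installed package sit in memory at the same time, so
each additional digest per file grows the footprint of the whole transaction.

```python
from collections import defaultdict
from typing import NamedTuple


class FileInfo(NamedTuple):
    """The minimal per-file data kept once the full header is discarded."""
    package: str
    path: str
    digest: str   # one hex digest per file; a second one would add to this


def find_conflicts(transaction):
    """Return (path, packages) for paths owned with differing digests."""
    by_path = defaultdict(list)
    for info in transaction:          # every record stays in memory
        by_path[info.path].append(info)
    conflicts = []
    for path, owners in by_path.items():
        if len({o.digest for o in owners}) > 1:
            conflicts.append((path, sorted(o.package for o in owners)))
    return conflicts


transaction = [
    FileInfo("foo-1.0", "/usr/share/doc/README", "aaaa"),
    FileInfo("bar-2.0", "/usr/share/doc/README", "bbbb"),  # differs: conflict
    FileInfo("foo-1.0", "/usr/bin/tool", "cccc"),
    FileInfo("baz-0.1", "/usr/bin/tool", "cccc"),          # identical: fine
]
conflicts = find_conflicts(transaction)
print(conflicts)
```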

> Oh, I guess there's also a case where you have to check for
> fileconflicts among multiple packages in the same transaction laying
> down the same files.  Handwave, same problem really.
>
>> - falling back to md5 for conflict resolution would void the supposed
>>    extra security of the better hash
>
> So there's two cases, if rpm would let you carry both hashes.
>
> 1 is where the file on disk has both MD5 and SHA256 sums, and the new
> package has only MD5.  You already trust the package on disk, because
> you already installed it; so compute the SHA256 of the file you're about
> to lay down!  Now you have both hashes, and you can compare them both.
> The odds of defeating this are the odds of finding a payload that
> collides for both MD5 and SHA256, which can't possibly be lower than the
> odds of finding a collision for just SHA256 itself.
>
> 2 is where the file on disk has only MD5, and the package you're about
> to install has both.  If you have an rpm that only understands MD5, then
> whatever, you just ignore the SHA256 hash.  If you have an rpm that
> understands both, then you have options.  If you're being sensible, you
> do the same thing as for case 1, which is to generate the SHA256 of the
> disk file that's implicitly already trusted and compare both sums, and
> presumably you only got to this point because you trust the GPG key that
> signed the package you're about to install, so, good enough.  (There's a
> flaw here if the file on disk is modified.  I could see arguments here
> for any of rpmnew/rpmsave/fileconflict as the "right thing", which I
> leave to someone more detail-oriented than I am.)
>
> If you're in FIPS mode - that is, if you're _not_ being sensible - then
> you fail the transaction, which you ought rightly do anyway since oh no
> the package on disk is only hashed with MD5, you're already in trouble.

3) You're installing two new packages with a common file where one only
has md5 hashes and the other has md5 + a stronger hash. Okay, assume a
"FIPS mode" exists and it's mostly the same as above: either be anal
about it or not.
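
The comparison policy being discussed could be sketched roughly as below.
This is purely hypothetical and not how rpm behaves: stream_digests,
same_content and the strict flag (standing in for the imagined "FIPS mode")
are invented for the example. Digests are computed in one pass over a
stream in fixed-size chunks, and two files are compared on every algorithm
they have in common.

```python
import hashlib
import io

STRENGTH = ["md5", "sha256"]   # weakest to strongest


def stream_digests(fileobj, algos, chunk_size=65536):
    """Compute several digests in one pass, using constant memory."""
    hashers = {name: hashlib.new(name) for name in algos}
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        for hasher in hashers.values():
            hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def same_content(a, b, strict=False):
    """Compare two digest dicts on all algorithms they share.

    With strict=True, refuse to decide when the strongest known
    algorithm is not among the shared ones.
    """
    common = [name for name in STRENGTH if name in a and name in b]
    if not common:
        raise ValueError("no digest algorithm in common")
    if strict and common[-1] != STRENGTH[-1]:
        raise ValueError("strict mode: only a weak digest is shared")
    return all(a[name] == b[name] for name in common)


content = b"#!/bin/sh\necho hello\n"
md5_only = stream_digests(io.BytesIO(content), ["md5"])
both = stream_digests(io.BytesIO(content), ["md5", "sha256"])

print(same_content(md5_only, both))          # falls back to md5 alone
try:
    same_content(md5_only, both, strict=True)
except ValueError as err:
    print("refused:", err)
```

In the non-strict case the decision rests on md5 alone, which is exactly
the weakening described earlier in the thread.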

But back to the existing implementation: sure, it isn't optimal, and sure,
it would be nice if it were backwards compatible all the way to RHEL 2.1
or whatever. It's a trade-off on several fronts: limitations of the
fundamental rpm architecture (inability to calculate the hash from the
payload on demand), efficiency (memory footprint, bandwidth etc),
compatibility (see the point about header size; just stuffing more and
more data there can make things even more incompatible)... and I'm a bit
tired of people assuming no thought whatsoever was given to the way it's
done.

 	- Panu -



