[Linux-cachefs] Improving cachefilesd culling speed

Fri Nov 8 21:12:28 UTC 2013

== Background

In the event of a nearly full disk or the consumption of most of our 
available inodes, cachefiles will instruct the daemon to begin culling 
files in the order of oldest-to-newest in order to free up resources. 
This can be quite slow for any appreciably large cache.

== First-pass improvement

David Howells produced a time ago part of a patch to improve this scheme 
by introducing into cachefiles a series of page-sized bitmaps which 
associated an ordinal index or slot number to each fscache object. In 
this scheme, each cachefiles file representation of a cache object is 
given an xattribute which indicates its slot number in the bitmap.

A "culling index" file is produced which contains (on an x86_64 machine) 
an array of 26 byte data structures which contain exported file handles 
to the cache files. The position of each structure correlates with the 
slot number in the bitmap in kernel memory.

A separate "culling atimes" file is produced which keeps track 
exclusively of the last access time for each file, using a 32byte 
unsigned integer. The position in this array again correlates with the 
slot in the bitmap. On access, these times are updated -- so we avoid 
relying on the filesystem access time mechanisms.

Now, in cachefilesd, we can avoid recursively scanning the file 
structure of the cachefiles folder and instead begin assembling the 
lowest M integers as seen in the culling atimes file. These M atimes can 
represent a candidate set of (slot, atime) pairs to cull if/when 
requested to do so. Once cachefiles enters its "need to cull" mode, the 
cachefilesd service can pluck the top item off the culling queue and 
request slot #i be culled, to which the kernel will happily delete the 
entry in the bitmap, and mark invalid the entries in cull_index and 
cull_atimes, along with the normal culling activities.

== Continuing Issues

The only problem is that this is still kind of slow for large caches. 
For the rotational disks I have access to, 16 million cache entries can 
take 95-120 msec to assemble the 4K oldest entries as culling 
candidates, when the culling index is already warm in the page cache. If 
the culling is to be absolutely correct /and/ fast, reading the entire 
culling index from disk will need to be avoided. For much larger data 
sets, the cost of simply reading the file becomes quite slow on a 
standard, commodity hardware.

On this point, I am looking for suggestions on how to avoid reading the 
/entire/ cull_atimes file for large data sets to assist in accurate 
culling behaviors, OR for suggestions on reasonable approximations to 
the problem to cull arbitrary items that are at least "likely to be 
among the 25% oldest," as an example.

 1. I had considered keeping a third index with some "statistics per
    page" for cull_atimes, with things like averages, minimums or
    maximums, but keeping track of min/max per page might incur some
    large costs if we change a lot of information within a local block
    of the cull_atimes file. Averages would help reading the file in an
    optimal order, but wouldn't avoid us needing to read the whole
    thing. (And there are precision issues...)

 2. Keeping the atimes sorted is probably a non-starter, because
    rewriting a 64 (16 million atimes) to 4GB (1 billion atimes) is not
    an attractive offer.

 3. It might be possible to create a third index that keeps track of
    roughly what "bin" each slot falls into. e.g, if a cache object has
    an atime of somewhere between 0 seconds from cache creation to 4096,
    8192, etc it could be indexed to bin "0", the next slice up "1", etc.
    In this way, we could at least be certain that even though the first
    M numbers we read from disk aren't in order, we know that they are
    the lowest (oldest) M numbers, and can perform some fairly quick
    sorts on them in memory to perform accurate culling.

    The question here, however, becomes: How wide should the bins be?
    What is the format on disk? If an item "moves bins" during normal
    operation, how long does the "update atime for slot #nnn"
    transaction take in the worst and average cases?

I am still relatively new at cache designs, so if anyone has some input 
or feedback on methods to quickly identify objects for culling, I am all 
ears! Feel free to point out articles, books, anything :]

Thanks,
--John Huston,
RH Westford Newbie