2 years later... backups

Krzysztof Halasa khc at pm.waw.pl
Mon Jul 10 20:17:05 UTC 2006


Roberto Ragusa <mail at robertoragusa.it> writes:

> One day I decided to remove some old backups by launching
> an rm command for each snapshot directory in parallel.
> I then realized that there were more than 1000 directories,
> and the total number of files to be deleted was around
> 100 million.
> It took some time, but everything went fine; not a bad
> stress test for the machine (reiserfs/LVM2/nv_sata).
> I had never seen a load average above 1000 until then.
> :-)
>
> There is only one thing I'd like to improve: renamed
> or moved files are seen as new files and are not hardlinked.
> I haven't checked whether "--fuzzy" works for hardlinking too.

I have put my "just another backup solution" at
http://www.kernel.org/~chris/cbackup-1.05.tar.gz
Perhaps someone will find it useful.

I've been using it for about a year without problems. It is
basically disk-based backup, using SHA-1 to avoid storing
duplicates. I'm using a single NFS-exported disk to back up a set
of machines. The output is a single big compressed archive plus a
single index file per backup session. A single "hash list" is kept
and updated for the whole set of backups.

Identical files (on different machines and/or in different places)
take space only once: an SHA-1 check is performed the first time a
file is seen, and after that only mtime/ctime/inode is checked (or
mtime/ctime/name if inode numbers aren't stable).
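
A minimal sketch of that logic in Python (hypothetical, not the
actual cbackup code; the names and the exact metadata key are
illustrative):

    import hashlib, os

    def sha1_of(path, bufsize=1 << 20):
        # Hash file contents in chunks so large files don't fill RAM.
        h = hashlib.sha1()
        with open(path, 'rb') as f:
            while True:
                buf = f.read(bufsize)
                if not buf:
                    break
                h.update(buf)
        return h.hexdigest()

    # seen_meta: metadata key -> SHA-1 computed on an earlier run.
    # stored_hashes: the global "hash list" of contents already archived.
    def needs_storing(path, seen_meta, stored_hashes, ignore_inodes=False):
        st = os.lstat(path)
        key = ((st.st_mtime, st.st_ctime, path) if ignore_inodes
               else (st.st_mtime, st.st_ctime, st.st_ino))
        digest = seen_meta.get(key)
        if digest is None:            # new file or changed metadata:
            digest = sha1_of(path)    # only now pay for a full read
            seen_meta[key] = digest
        if digest in stored_hashes:   # identical content is already in
            return digest, False      # some archive: don't store again
        stored_hashes.add(digest)
        return digest, True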

There is no concept of incremental backup here: every index file
contains the complete list of files (but it references actual data
stored in the current and previous *.arc archives).

The index file is plain text; every record contains type, size,
inode, hash, mode, etc., and the file name (records are terminated
by a NUL byte rather than \n). That makes it possible to use normal
text utilities (comparing different backups, etc.).
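
For instance, parsing and diffing two such indexes takes only a few
lines (a hypothetical Python sketch; the file names are illustrative
and the real field order isn't shown in this mail):

    def read_index(path):
        # Yield one record per NUL-terminated line of an index file.
        with open(path, 'rb') as f:
            for record in f.read().split(b'\0'):
                if record:
                    yield record.decode('utf-8', 'replace')

    old = set(read_index('old_backup.idx'))
    new = set(read_index('new_backup.idx'))
    for rec in sorted(new - old):
        print('added or changed:', rec)

GNU sort and grep also accept -z for NUL-terminated records, so
plain shell pipelines work as well.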

It uses the concept of the SHA-1 hash as an index to the data,
similar to git.

What I usually do is:
# backup -v -v -i hash_list -i last_backup_for_this_machine.idx -a *.arc \
    -oi new_backup_for_this_machine.idx -oa new_backup_etc.arc -oh hash_list.new

While currently all archive files (*.arc) must be accessible
during backup and restore, the tool could be trivially modified to
remove this restriction (i.e., archives could span multiple media
and be processed sequentially while restoring).
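
A sketch of how such sequential processing could work (hypothetical
Python; read_archive stands in for the real *.arc reader): collect
the wanted hashes from the index first, then visit each archive
exactly once.

    import os

    def restore_sequentially(index_records, archive_paths, read_archive):
        # index_records: iterable of (sha1, name) pairs from the index.
        # read_archive(path): yields (sha1, data) pairs from one archive.
        wanted = {}                          # hash -> destination paths
        for sha1, name in index_records:
            wanted.setdefault(sha1, []).append(name)
        # Media can now be mounted one after another instead of all
        # archives having to be online at the same time.
        for arc in archive_paths:
            for sha1, data in read_archive(arc):
                for dest in wanted.pop(sha1, ()):
                    os.makedirs(os.path.dirname(dest) or '.', exist_ok=True)
                    with open(dest, 'wb') as f:
                        f.write(data)
            if not wanted:
                break                        # everything restored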

There is one noticeable restriction: the complete hash list and
the per-file data from the previous backup (plus file names if inode
numbers are ignored) need to be kept in memory for the duration of
a backup session. That means it will use several megabytes of memory
to back up, say, a million files.

Possible parameters:
Usage: backup [options] [--] file...      Back-up
       backup --restore [options]         Restore
       backup --stats [options]           Statistics

options:
   -i index...             Read index from file 'index'
   -a archive...           Read archive from file 'archive'
   -t                      Test file SHA-1 hashes
   -v...                   Verbose

backup options:
   -1                      One filesystem
   -x exclude...           Exclude files and directories
   -oi index               Output index to file 'index'
   -oa archive             Output archive to file 'archive'
   -oh hashes              Output hash list to file 'hashes'
   --ignore-inodes         Ignore inode numbers (for FAT)

restore options:
   -s...                   Strip one directory component
   -n...                   Do not restore
-- 
Krzysztof Halasa



