[Vdo-devel] VDO Optimisation [Window | Tuning | Online Re-Dedup | General Questions]

Jordan jordan at astris.io
Thu Sep 5 09:43:14 UTC 2019


Hello again,


Apologies, I made a few mistakes in my previous email.

* While each of the 100 22G file-sets grows by about 100M a day, that growth
should be almost entirely duplicate data. There may be some local metadata
differences between the processes using each set of files, but that head is
at most 2.5M different per process, and so at most 250M different across the
machine at any given time. If the data were all unique the growth would be
10G a day; since it is mostly identical I'd expect around 80% of it to be
duplicate, so about 2G of total growth, and that would shrink further once
each data-set is consolidated by the process using it (rough arithmetic
spelled out below this list).
* I am still testing new parameters, which is what I am seeking guidance on,
along with any other ways to optimise this setup and stop the creep in
physical storage consumption.
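
To spell out the arithmetic behind the first point (the 80% figure is just my
assumption about how much of the new data deduplicates across the copies):

    100 copies x ~100M/day of new head data = ~10G/day written logically
    assume ~80% of that is duplicate across copies
    10G x 0.20                              = ~2G/day expected physical growth

which is still well below the several G of daily growth described in the
quoted email below.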

Regards,

Jordan
*Founder, astris.io <http://astris.io>*

*ABN: 24989321983*


On Thu, 5 Sep 2019 at 17:31, Jordan <jordan at astris.io> wrote:

> Hi everyone,
>
>
> I'm relatively new to VDO; I've been using it on some machines for about 2
> months now and I'm liking the results so far. The setups I have it running
> on use the following:
>
> CentOS 7.6.1810
> 3.10.0-957.21.3.el7.x86_64
>
> CPU: 4 cores, 8 threads
> Memory: 64 GB
> Physical devices: NVMe storage, 1 TB total
> Filesystem stack: Device -> LVM -> VDO -> XFS
>
> Here's an example of `vdostats --human-readable`
>
> Device                    Size      Used Available Use% Space saving%
> /dev/mapper/vdo0        663.4G    489.5G    173.9G  73%           78%
>
> With a `df -h` for logical size
>
> Filesystem        Size  Used Avail Use% Mounted on
> /dev/nvme0n1p2     30G   11G   17G  40% /
> devtmpfs           32G     0   32G   0% /dev
> tmpfs              32G     0   32G   0% /dev/shm
> tmpfs              32G  896K   32G   1% /run
> tmpfs              32G     0   32G   0% /sys/fs/cgroup
> /dev/nvme0n1p1    488M  269M  195M  58% /boot
> /dev/mapper/vdo0   10T  2.1T  8.0T  21% /home
> tmpfs             6.3G     0  6.3G   0% /run/user/1200
>
> What's bothering me so far is that this data set is 100 copies of 22G, so it
> should be roughly 2.2T logically, which it is, but VDO is using 489.5G of
> physical space for it. In the original setup, which these pasted numbers come
> from, I was using `dense` indexing with `0.25` indexMem, so my deduplication
> window was only 250G and therefore missed the tail ends of the older data,
> correct?
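>
> For reference, the volume behind those numbers was created along these lines
> (the device path here is illustrative rather than my exact command):
>
>     vdo create --name=vdo0 --device=/dev/mapper/vg_data-lv_vdo \
>                --vdoLogicalSize=10T --indexMem=0.25 --sparseIndex=disabled
>
> i.e. the default dense index with 0.25G of RAM.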
>
> This data is almost entirely Berkeley DB and LevelDB files, and new data is
> only written at the head; in the terms of your presentation here:
> https://www.youtube.com/watch?v=7CGr5LEAfRY this data set is extremely
> 'tight' in temporal locality.
>
> Running fstrim on the `/home` mount will clear up about 100G of physical
> space, but VDO quickly consumes its way back up to its pre-fstrim size and
> then continues ticking along from there.
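>
> (For clarity, by "fstrim" I just mean the standard util-linux tooling, run
> roughly like this:
>
>     # one-off trim of the XFS filesystem sitting on top of VDO
>     fstrim -v /home
>
>     # or on a schedule, where the fstrim.timer unit is available
>     systemctl enable fstrim.timer
>     systemctl start fstrim.timer
>
> I assume mounting with the XFS `discard` option would hand freed blocks back
> to VDO continuously instead, but that would not address the creep itself.)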
>
> Finally, I know each copy of this data-set grows at a rate of about 100M a
> day, so I should be seeing something like a 100M (plus UDS index growth)
> increase per day, yet I see several G; and instead of roughly 220G of
> physical space usage I see 489.5G.
>
> I adjusted my settings and did a test run on the same setup, identical in
> every component except the VDO configuration, this time using `sparse`
> indexing with 0.5G of indexMem for a 5 TB window, so as to encompass the
> entire 2.1T of logical space. Physical usage started at 73G, that is the 50G
> base UDS index plus the 22G physical size of the data, and grew to 131.9G
> overnight. I nuked that setup and am now testing with new parameters.
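>
> That test volume was created roughly as follows (device path, name and
> logical size shown are illustrative):
>
>     vdo create --name=vdo1 --device=/dev/mapper/vg_data-lv_vdotest \
>                --vdoLogicalSize=10T --indexMem=0.5 --sparseIndex=enabled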
>
> My goal here is to get VDO to keep the physical usage low, around the level
> it should be at: roughly 220-250G. If I were to nuke a machine now and
> recreate it with VDO, it would function exactly as expected at roughly 220G
> of physical space usage, but that would creep up again, so I know for a fact
> the extra space is unneeded data. As per the YouTube presentation, you state
> there are no jobs that go back in time and squeeze out missed deduplication
> (which would be handy, as my system does have enough idle time), and I cannot
> find a way to force a rebuild of the UDS index, or anything similar, while
> staying online.
>
> So, I am not sure what to try next, or whether you have any advice for
> keeping physical usage low. Even 220G is on the high end, since these 100 22G
> copies are of the same files, cryptographically guaranteed to be identical
> (blockchain data), which is why this usage creep is concerning me.
>
> Sorry for the very long email; I am still new and wanted to be thorough.
> I've attached `vdo status` and `vdostats --verbose` output for the existing
> systems (with the creep), as well as the current test parameters I am trying
> now, to keep this wall of text to a minimum.
>
>
> Regards,
>
> Jordan
> *Founder, astris.io <http://astris.io>*
>
> *ABN: 24989321983*
>