virsh vol-download uses a lot of memory

Michal Privoznik mprivozn at redhat.com
Thu Jan 23 09:22:33 UTC 2020


On 1/22/20 1:18 PM, Daniel P. Berrangé wrote:
> On Wed, Jan 22, 2020 at 01:01:42PM +0100, Michal Privoznik wrote:
>> On 1/22/20 11:11 AM, Michal Privoznik wrote:
>>> On 1/22/20 10:03 AM, R. Diez wrote:
>>>> Hi all:
>>>>
>>>> I am using the libvirt version that comes with Ubuntu 18.04.3 LTS.
>>>
>>> I'm sorry, I don't have Ubuntu installed anywhere to look the version
>>> up. Can you run 'virsh version' to find it out for me, please?
>>
>> Nevermind, I've managed to reproduce with the latest libvirt anyway.
>>
>>>
>>>>
>>>> I have written a script that backs up my virtual machines every
>>>> night. I want to limit the amount of memory that this backup
>>>> operation consumes, mainly to prevent page cache thrashing. I have
>>>> described the Linux page cache thrashing issue in detail here:
>>>>
>>>> http://rdiez.shoutwiki.com/wiki/Today%27s_Operating_Systems_are_still_incredibly_brittle#The_Linux_Filesystem_Cache_is_Braindead
>>>>
>>>>
>>>> The VM's virtual disk is 140 GB at the moment. I thought 500 MiB
>>>> of RAM should be more than enough to back it up, so I added the
>>>> following option to the systemd service file associated with the
>>>> systemd timer I am using:
>>>>
>>>>     MemoryLimit=500M
>>>>
>>>> However, the OOM killer is killing "virsh vol-download":
>>>>
>>>> Jan 21 23:40:00 GS-CEL-L kernel: [55535.913525] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
>>>> Jan 21 23:40:00 GS-CEL-L kernel: [55535.913527] [  13232]  1000 13232     5030      786    77824      103             0 BackupWindows10
>>>> Jan 21 23:40:00 GS-CEL-L kernel: [55535.913528] [  13267]  1000 13267     5063      567    73728      132             0 BackupWindows10
>>>> Jan 21 23:40:00 GS-CEL-L kernel: [55535.913529] [  13421]  1000 13421     5063      458    73728      132             0 BackupWindows10
>>>> Jan 21 23:40:00 GS-CEL-L kernel: [55535.913530] [  13428]  1000 13428   712847   124686  5586944   523997             0 virsh
>>>> Jan 21 23:40:00 GS-CEL-L kernel: [55535.913532] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/system.slice/VmBackup.service,task_memcg=/system.slice/VmBackup.service,task=virsh,pid=13428,uid=1000
>>>>
>>>> Jan 21 23:40:00 GS-CEL-L kernel: [55535.913538] Memory cgroup out of memory: Killed process 13428 (virsh) total-vm:2851388kB, anon-rss:486180kB, file-rss:12564kB, shmem-rss:0kB
>>>>
>>>> I wonder why "virsh vol-download" needs so much RAM. It does not
>>>> get killed straight away; it takes a few minutes. It starts with a
>>>> VmSize of around 295 MiB, which is not exactly frugal for a file
>>>> download operation, but then it grows and grows.
>>>
>>> This is very likely a memory leak somewhere.
>>
>> Actually, it is not. It's caused by the design of our client event
>> loop: whenever there is incoming data, the client reads as much as
>> possible, appending it to the end of a linked list of incoming stream
>> packets (a stream is the mechanism libvirt uses to transfer binary
>> data). The problem is that, instead of making our malloc()s return
>> NULL once the limit is reached, the kernel decides to kill us.
>>
>> For anybody with libvirt insight: virNetClientIOHandleInput() ->
>> virNetClientCallDispatch() -> virNetClientCallDispatchStream() ->
>> virNetClientStreamQueuePacket().
>>
>>
>> The obvious fix would be to stop processing incoming packets once a
>> stream has "too much" data cached (for some definition of "too much").
>> But this may leave the client event loop unresponsive - if the client
>> doesn't pull data from the incoming stream fast enough, it won't be
>> able to make any other RPC calls.
> 
> IMHO if they're not pulling stream data and still expecting to make
> other RPC calls in a timely manner, then their code is broken.

This is virsh we are talking about, not some random application.

And I am able to limit virsh memory usage to "just" 100 MiB with one
well-placed usleep() - to slow down putting incoming stream packets
onto the queue:

diff --git i/src/rpc/virnetclientstream.c w/src/rpc/virnetclientstream.c
index f904eaba31..cfb3f225f2 100644
--- i/src/rpc/virnetclientstream.c
+++ w/src/rpc/virnetclientstream.c
@@ -358,6 +358,7 @@ int virNetClientStreamQueuePacket(virNetClientStreamPtr st,
     virNetClientStreamEventTimerUpdate(st);
 
     virObjectUnlock(st);
+    usleep(1000);
     return 0;
 }


But every attempt I've made to ignore POLLIN when the stream queue is
longer than, say, 8 packets was unsuccessful (the code still read
incoming packets and placed them onto the queue). I blame the
buck-passing algorithm in the client IO loop for that (rather than my
poor skills :-P).
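
For the record, the shape of what I was trying looks roughly like this
(a simplified, hypothetical sketch - the struct and the helper are made
up for illustration; only virEventUpdateHandle() and the
VIR_EVENT_HANDLE_* flags are real libvirt API):

#include <libvirt/libvirt.h>    /* virEventUpdateHandle(), event flags */

#define STREAM_QUEUE_MAX 8      /* the "too much" threshold */

/* Made-up client state, just enough to show the shape of the idea. */
typedef struct {
    int watch;                  /* handle from virEventAddHandle() */
    size_t streamQueueLen;      /* stream packets currently queued */
} demoClient;

/* Recompute which events we care about, after queueing or draining a
 * packet. Once the queue is "full" we stop watching for POLLIN, so the
 * kernel's TCP flow control pushes back on the server, instead of us
 * malloc()ing until the cgroup limit kills the process. */
static void
demoClientUpdateEvents(demoClient *client)
{
    int events = 0;

    if (client->streamQueueLen < STREAM_QUEUE_MAX)
        events |= VIR_EVENT_HANDLE_READABLE;

    virEventUpdateHandle(client->watch, events);
}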

> 
> Having said that, in retrospect I rather regret ever implementing our
> stream APIs as we did. We really should have just exposed an API which
> lets you spawn an NBD server associated with a storage volume, or
> tunnelled NBD over libvirtd. The former is probably our best strategy
> these days, now that NBD has native TLS support.

Yeah, but IIRC NBD wasn't a thing back then, was it?
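
These days the NBD route might look roughly like this (an untested
sketch with made-up paths, driving qemu-nbd directly rather than any
libvirt API):

  # on the host: export the volume read-only over a UNIX socket
  qemu-nbd --read-only --persistent --format=qcow2 \
      --socket=/tmp/backup.sock /var/lib/libvirt/images/win10.qcow2

  # on the backup side: pull the data at the reader's own pace, so
  # memory usage stays bounded
  qemu-img convert -O qcow2 \
      'nbd+unix:///?socket=/tmp/backup.sock' backup.qcow2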

Michal



