[libvirt-users] concurrent migration of several domains rarely fails

Jim Fehlig jfehlig at suse.com
Fri Dec 7 18:34:45 UTC 2018


On 12/6/18 10:12 AM, Lentes, Bernd wrote:
> 
>> Hi,
>>
>> I have a two-node cluster with several domains as resources. During testing I
>> tried several times to migrate some domains concurrently.
>> Usually it succeeded, but occasionally it failed. I found one clue in the log:
>>
>> Dec 03 16:03:02 ha-idg-1 libvirtd[3252]: 2018-12-03 15:03:02.758+0000: 3252:
>> error : virKeepAliveTimerInternal:143 : internal error: connection closed due
>> to keepalive timeout
>>
>> The domains are configured similarly:
>> primitive vm_geneious VirtualDomain \
>>         params config="/mnt/san/share/config.xml" \
>>         params hypervisor="qemu:///system" \
>>         params migration_transport=ssh \
>>         op start interval=0 timeout=120 trace_ra=1 \
>>         op stop interval=0 timeout=130 trace_ra=1 \
>>         op monitor interval=30 timeout=25 trace_ra=1 \
>>         op migrate_from interval=0 timeout=300 trace_ra=1 \
>>         op migrate_to interval=0 timeout=300 trace_ra=1 \
>>         meta allow-migrate=true target-role=Started is-managed=true \
>>         utilization cpu=2 hv_memory=8000
>>
>> What is the algorithm for choosing the port used for live migration?
>> I have the impression that "params migration_transport=ssh" has no effect;
>> port 22 isn't involved in the live migration.
>> My experience is that TCP ports > 49151 are used for the migration, but the
>> exact procedure isn't clear to me.
>> Does live migration use TCP port 49152 first, and one port higher for each
>> subsequent domain?
>> E.g. for the concurrent live migration of three domains: 49152, 49153 and 49154.
>>
>> Why does live migration of three domains usually succeed, although only
>> 49152 and 49153 are open on both hosts?
>> Is the migration not really concurrent, but sometimes sequential?
>>
>> Bernd
>>
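
Regarding the port question above: as far as I know, "migration_transport=ssh"
only affects the libvirt control connection (a qemu+ssh:// URI); the guest
memory stream travels over a separate plain TCP connection unless the migration
is tunnelled. The QEMU driver takes that port from a configurable range in
/etc/libvirt/qemu.conf, allocating one port per concurrent incoming migration
and releasing it when the migration finishes. A sketch of the relevant
settings, assuming stock defaults:

#migration_port_min = 49152
#migration_port_max = 49215

With those defaults, three genuinely concurrent incoming migrations would use
49152, 49153 and 49154, so opening only 49152 and 49153 would presumably work
only as long as no more than two migrations overlap.
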
> Hi,
> 
> I tried to narrow down the problem.
> My first assumption was that something was wrong with the network between the hosts.
> I opened ports 49152 - 49172 in the firewall - the problem persisted.
> So I deactivated the firewall on both nodes - the problem persisted.
> 
> Then I wanted to rule out the HA cluster software (pacemaker).
> I unmanaged the VirtualDomains in pacemaker and migrated them with virsh - the problem persisted.
> 
> I wrote a script that migrates three domains sequentially from host A to host B and vice versa via virsh.
> I raised the log level of libvirtd and found something in the log which may be the culprit:
> 
> This is the output of my script:
> 
> Thu Dec  6 17:02:53 CET 2018
> migrate sim
> Migration: [100 %]
> Thu Dec  6 17:03:07 CET 2018
> migrate geneious
> Migration: [100 %]
> Thu Dec  6 17:03:16 CET 2018
> migrate mausdb
> Migration: [ 99 %]error: operation failed: migration job: unexpectedly failed    <===== error !
> 
> Thu Dec  6 17:05:32 CET 2018      <======== time of error
> Guests on ha-idg-1: \n
>   Id    Name                           State
> ----------------------------------------------------
>   1     sim                            running
>   2     geneious                       running
>   -     mausdb                         shut off
> 
> migrate to ha-idg-2\n
> Thu Dec  6 17:05:32 CET 2018
> 
> This is what journalctl reported:
> 
> Dec 06 17:05:32 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:32.481+0000: 12553: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740 client=0x55b2bb930d50 countToDeath=0 idle=30
> Dec 06 17:05:32 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:32.481+0000: 12553: error : virKeepAliveTimerInternal:143 : internal error: connection closed due to keepalive timeout
> Dec 06 17:05:32 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:32.481+0000: 12553: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x55b2bb937740
> 
> Dec 06 17:05:27 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:27.476+0000: 12553: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740 client=0x55b2bb930d50 countToDeath=1 idle=25
> Dec 06 17:05:27 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:27.476+0000: 12553: info : virKeepAliveMessage:107 : RPC_KEEPALIVE_SEND: ka=0x55b2bb937740 client=0x55b2bb930d50 prog=1801807216 vers=1 proc=1
> 
> Dec 06 17:05:22 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:22.471+0000: 12553: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740 client=0x55b2bb930d50 countToDeath=2 idle=20
> Dec 06 17:05:22 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:22.471+0000: 12553: info : virKeepAliveMessage:107 : RPC_KEEPALIVE_SEND: ka=0x55b2bb937740 client=0x55b2bb930d50 prog=1801807216 vers=1 proc=1
> 
> Dec 06 17:05:17 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:17.466+0000: 12553: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740 client=0x55b2bb930d50 countToDeath=3 idle=15
> Dec 06 17:05:17 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:17.466+0000: 12553: info : virKeepAliveMessage:107 : RPC_KEEPALIVE_SEND: ka=0x55b2bb937740 client=0x55b2bb930d50 prog=1801807216 vers=1 proc=1
> 
> Dec 06 17:05:12 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:12.460+0000: 12553: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740 client=0x55b2bb930d50 countToDeath=4 idle=10
> Dec 06 17:05:12 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:12.460+0000: 12553: info : virKeepAliveMessage:107 : RPC_KEEPALIVE_SEND: ka=0x55b2bb937740 client=0x55b2bb930d50 prog=1801807216 vers=1 proc=1
> 
> Dec 06 17:05:07 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:07.455+0000: 12553: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740 client=0x55b2bb930d50 countToDeath=5 idle=5
> Dec 06 17:05:07 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:07.455+0000: 12553: info : virKeepAliveMessage:107 : RPC_KEEPALIVE_SEND: ka=0x55b2bb937740 client=0x55b2bb930d50 prog=1801807216 vers=1 proc=1
> 
> There seems to be a kind of countdown. From googling I found that this may be related to libvirtd.conf:
> 
> # Keepalive settings for the admin interface
> #admin_keepalive_interval = 5
> #admin_keepalive_count = 5
> 
> What is meant by the "admin interface"? virsh?

virt-admin, which you can use to change some admin settings of libvirtd, e.g.
log_level. You are interested in the keepalive settings above those in
libvirtd.conf, specifically

#keepalive_interval = 5
#keepalive_count = 5
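
As I read those settings, libvirtd sends a keepalive probe to an idle client
every keepalive_interval seconds and drops the connection after keepalive_count
unanswered probes, i.e. after roughly 5 s * (5 + 1) = 30 s of silence. That
matches the countdown in your log, where countToDeath falls from 5 to 0 while
idle climbs from 5 to 30 just before the "connection closed due to keepalive
timeout" error.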

> What is meant by "client" in libvirtd.conf ? virsh ?

Yes, virsh is a client, as is virt-manager or any application connecting to 
libvirtd.

> Why do I have regular timeouts although my two hosts are very performant? 128 GB RAM, 16 cores and two bonded 1 GBit/s network adapters on each host.
> During migration I don't see much load, and nearly no waiting for I/O.

I'd think concurrently migrating 3 VMs on a 1G network might cause some 
congestion :-).
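
As a rough back-of-envelope, taking the hv_memory=8000 from the resource
definition above and assuming it means 8000 MB: moving 8000 MB over a 1 Gbit/s
link takes at least 8000 * 8 / 1000 = 64 seconds for a single guest, before any
dirtied pages have to be re-sent, and concurrent migrations all share the same
bonded link.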

> Should I set admin_keepalive_interval to -1?

You should try 'keepalive_interval = -1'. You can also avoid sending keepalive 
messages from virsh with the '-k' option, e.g. 'virsh -k 0 migrate ...'.
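
Concretely, the first option would be, in /etc/libvirt/libvirtd.conf on both
hosts (followed by a restart of libvirtd):

keepalive_interval = -1

and the second, as a sketch with the destination URI assumed from your setup:

virsh -k 0 migrate --live mausdb qemu+ssh://ha-idg-2/system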

If this doesn't help, are you in a position to test a newer libvirt, preferably 
master or the recent 4.10.0 release?

Regards,
Jim



