[libvirt-users] concurrent migration of several domains rarely fails
Lentes, Bernd
bernd.lentes at helmholtz-muenchen.de
Thu Dec 6 17:12:59 UTC 2018
> Hi,
>
> i have a two-node cluster with several domains as resources. During testing i
> tried several times to migrate some domains concurrently.
> Usually it suceeded, but rarely it failed. I found one clue in the log:
>
> Dec 03 16:03:02 ha-idg-1 libvirtd[3252]: 2018-12-03 15:03:02.758+0000: 3252:
> error : virKeepAliveTimerInternal:143 : internal error: connection closed due
> to keepalive timeout
>
> The domains are configured similar:
> primitive vm_geneious VirtualDomain \
> params config="/mnt/san/share/config.xml" \
> params hypervisor="qemu:///system" \
> params migration_transport=ssh \
> op start interval=0 timeout=120 trace_ra=1 \
> op stop interval=0 timeout=130 trace_ra=1 \
> op monitor interval=30 timeout=25 trace_ra=1 \
> op migrate_from interval=0 timeout=300 trace_ra=1 \
> op migrate_to interval=0 timeout=300 trace_ra=1 \
> meta allow-migrate=true target-role=Started is-managed=true \
> utilization cpu=2 hv_memory=8000
>
> What is the algorithm to discover the port used for live migration ?
> I have the impression that "params migration_transport=ssh" is worthless, port
> 22 isn't involved for live migration.
> My experience is that for the migration tcp ports > 49151 are used. But the
> exact procedure isn't clear for me.
> Does live migration uses first tcp port 49152 and for each following domain one
> port higher ?
> E.g. for the concurrent live migration of three domains 49152, 49153 and 49154.
>
> Why does live migration for three domains usually succeed, although on both
> hosts just 49152 and 49153 is open ?
> Is the migration not really concurrent, but sometimes sequential ?
>
> Bernd
>
Hi,
i tried to narrow down the problem.
My first assumption was that something with the network between the hosts is not ok.
I opened port 49152 - 49172 in the firewall - problem persisted.
So i deactivated the firewall on both nodes - problem persisted.
Then i wanted to exclude the HA-Cluster software (pacemaker).
I unmanaged the VirtualDomains in pacemaker and migrated them with virsh - problem persists.
I wrote a script to migrate three domains sequentially from host A to host B and vice versa via virsh.
I raised up the loglevel from libvirtd and found s.th. in the log which may be the culprit:
This is the output of my script:
Thu Dec 6 17:02:53 CET 2018
migrate sim
Migration: [100 %]
Thu Dec 6 17:03:07 CET 2018
migrate geneious
Migration: [100 %]
Thu Dec 6 17:03:16 CET 2018
migrate mausdb
Migration: [ 99 %]error: operation failed: migration job: unexpectedly failed <===== error !
Thu Dec 6 17:05:32 CET 2018 <======== time of error
Guests on ha-idg-1: \n
Id Name State
----------------------------------------------------
1 sim running
2 geneious running
- mausdb shut off
migrate to ha-idg-2\n
Thu Dec 6 17:05:32 CET 2018
This is what journalctl told:
Dec 06 17:05:32 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:32.481+0000: 12553: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740 client=0x55b2bb930d50 countToDeath=0 idle=30
Dec 06 17:05:32 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:32.481+0000: 12553: error : virKeepAliveTimerInternal:143 : internal error: connection closed due to keepalive timeout
Dec 06 17:05:32 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:32.481+0000: 12553: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x55b2bb937740
Dec 06 17:05:27 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:27.476+0000: 12553: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740 client=0x55b2bb930d50 countToDeath=1 idle=25
Dec 06 17:05:27 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:27.476+0000: 12553: info : virKeepAliveMessage:107 : RPC_KEEPALIVE_SEND: ka=0x55b2bb937740 client=0x55b2bb930d50 prog=1801807216 vers=1 proc=1
Dec 06 17:05:22 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:22.471+0000: 12553: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740 client=0x55b2bb930d50 countToDeath=2 idle=20
Dec 06 17:05:22 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:22.471+0000: 12553: info : virKeepAliveMessage:107 : RPC_KEEPALIVE_SEND: ka=0x55b2bb937740 client=0x55b2bb930d50 prog=1801807216 vers=1 proc=1
Dec 06 17:05:17 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:17.466+0000: 12553: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740 client=0x55b2bb930d50 countToDeath=3 idle=15
Dec 06 17:05:17 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:17.466+0000: 12553: info : virKeepAliveMessage:107 : RPC_KEEPALIVE_SEND: ka=0x55b2bb937740 client=0x55b2bb930d50 prog=1801807216 vers=1 proc=1
Dec 06 17:05:12 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:12.460+0000: 12553: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740 client=0x55b2bb930d50 countToDeath=4 idle=10
Dec 06 17:05:12 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:12.460+0000: 12553: info : virKeepAliveMessage:107 : RPC_KEEPALIVE_SEND: ka=0x55b2bb937740 client=0x55b2bb930d50 prog=1801807216 vers=1 proc=1
Dec 06 17:05:07 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:07.455+0000: 12553: info : virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740 client=0x55b2bb930d50 countToDeath=5 idle=5
Dec 06 17:05:07 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:07.455+0000: 12553: info : virKeepAliveMessage:107 : RPC_KEEPALIVE_SEND: ka=0x55b2bb937740 client=0x55b2bb930d50 prog=1801807216 vers=1 proc=1
There seems to be a kind of a countdown. From googleing i found that this may be related to libvirtd.conf:
# Keepalive settings for the admin interface
#admin_keepalive_interval = 5
#admin_keepalive_count = 5
What is meant by the "admin interface" ? virsh ?
What is meant by "client" in libvirtd.conf ? virsh ? Why do i have regular timeouts although my two hosts are very performant ? 128GB RAM, 16 cores, 2 1GBit/s network adapter on each host in bonding.
During migration i don't see much load, although nearly no waiting for IO.
Should i set admin_keepalive_interval to -1 ?
Bernd
Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDirig.in Petra Steiner-Hoffmann
Stellv.Aufsichtsratsvorsitzender: MinDirig. Dr. Manfred Wolter
Geschaeftsfuehrer: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Heinrich Bassler, Dr. rer. nat. Alfons Enhsen
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671
More information about the libvirt-users
mailing list