[libvirt] qemu: migration: a guest that continuously sends and receives ARP packets mistakenly thinks there's another guest with the same IP.

zhang bo oscar.zhangbo at huawei.com
Fri Apr 3 09:45:07 UTC 2015


Problem Description:
When a guest that has a tap device and continuously sends and receives ARP packets is live-migrated, it mistakenly concludes, immediately after migration, that another guest has the same IP.

The steps to reproduce the problem:
1 Define and start a domain with its network configured as:
    <interface type='bridge'>
      <mac address='52:54:00:7d:b0:af'/>
      <source bridge='br0'/>
      <virtualport type='openvswitch'>
        <parameters interfaceid='e4ad3dbb-7808-4175-83ee-ee0cba1c5456'/>
      </virtualport>
      <model type='virtio'/>
      <driver name='vhost' queues='4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
2 The guest sends ARP packets continuously: arping -I ethX xx.xx.xx.xx(self_ip)
3 Meanwhile, the guest also receives ARP packets continuously: tcpdump -i ethX arp host xx.xx.xx.xx(self_ip) -entttt
4 After migration, on the destination side, the guest receives a burst of ARP packets that came from the source-side guest (they were stored while the guest was suspended).
For example:
2015-03-27 16:45:56.695166 52:54:00:7d:b0:af > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: arp who-has 9.61.108.208 (ff:ff:ff:ff:ff:ff) tell 9.61.108.208
2015-03-27 16:45:56.695197 52:54:00:7d:b0:af > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: arp who-has 9.61.108.208 (ff:ff:ff:ff:ff:ff) tell 9.61.108.208
2015-03-27 16:45:56.695205 52:54:00:7d:b0:af > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: arp who-has 9.61.108.208 (ff:ff:ff:ff:ff:ff) tell 9.61.108.208
2015-03-27 16:45:56.695214 52:54:00:7d:b0:af > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: arp who-has 9.61.108.208 (ff:ff:ff:ff:ff:ff) tell 9.61.108.208
2015-03-27 16:45:56.695244 52:54:00:7d:b0:af > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: arp who-has 9.61.108.208 (ff:ff:ff:ff:ff:ff) tell 9.61.108.208
2015-03-27 16:45:56.695256 52:54:00:7d:b0:af > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: arp who-has 9.61.108.208 (ff:ff:ff:ff:ff:ff) tell 9.61.108.208
2015-03-27 16:45:56.695264 52:54:00:7d:b0:af > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: arp who-has 9.61.108.208 (ff:ff:ff:ff:ff:ff) tell 9.61.108.208
2015-03-27 16:45:56.695291 52:54:00:7d:b0:af > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: arp who-has 9.61.108.208 (ff:ff:ff:ff:ff:ff) tell 9.61.108.208
2015-03-27 16:45:56.695324 52:54:00:7d:b0:af > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: arp who-has 9.61.108.208 (ff:ff:ff:ff:ff:ff) tell 9.61.108.208
2015-03-27 16:45:56.695337 52:54:00:7d:b0:af > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: arp who-has 9.61.108.208 (ff:ff:ff:ff:ff:ff) tell 9.61.108.208
2015-03-27 16:45:56.695344 52:54:00:7d:b0:af > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: arp who-has 9.61.108.208 (ff:ff:ff:ff:ff:ff) tell 9.61.108.208
2015-03-27 16:45:56.695364 52:54:00:7d:b0:af > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: arp who-has 9.61.108.208 (ff:ff:ff:ff:ff:ff) tell 9.61.108.208
5 These packets confuse my process: it may conclude that another VM has the same IP as itself.
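
To illustrate why the capture in step 4 looks like an IP conflict but is not one, here is a minimal sketch in Python that classifies tcpdump lines of the format shown above. The function name classify() and the labels it returns are hypothetical, chosen only for this illustration; the sketch assumes the tcpdump -e output format from step 3 and is not part of libvirt or of my process.

```python
import re

# Matches the "tcpdump -e" ARP request lines shown in the capture above.
ARP_LINE = re.compile(
    r"(?P<src>[0-9a-f:]{17}) > (?P<dst>[0-9a-f:]{17}), ethertype ARP \(0x0806\), "
    r"length \d+: arp who-has (?P<target>\S+) .*tell (?P<sender>\S+)"
)


def classify(line, own_mac, own_ip):
    """Classify one captured line (hypothetical helper, for illustration only).

    Returns:
      "not-arp"  - the line does not look like an ARP request
      "other"    - an ARP request that is not a self-probe for own_ip
      "own-echo" - a self-probe for own_ip carrying our own MAC, i.e. a frame
                   the same guest sent earlier (here: from the source side)
      "conflict" - a self-probe for own_ip from a different MAC
    """
    m = ARP_LINE.search(line)
    if m is None:
        return "not-arp"
    if m["target"] != own_ip or m["sender"] != own_ip:
        return "other"
    if m["src"] == own_mac:
        return "own-echo"
    return "conflict"


sample = (
    "2015-03-27 16:45:56.695166 52:54:00:7d:b0:af > ff:ff:ff:ff:ff:ff, "
    "ethertype ARP (0x0806), length 60: arp who-has 9.61.108.208 "
    "(ff:ff:ff:ff:ff:ff) tell 9.61.108.208"
)
result = classify(sample, "52:54:00:7d:b0:af", "9.61.108.208")
```

The source MAC in every captured line is the guest's own MAC (52:54:00:7d:b0:af), so the frames are the guest's own probes from before the migration, not traffic from a second machine. A process that only compares IP addresses cannot tell the difference.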

Reasons for the problem:
The tap device is brought up as soon as it is created (in virNetDevTapCreateInBridgePort), before the cpus are un-paused.
So it keeps receiving data before the guest starts to run; please note that this data is sent from the source side.
As soon as the guest starts running, it parses the data stored earlier and thinks it came from another guest with the same IP, when in fact it came from the same guest on the source side.
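
The sequence described above can be sketched with a toy model in Python. This is only an illustration of the ordering problem, not libvirt or QEMU code: the class TapModel, its methods, and the frame names are all hypothetical, and the model ignores the bridge entirely.

```python
from collections import deque


class TapModel:
    """Toy model of a tap device in front of a paused guest (not libvirt code)."""

    def __init__(self):
        self.online = False          # has the tap been brought up?
        self.guest_running = False   # have the vcpus been un-paused?
        self.queue = deque()         # frames stored while the guest is paused
        self.delivered = []          # frames the guest has actually seen

    def set_online(self):
        self.online = True

    def receive(self, frame):
        if not self.online:
            return                   # device is down: the frame is dropped
        if self.guest_running:
            self.delivered.append(frame)
        else:
            self.queue.append(frame)  # stored until the guest runs

    def resume_guest(self):
        self.guest_running = True
        while self.queue:            # the guest now parses the stored frames
            self.delivered.append(self.queue.popleft())


def run(online_before_traffic):
    tap = TapModel()
    if online_before_traffic:
        tap.set_online()             # tap is up as soon as it is created
    for i in range(3):               # source-side guest is still sending ARP
        tap.receive(f"arp-from-source-{i}")
    tap.set_online()                 # no-op if the tap was already up
    tap.resume_guest()
    return tap.delivered


early = run(online_before_traffic=True)
late = run(online_before_traffic=False)
```

In this model, `early` contains the three frames sent by the source side while the destination guest was still paused, and `late` is empty because the frames arrived while the device was down.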




There was a patch, "network: Bring netdevs online later", which moved the virNetDevSetOnline() call for the network device
to just before the VM's vcpus are started. But Laine Stump's reply said: "It turns out, though, that regular tap
devices which will be connected to a bridge should be ifup'ed and attached to the bridge as soon as possible,
so that the forwarding delay timer of the bridge can start to count down."

I agree with Laine Stump that bringing the tap device up right before running the vcpus is not a perfect solution.
So, here comes the question:
what can we do to ensure that our guests do not receive their own ARP packets from the source side during migration?



