[libvirt] [RFC] Faster libvirtd restart with nwfilter rules

Nikolay Shirokovskiy nshirokovskiy at virtuozzo.com
Thu Oct 11 11:37:50 UTC 2018



On 11.10.2018 13:20, Daniel P. Berrangé wrote:
> On Mon, Sep 24, 2018 at 10:41:37AM +0300, Nikolay Shirokovskiy wrote:
>> Hi, all.
>>
>>   On fat hosts capable of running hundreds of VMs, restarting libvirtd
>> makes its services unavailable for a long time if the VMs use network
>> filters. In my tests each of 100 VMs has no-promisc [1] and
>> no-mac-spoofing filters, and executing virsh list right after daemon
>> restart takes approximately 140s if no firewalld is running (that is,
>> ebtables/iptables/ip6tables commands are used to configure kernel tables).
> 
> Yep, this is not entirely surprising given the huge number of rules we try
> to create.
> 
>>   The problem is that the daemon does not even start to read from client
>> connections because the state drivers are not initialized. Initialization
>> is blocked in the state drivers' autostart, which grabs the VM locks, and
>> the VM locks are held by the VM reconnection code. Each VM reloads its
>> network tables on reconnection, and this reloading is serialized on
>> updateMutex in the gentech nwfilter driver. Working around autostart won't
>> help much: even if the state drivers initialize, listing VMs still won't
>> be possible because listing takes each VM lock one by one too. Managing a
>> VM that has passed the reconnection phase will be possible, but reaching
>> that point takes the same 140s in the worst case.
> 
> In the QEMU driver we call qemuProcessReconnectAll(). This spawns one
> thread per VM that needs connecting to.
> 
> So AFAICT, state driver initialization should not be blocked. Libvirtd
> should accept commands, but any APIs that touch a virDomainObj will of
> course still be blocked until reconnect is completed.
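
A sketch of that pattern for reference (not libvirt's actual code; the vm
struct and the function names are invented for illustration):

  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* Hypothetical per-VM state; the real virDomainObj is far richer. */
  struct vm {
      pthread_mutex_t lock;
      const char *name;
  };

  /* One worker per VM: take the VM lock, reconnect and re-apply the nwfilter
   * rules; only APIs touching this particular VM block in the meantime. */
  static void *reconnect_vm(void *opaque)
  {
      struct vm *vm = opaque;

      pthread_mutex_lock(&vm->lock);
      /* ... re-probe QEMU, rebuild the network filter rules for this VM ... */
      printf("reconnected %s\n", vm->name);
      pthread_mutex_unlock(&vm->lock);
      return NULL;
  }

  /* Spawn one thread per running VM instead of reconnecting serially. */
  static void reconnect_all(struct vm *vms, size_t nvms)
  {
      pthread_t *tids = calloc(nvms, sizeof(*tids));

      for (size_t i = 0; i < nvms; i++)
          pthread_create(&tids[i], NULL, reconnect_vm, &vms[i]);
      for (size_t i = 0; i < nvms; i++)
          pthread_join(tids[i], NULL);
      free(tids);
  }
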
> 
>>   Note that this issue only applies to filter configurations that don't
>> need IP learning. With IP learning the situation is different: the
>> reconnection code spawns a new thread that applies the network rules only
>> after the IP is learned from traffic, and this thread does not grab the VM
>> lock. As a result the VMs are manageable, but reloading the filters in the
>> background takes approximately the same 140s. I guess managing network
>> filters during this period can have issues too.
> 
> I believe it should be possible to still change network filters while this
> reconnect is taking place - it would mean that any VMs affected by the filter
> change would have their iptables rules re-built a second time. However... 
> 
>> Anyway this situation does not look good, so fixing the described issue by
>> spawning threads even without IP learning does not look like a good
>> solution to me.
> 
> this is tricky - we want to know filters are built before we allow the VM
> to start running, so we don't want to spawn a background thread in general.
> IP learning has special code that sets up a minimal safe ruleset before
> spawning the thread and waiting to do the real ruleset creation based on
> the learned IP address.  So I don't think it is desirable to change all
> rule creation to run in a background thread.
> 
>>   What speedup is possible with a conservative approach? First, for test
>> purposes, we can remove the firewall ruleLock, the gentech driver
>> updateMutex and the filter object mutex, which serve no function in the
>> restart scenario. This gives a 36s restart time. The speedup is achieved
>> because the heavy fork/preexec steps now run concurrently.
> 
> To me the most obvious speedup is not to run any commands at all.
> 
> 99% of the time when we restart libvirtd and re-build network filters
> we are achieving nothing useful. We tear down the rules and replace
> them by exactly the same rules.
> 
> The idea behind rebuilding at startup is that someone might have changed
> the config on disk while libvirtd was not running, and we want to ensure
> that the VMs have the correct live config after this.
> 
> Alternatively libvirtd has been upgraded to a newer version, and we want
> to rebuild with the new code in case old code had a bug causing it to
> create incorrect rules.
> 
> I think we should look at how to optimize this to be more intelligent.
> 
> We could somehow record a hash of the rule contents that was used
> originally. Only rebuild the rules if we see that the hash of the nwfilter
> rules has changed, or if the libvirtd binary has had code changes.
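
For illustration, the shape of such a check could be something like the
sketch below; the hash, the state file path and the build identifier are
all placeholders rather than a proposed libvirt interface:

  #include <stdbool.h>
  #include <stdio.h>

  #define STATE_FILE "/var/run/libvirt/nwfilter.hash"   /* invented path */

  /* djb2 over the concatenated filter XML plus a libvirtd build identifier;
   * a real implementation would want something stronger. */
  static unsigned long hash_state(const char *filter_xml, const char *build_id)
  {
      unsigned long h = 5381;

      for (const char *p = filter_xml; *p; p++)
          h = h * 33 + (unsigned char)*p;
      for (const char *p = build_id; *p; p++)
          h = h * 33 + (unsigned char)*p;
      return h;
  }

  /* Rebuild the rules only when the recorded hash differs from the current
   * one, i.e. the filter definitions or the libvirtd code actually changed. */
  static bool need_rebuild(const char *filter_xml, const char *build_id)
  {
      unsigned long old = 0, cur = hash_state(filter_xml, build_id);
      FILE *fp = fopen(STATE_FILE, "r");

      if (!fp || fscanf(fp, "%lu", &old) != 1)
          old = 0;                /* no record: assume the rules may be stale */
      if (fp)
          fclose(fp);

      if (old == cur)
          return false;           /* same filters, same code: skip the rebuild */

      if ((fp = fopen(STATE_FILE, "w"))) {
          fprintf(fp, "%lu\n", cur);
          fclose(fp);
      }
      return true;
  }
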
> 
>> Next we can try to reduce the fork/preexec time. To estimate its
>> contribution alone, let's bring back the above locks. It turns out that
>> most of the time is taken by fork itself and by closing 8k (on my system)
>> file descriptors in preexec. Using vfork gives a 2x boost, and so does
>> dropping the mass close. (I checked the mass close contribution because I
>> don't quite understand the purpose of this step - libvirt typically sets
>> the close-on-exec flag on its descriptors). So these two optimizations
>> alone can bring the restart time down to 30s.
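
(For reference, the preexec step being timed here is essentially the classic
close-everything loop between fork() and execve(); with the descriptor limit
at 8k that is ~8k close() calls for every spawned ebtables/iptables process.
Simplified sketch, not the actual virCommand code:)

  #include <unistd.h>

  /* Close every inherited descriptor except stdin/stdout/stderr before
   * exec'ing the ebtables/iptables binary. */
  static void mass_close(void)
  {
      long max = sysconf(_SC_OPEN_MAX);

      for (long fd = 3; fd < max; fd++)
          close(fd);
  }
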
> 
> I don't like the idea of using vfork(). It is already hard to do stuff between
> fork() and execve() safely as you're restricted to async signal safe functions.
> vfork() places an even greater number of restrictions on code that can be
> run. It is especially dangerous in threaded programs as only the calling thread
> is suspended.
> 
> Unfortunately we cannot rely on close-on-exec. Depending on OS and/or
> version, the close-on-exec setting may or may not be atomic. e.g.
> 
>   open(...., O_CLOEXEC)
> 
> vs
> 
>   fd = open(...)
>   fcntl(fd, F_SETFD, FD_CLOEXEC)
> 
> the latter has a race we'd hit.
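
Spelling the race out (sketch): between the open() and the fcntl() another
thread can fork() and execve(), and the child then inherits a descriptor
that was meant to be close-on-exec.

  #include <fcntl.h>
  #include <unistd.h>

  int open_atomic(const char *path)
  {
      /* Atomic: the descriptor is never visible without the flag set. */
      return open(path, O_RDONLY | O_CLOEXEC);
  }

  int open_racy(const char *path)
  {
      int fd = open(path, O_RDONLY);

      /* <-- if another thread fork()+execve()s right here, the child
       *     inherits fd even though we wanted it close-on-exec. */
      if (fd >= 0)
          fcntl(fd, F_SETFD, FD_CLOEXEC);
      return fd;
  }
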
> 
> In addition libvirt links to a huge number of 3rd party libraries and I have
> essentially zero confidence in them setting O_CLOEXEC correctly all the time.
> 
> IOW, the mass close is inescapable IMHO.
> 
> The only thing I could see us doing is to spawn some minimalist helper process
> which then spawns iptables as needed. We'd only need the mass-close once for
> the helper process, which can then just spawn iptables without mass-close,
> or simply set a tiny max files ulimit for this helper process so that it
> only 'mass close' 20 FDs.
> 
> We'd have to write the set of desired iptables rules into some data format,
> send it to the helper to execute and wait for results. Probably not too
> difficult to achieve.
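
A sketch of that ulimit trick; the helper binary name is made up, and the
point is only that the expensive mass close is paid once while a small
RLIMIT_NOFILE keeps the helper's own later spawns cheap:

  #include <sys/resource.h>
  #include <sys/types.h>
  #include <unistd.h>

  /* Spawn the rule-applying helper once: do the full mass close a single
   * time here, then shrink RLIMIT_NOFILE so that every iptables the helper
   * later spawns has only ~20 descriptors to walk in its close loop. */
  static void spawn_rule_helper(void)
  {
      pid_t pid = fork();

      if (pid == 0) {
          long max = sysconf(_SC_OPEN_MAX);
          struct rlimit lim = { .rlim_cur = 20, .rlim_max = 20 };

          for (long fd = 3; fd < max; fd++)   /* the one expensive mass close */
              close(fd);
          setrlimit(RLIMIT_NOFILE, &lim);

          /* "libvirt-rule-helper" is an invented name for this sketch. */
          execlp("libvirt-rule-helper", "libvirt-rule-helper", (char *)NULL);
          _exit(127);
      }
  }
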
> 
>> Unfortunately, combining the above two approaches does not multiply their
>> individual gains. Due to the concurrency and the high number of VMs (100),
>> the preexec boost does not play a significant role, and using vfork
>> diminishes concurrency as it freezes all parent threads until execve. So
>> dropping the locks and the mass close gives a 33s restart time, and adding
>> vfork to that gives a 25s restart time.
>>
>> Another approach is to use the --atomic-file option of ebtables
>> (iptables/ip6tables unfortunately do not have one). The idea is to save
>> the table to a file, edit the file, then commit the table to the kernel. I
>> hoped this could give a performance boost because we don't need to
>> load/store the kernel network table for every single rule update. In order
>> to isolate the approaches I also dropped all ip/ip6 updates, which cannot
>> be done this way. With this approach we cannot drop the ruleLock in the
>> firewall code because no other VM thread may change the tables between the
>> save and the commit. This approach gives a restart time of 25s. But it is
>> broken anyway, as we cannot be sure another application doesn't change the
>> network table between the save and the commit, in which case those changes
>> would be lost.
>>
>> After all this I think we need to move in a different direction. We could
>> add an API to all the binaries and to firewalld to execute many commands
>> in one run. We could pass the commands as arguments or write them into a
>> file which is then given to the binary.
> 
> This sounds like 'iptables-restore' command which accepts a batch of commands
> in a file. There's also ip6tables-restore and ebtables-restore.
> 
> There's no way to execute these via firewalld though.
> 
> It is also tricky when we need to check the output of certain commands
> before making a decision on which other commands to run. It would
> certainly speed things up though to load batches of rules at once.
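
To make the batching concrete, feeding a whole batch to one iptables-restore
process looks roughly like the sketch below; the chain and rules are
placeholders, and --noflush appends to the existing tables instead of
replacing them:

  #include <stdio.h>

  /* Load a batch of rules with a single iptables-restore invocation instead
   * of one iptables process (fork + mass close + exec) per rule. */
  static int load_rule_batch(void)
  {
      FILE *p = popen("iptables-restore --noflush", "w");

      if (!p)
          return -1;

      /* Placeholder chain and rules in iptables-restore's table format. */
      fputs("*filter\n"
            ":EXAMPLE-CHAIN - [0:0]\n"
            "-A EXAMPLE-CHAIN -p udp -m udp --dport 67 -j ACCEPT\n"
            "-A EXAMPLE-CHAIN -j DROP\n"
            "COMMIT\n", p);

      return pclose(p) == 0 ? 0 : -1;
  }
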
> 
> 
> IMHO the most immediately useful thing would be to work on an optimization
> to skip re-creating nwfilter rules on startup, unless we detect there
> are likely changes.
> 

Thanx! Quite simple and effective indeed. 

Nikolay



