[libvirt] [nicsysco.com] Weird Libvirt Behavior

nico nico at nicsysco.com
Wed Jan 30 02:18:21 UTC 2019


Hi folks,

First-time contributor here, but I think what I ran into is (probably) a
very rare situation.

I'm running a CentOS server (my only Linux deployment) to which customers
all over the U.S. connect to process their micro-lender businesses. There
are several VMs, among them one called a2, which runs the fortress system.
In the beginning a2's .raw file was about 10 GB, which at the time was
roughly five times the capacity it needed.

For years we had no problems, and the CentOS box would tick over day after
day without so much as a hiccup.

About three months ago a2 started to slow down, almost to the point of
timing out when applications and users logged on. The band-aid was to
regularly copy an earlier a2.raw backup over the current one, which would
rectify the problem. At first, applying this band-aid on Sunday nights was
enough, but later we had to do it twice a week, and these last couple of
weeks we have had to do it almost every night. The system also sent alerts
that a "Degraded Array event had been detected on md device /dev/md1", but
inspecting the drives showed no crisis.

Today it folded completely and brought the system down, leaving our clients
telling the customers walking into their stores that "our computers are
down". Restarting the box just brought a2 up in a paused state, from which
it never recovered; we had to use killall to get rid of it.

Having nowhere else to go with it, I decided to rebuild a2 on another,
separate drive, at least to address the degraded array alerts. While editing
the .xml file, I saw the following:

<source file='/var/lib/libvirt/:machines/a2/a2-disk1.raw'/>

What the hell was that colon doing there? I checked the size of the .raw
file: it had grown to over 96 GB. As a sanity check, I looked at the other
VMs' .xml files, and, as I expected, none of them had a colon.
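
The same sanity check (looking for a colon in each VM's disk image path) can be scripted. Below is a minimal Python sketch; the function name `colon_in_disk_sources` and the trimmed sample XML are hypothetical illustrations, not part of libvirt. It assumes the usual domain XML layout where a file-backed disk stores its path in `<devices><disk><source file='...'/>`, and it only reports what it finds; it says nothing about how the colon got there or how libvirt treats such a path.

```python
import xml.etree.ElementTree as ET


def colon_in_disk_sources(domain_xml: str) -> list[str]:
    """Return every file-backed disk source path that contains a colon."""
    root = ET.fromstring(domain_xml)  # root element is <domain>
    flagged = []
    # Assumption: file-backed disks keep the image path in the 'file'
    # attribute of <devices><disk><source/>; block disks use 'dev' and are skipped.
    for source in root.iterfind("./devices/disk/source"):
        path = source.get("file")
        if path is not None and ":" in path:
            flagged.append(path)
    return flagged


# Hypothetical, trimmed-down sample using the a2 path quoted above.
SAMPLE = """<domain type='kvm'>
  <name>a2</name>
  <devices>
    <disk type='file' device='disk'>
      <source file='/var/lib/libvirt/:machines/a2/a2-disk1.raw'/>
    </disk>
  </devices>
</domain>"""

print(colon_in_disk_sources(SAMPLE))
```

One way to use it would be to pass in the text printed by `virsh dumpxml <domain>` for each VM and look for a non-empty result.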

I removed the colon and started a2 with virsh. It fired up immediately, and
the rest of the system followed suit. No doubt that ":" was the culprit!

My question is: would that colon cause the .raw file to be appended to? We
have no idea when or how it got in there, and we haven't worked on that XML
file for a long time. And why would a2 even fire up at all?

It would be great to hear what the gurus think about that.

Thanks

Nico van Niekerk

Agoura Hills, CA 91301
