We recently had a few power outages at work, some scheduled, some not, and this played havoc with our xen servers.
One of the problems we had was that xend would not start (and thus xendomains would also not start).
Checking /var/log/xen/xend.log gave us the following snippet:
inst = XendNode()
File "/usr/lib/python2.5/site-packages/xen/xend/XendNode.py", line 164, in __init__
saved_pifs = self.state_store.load_state('pif')
File "/usr/lib/python2.5/site-packages/xen/xend/XendStateStore.py", line 104, in load_state
dom = minidom.parse(xml_path)
File "xml/dom/minidom.py", line 1913, in parse
File "xml/dom/expatbuilder.py", line 924, in parse
File "xml/dom/expatbuilder.py", line 211, in parseFile
ExpatError: no element found: line 1, column 0
[2008-03-10 21:37:40 18122] INFO (__init__:1094) Xend exited with status 1.
A quick Google search for that error turned up several people who had come across the same problem, but no actual answer!
The traceback shows xend failing to parse an XML file (load_state('pif') ends up in minidom.parse), so a bit of mental inspiration and the find command turned up /var/lib/xend/state/pif.xml, which was a 0-byte file! A comparison with a working server showed that it should (or at least could) contain this:
A copy and paste later and we had a working xend! However, it refused to create any of the xenlets:
root@xen0:/etc/xen# xm create server0.cfg
Using config file "./server0.cfg".
Error: The privileged domain did not balloon!
Despite there being plenty of RAM!
root@xen0:/var/log/xen# xm list
Name ID Mem VCPUs State Time(s)
Domain-0 0 7928 8 r----- 832.8
root@xen0:/var/log/xen# free
total used free shared buffers cached
Mem: 8119416 393028 7726388 0 11344 58832
-/+ buffers/cache: 322852 7796564
Swap: 15631224 0 15631224
An strace of the process revealed that xen really did think it had less memory available than it actually had:
[2008-03-10 21:47:48 18620] DEBUG (__init__:1094) Balloon: 131064 KiB free; 0 to scrub; need 524288; retries: 20.
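To put a number on that gap, the Balloon line above can be picked apart with a few lines of Python. This is only a sketch: the regular expression is based on the single log line quoted here, the function name is made up for illustration, and it assumes the "need" figure is in KiB like the "free" figure (the log only labels the latter).

```python
import re

# Pattern derived from the one Balloon line quoted in this post; other xend
# versions may format the message differently.
BALLOON_RE = re.compile(
    r"Balloon: (\d+) KiB free; (\d+) to scrub;\s+need (\d+); retries: (\d+)"
)

def balloon_shortfall(line):
    """Return need minus free (assumed KiB), or None if the line does not match."""
    match = BALLOON_RE.search(line)
    if match is None:
        return None
    free_kib, _scrub_kib, need_kib, _retries = (int(g) for g in match.groups())
    # Never report a negative shortfall when enough memory is free.
    return max(0, need_kib - free_kib)

line = ("[2008-03-10 21:47:48 18620] DEBUG (__init__:1094) "
        "Balloon: 131064 KiB free; 0 to scrub; need 524288; retries: 20.")
print(balloon_shortfall(line))
```

For the line above this works out to 524288 - 131064 = 393224, so xend believed it was roughly 384 MiB short even though free reported well over 7 GB unused in Domain-0.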
Now that we finally had a working xend, we decided to apply a technique we’d learned from working with Windows machines and rebooted the server. This magically fixed the memory issue, though it would have been nice to know what actually caused it and whether there was a proper fix.
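In hindsight, the empty pif.xml could have been spotted before xend ever tried to start. The sketch below walks a state directory and reports any .xml file that is empty or that minidom cannot parse, which is the same call that blew up in the traceback. It assumes the other files in /var/lib/xend/state are XML as well, the function name is hypothetical, and it is written for a current Python 3 rather than the Python 2.5 that xend itself ran on.

```python
import os
from xml.dom import minidom
from xml.parsers.expat import ExpatError

def find_broken_state_files(state_dir):
    """Return (path, reason) pairs for .xml files that are empty or unparsable."""
    broken = []
    for name in sorted(os.listdir(state_dir)):
        path = os.path.join(state_dir, name)
        if not (name.endswith(".xml") and os.path.isfile(path)):
            continue
        if os.path.getsize(path) == 0:
            # The case from this post: a 0-byte file left behind by the outage.
            broken.append((path, "empty file"))
            continue
        try:
            minidom.parse(path)
        except ExpatError as exc:
            broken.append((path, str(exc)))
    return broken

if __name__ == "__main__":
    state_dir = "/var/lib/xend/state"  # path taken from the post
    if os.path.isdir(state_dir):
        for path, reason in find_broken_state_files(state_dir):
            print(f"{path}: {reason}")
```

Running something like this after an unclean shutdown would only point at the damaged file; the contents still have to be restored by hand, as we did by copying from a working server.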