We use oVirt at work for our virtualisation platform and subsequently I've started using it for my homelab as well. I recently rebuilt the deployment at work and have taken on the "oVirt Guy" role, slowly shaping it into an automated hosting environment which is used for internal and external customer purposes. I've had a bit of a rough time over the last couple of weeks getting a few things working which has prompted me to write some thoughts about oVirt overall and specifically the pain points I ran into.
This post was accurate as of oVirt 3.6.5.
As a whole I'm very happy with oVirt. It's a fantastic project and I'm incredibly grateful to RedHat and the rest of the community for getting it where it is. For a free software project it really is brilliant. I find it difficult to publicly criticise oVirt, given that I get so much for free and it's "free as in speech" as well. It manages our small scale VM environment perfectly and is easy to understand and manage.
My history involves deploying a redesigned implementation of an existing oVirt cluster with version 3.4 and later upgrading it to 3.5 and 3.6.
The community is a massive pro that oVirt has going for it: the mailing list and IRC channel are great resources. In the handful of times I've sent a post out to the mailing list, RedHat employees have responded in a helpful manner, often resolving an issue immediately or pointing me to a bug ticket which I follow along with until the patch has been released. There's a lot of useful discussion happening on the mailing list which keeps me subscribed and I do my best to help others where I can. The IRC channel is also fairly active; I used to idle there every workday and would get good discussion.
I do feel that oVirt is under-represented within the Reddit community: it is very infrequently mentioned on /r/sysadmin and /r/ovirt is basically dead. That's not exactly a problem given the resources available on the mailing list, but it's just a little sad to see.
Patches are very welcome from outside RedHat. I was pleased when the team gladly accepted my patch. It needed substantial refactoring which I never got around to doing and someone kindly took it on and got the patch merged for me. This is something you certainly don't get with a closed source system!
All things considered, our install has been rock solid. It's been in production for about 6 months now and I cannot remember a time where an update has broken something or it blew up without me causing the issue. I definitely come across bugs with new features, and seemingly things aren't tested as fully as they could be - but once I get something working I can be almost certain it's going to keep working.
I haven't had to worry about minor version updates and I've pretty much applied them as soon as I can get a maintenance window (after testing it first of course). We'll soon see how the 4.0 release goes later this Summer though, as that'll be my first major version upgrade.
New features do have teething problems though, and I've seen this many times through the minor updates. When a new feature lands it can take .2 or .3 patch releases before it is working properly, which sadly leaves me almost always saying to my boss that we need to wait for the next release before I can finish up a particular task. First it was the engine running on FC storage, then snapshots getting stuck in an "Illegal" state and recently the console WebSocket Proxy not working for VNC displays (which I also blogged about). It isn't the end of the world and I'll try and do my bit to test out RC releases to a greater extent to see if a few more bugs can be caught before a release.
Less good stuff
Documentation - what's that?
I'm in two minds about the documentation of the project, in some instances I'm over the moon that someone clearly took a lot of effort to produce a detailed write-up about a particular feature or workflow but a lot of the time I seem to be reaching page 10 of Google without hitting anything good. The REST API in particular seems to just not be documented to any extent at all. I've relied solely on hitting the endpoints and seeing what happens or what error messages I get back. The RSDL does a reasonable job of documenting the available endpoints and their parameters but it has nothing on a documentation website like any popular REST API has.
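Since the RSDL is the closest thing to API documentation, one way to make it easier to skim is to boil it down to a list of method and path pairs. The following is only a minimal sketch: it assumes the RSDL is XML made up of `link` elements with an `href` attribute and a nested `request/http_method` element, and `SAMPLE_RSDL` is a hand-written stand-in, not real engine output. Check the element names against what your own engine returns.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape this sketch ASSUMES the RSDL has.
# It is NOT real engine output; the paths are placeholders.
SAMPLE_RSDL = """
<rsdl>
  <links>
    <link href="/api/vms" rel="get">
      <request><http_method>GET</http_method></request>
    </link>
    <link href="/api/vms" rel="add">
      <request><http_method>POST</http_method></request>
    </link>
    <link href="/api/hosts" rel="get">
      <request><http_method>GET</http_method></request>
    </link>
  </links>
</rsdl>
"""


def list_rsdl_links(rsdl_xml):
    """Return sorted, de-duplicated (HTTP method, href) pairs from an RSDL string."""
    root = ET.fromstring(rsdl_xml)
    pairs = set()
    for link in root.iter("link"):
        href = link.get("href")
        method = link.findtext("request/http_method")
        # Skip links that lack either piece of information.
        if href and method:
            pairs.add((method, href))
    return sorted(pairs)


if __name__ == "__main__":
    for method, href in list_rsdl_links(SAMPLE_RSDL):
        print(method, href)
```

Feeding it the RSDL saved from a real engine gives a flat list that is much easier to grep than the raw XML, although it still tells you nothing about what each endpoint actually does.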
Some time during the last couple of months the team moved oVirt to a new website, including the wiki. It definitely looks nicer and loads faster but the wiki (which is the main source of documentation) is now just totally fucked. Many links I have saved in my own documentation are now dead, there are dead links all over older wiki pages, the new wiki doesn't have anchor tags on each heading (so direct links to a section of a page no longer work), pages lack any table of contents at the top and (to me at least) the styling is much harder to read. How someone got the thumbs up to do this I don't know. Just look at the Administration Guide, the first thing linked to after the quick start guide. It is 129 pages of A4 and doesn't have a table of contents or allow direct linking to sections, both things that used to work beautifully on the old wiki system. It was deliberately made worse, and that winds me up. Thankfully the entire old website is still available although I'm not sure for how much longer.
I'm frequently disappointed by how error messages are reported. A lot of the time errors can be very generic. For instance, "A host failed to connect to a storage domain" leaves me asking "Which storage domain!?" and requires digging into log files. On a reasonably frequent basis I see an exclamation mark next to a host or a VM with absolutely no indication of what it is trying to alert me to.
Just today I hit a problem with a host not being able to activate and getting stuck in "Connecting" - the web interface gave no indication as to what was wrong and the logs mentioned something along the lines of being unable to acquire a migration domain. Turns out that I (stupidly, yes) forgot to attach the migration network link to the host; my fault obviously, but the errors could have probably helped me reach that conclusion a bit quicker.
On the topic of log files, they are very heavily utilised. There's a log file for every separate component of the system (VDSM, HA agent, HA broker, engine, etc.) and a few of them are filling with tens of lines per second of informational and debugging messages. This makes spotting errors hard, especially when you aren't sure exactly which component could be the one that is actually producing the error. Hint: often the vdsm.log file is a great place to start!
The fact that there are so many different log files isn't really a complaint, but it would be nice if we could have a generic error log which could leave pointers to the other files whenever a full error occurs. Or maybe I should just have a proper remote logging system which could interpret the errors! The informational messages can probably also be switched off, which should - in my opinion - be the default for a production install.
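Short of a proper remote logging setup, the "generic error log" idea can be roughed out with a small script that pulls only the error lines out of each component's log and records where they came from. This is a sketch, not a tested tool: the paths in `DEFAULT_LOGS` are the usual locations as far as I know and may differ on your install, and matching on the words `ERROR` or `CRITICAL` is an assumption about how each component tags its messages.

```python
import re
from pathlib import Path

# Usual log locations (assumption: adjust for your hosts and engine).
DEFAULT_LOGS = [
    "/var/log/vdsm/vdsm.log",
    "/var/log/ovirt-hosted-engine-ha/agent.log",
    "/var/log/ovirt-hosted-engine-ha/broker.log",
    "/var/log/ovirt-engine/engine.log",
]

# Assumption: error lines contain the level as a standalone word.
ERROR_PATTERN = re.compile(r"\b(ERROR|CRITICAL)\b")


def find_error_lines(lines):
    """Return (line number, text) for every line that looks like an error."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if ERROR_PATTERN.search(line)
    ]


def collect_errors(log_paths):
    """Scan each existing log file and return (path, line number, text) tuples."""
    hits = []
    for path in map(Path, log_paths):
        if not path.is_file():
            continue  # Not every component runs on every machine.
        with path.open(errors="replace") as handle:
            for number, text in find_error_lines(handle):
                hits.append((str(path), number, text))
    return hits


if __name__ == "__main__":
    for path, number, text in collect_errors(DEFAULT_LOGS):
        print(f"{path}:{number}: {text}")
```

The output is a single stream of `file:line: message` pointers back into the individual logs, which is roughly what I'd want the generic error log to contain.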
Some assembly required
In a homelab environment everything pretty much just works right out of the box. However, in our commercial hosting environment it's a different story (regarding live migrations in particular). Something that I didn't expect to have to do is fiddle with tuning parameters to be able to reliably live migrate VMs which are doing any reasonable amount of work. In our deployment we have half a dozen VMs which would just never successfully migrate to another host due to activity on them. These were generally web servers serving a large number of requests or databases under pretty heavy load. Nothing that is particularly unexpected in a business setting though. I had to go about tuning a handful of parameters relating to the number of concurrent migrations allowed, the maximum bandwidth, the timeout and allowed downtime. Thankfully, a super response from a RedHat employee (whose name has escaped me - sorry) told me exactly which settings should be changed as well as having a guess as to what I should set them to. His suggestion happened to be spot on and works great for me. The issue though was that in order to restart VDSM and get the new config file read, the host has to be in maintenance mode and therefore not running any VMs. Can you see the catch-22 here? It was a disappointment, but I'm happy I got it working. I plan to blog about this issue in more depth in future.
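To see at a glance which of these knobs a host already overrides, a short script can read the VDSM config and report on them. Treat this as a hedged sketch: the option names in `MIGRATION_OPTIONS` are the ones I believe map to concurrency, bandwidth, timeout and downtime in the `[vars]` section of `vdsm.conf`, so verify them against your VDSM version, and the values in `SAMPLE_CONF` are placeholders, not the settings that were recommended to me.

```python
import configparser

# Assumed option names for concurrency, bandwidth, timeout and downtime.
# Verify these against the VDSM version you are running.
MIGRATION_OPTIONS = (
    "max_outgoing_migrations",
    "migration_max_bandwidth",
    "migration_progress_timeout",
    "migration_downtime",
)

# Placeholder config text; the numbers are illustrative only.
SAMPLE_CONF = """
[vars]
# Illustrative values, not recommendations.
migration_max_bandwidth = 100
migration_downtime = 1000
"""


def migration_overrides(conf_text):
    """Map each migration option to its configured value, or None if unset."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    if not parser.has_section("vars"):
        return {name: None for name in MIGRATION_OPTIONS}
    return {
        name: parser.get("vars", name, fallback=None)
        for name in MIGRATION_OPTIONS
    }


if __name__ == "__main__":
    for name, value in migration_overrides(SAMPLE_CONF).items():
        print(f"{name}: {value if value is not None else '(default)'}")
```

Running this against the real file on each host before scheduling the maintenance window at least tells you which hosts still need the change.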
Something I still haven't figured out though is whether there is any way to tell the migration agent that I just want something migrated as fast as possible with up to X seconds of downtime. As of right now, it appears to slowly scale up the allowed downtime (up to my set maximum) which can take a rather long time. In some instances I just want a VM moved as soon as possible and I don't care if it goes down for 10 seconds, instead of it trying to do its best to move it with as little downtime as possible. Maybe this is an edge case here but it would make for a handy addition.
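To illustrate why the ramp-up feels slow, here is a toy model of a stepped downtime schedule. It is purely an assumption for illustration: the exponential shape, the function name `downtime_schedule` and the example numbers are mine, and I am not claiming this is how the migration code actually computes its steps.

```python
def downtime_schedule(max_downtime_ms, steps):
    """Return the allowed downtime (ms) at each step of a hypothetical ramp.

    The ramp starts just above max_downtime_ms / steps and grows
    exponentially until it reaches max_downtime_ms on the final step.
    """
    if steps <= 1:
        # No ramp at all: the full downtime is allowed immediately.
        return [max_downtime_ms]
    offset = max_downtime_ms / steps
    base = (max_downtime_ms - offset) ** (1 / (steps - 1))
    return [round(offset + base ** i) for i in range(steps)]


if __name__ == "__main__":
    # Example numbers only: a 500 ms cap reached over 10 steps.
    print(downtime_schedule(500, 10))
    # A single step jumps straight to the cap.
    print(downtime_schedule(10000, 1))
```

In this model most of the steps sit far below the cap, so a busy VM only gets a realistic chance to converge near the end of the schedule. A single step with a generous cap is the "just move it now" behaviour I'm after.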
This is something that isn't really a fault of the oVirt team. There is a very distinct lack of other products and tools which integrate with oVirt. Want to use a tool like Terraform to manage your machines on oVirt? Tough. Need a block level backup program for your VMs? Tough (that is if you don't like Acronis). It seems that VMWare and Hyper-V have won the enterprise virtualisation race (for now at least) so everything is built for those two and not oVirt. Other than the odd script on GitHub and Acronis, I've not actually come across any product which actually integrates with oVirt at all.
Hopefully we'll see this change over the next couple of years, and I might get to use Veeam like all the cool kids do with their VMWare and Hyper-V deployments. I would guess we're where we are just because oVirt is less popular and obviously a much smaller market for commercial vendors, but it is something I can imagine will change as oVirt picks up steam.
It turns out that this post has been a lot more negative than I intended before writing it. And I feel it necessary to reiterate what I wrote at the beginning, that oVirt is an amazing project. You get so much for the grand price of $0 and despite all the nits I've picked I would argue that the support I can get on the mailing list from RedHat employees nearly makes up for it all. Except the fucking wiki migration.
oVirt is great. I wouldn't think twice about using it again in a homelab and it would definitely be at the top of my list for deployment in an enterprise. If you haven't already, give oVirt a try.
I'm just glad we didn't purchase RHEV. (Having zero experience with the RHEV support team) I don't think we would see value for money and would prefer to go with VMWare or Hyper-V, even if those are more expensive.
I'll formalise the negative remarks I've made into more constructive bugs and feature requests to go into Bugzilla within the coming days/weeks.
I welcome all comments and questions by email. My address is on the homepage of this site.