There are four rings in x86: 0, 1, 2, and 3, with lower numbers meaning higher privilege. A simple way to think about it is that code running at a given ring cannot change things running at a lower-numbered ring, but something running at a low ring can mess with anything at a higher-numbered ring.
In practice, only rings 0 and 3, the most and least privileged, are commonly used. OSes typically run in ring 0 while user programs sit in ring 3. One of the ways the 64-bit extensions to x86 'clean up' the ISA is by effectively losing the middle rings, 1 and 2. Pretty much no one cared that they are gone, except the virtualization folk.
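The privilege rule above boils down to a single comparison. Here is a toy sketch of it, a hypothetical model for illustration only, not how the hardware is actually implemented:

```python
# Toy model of x86 ring privilege: lower ring number = more privilege.
# A hypothetical sketch for illustration, not real hardware behaviour.

def can_modify(current_ring: int, target_ring: int) -> bool:
    """Code in current_ring may touch state owned by target_ring
    only if it is at least as privileged (numerically <=)."""
    return current_ring <= target_ring

# Ring 0 (an OS kernel) can mess with ring 3 (user programs)...
print(can_modify(0, 3))  # True
# ...but a ring 3 program cannot change things running in ring 0.
print(can_modify(3, 0))  # False
```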
Virtual machine monitors (VMMs) like VMware obviously have to run in ring 0, but if they want to maintain complete control, they need to keep the OS out of ring 0. If a runaway task can overwrite the VMM, it rather negates half the reason you wanted it in the first place. The obvious solution is to force the hosted OS to run in a less privileged ring, like ring 1.
This would be all fine and dandy except that OSes are used to running in ring 0 and having complete control of the system. They are set up to go from 0 to 3, not 1 to 3. In a paravirtualized (PV) environment, you change the OS so it plays nice. If you are going for the complete solution, unmodified OS and all, you have to force it into ring 1.
The problem here is that some instructions will only work if they are going to or from ring 0, and others will behave oddly if not in the right ring. I mean 'oddly' in the computer sense, i.e. a euphemism for really, really bad things will happen if you try this; wear a helmet. Deprivileging does prevent the hosted OS from trashing the VMM, and also prevents the programs on the hosted OS from trashing the OS itself, or worse yet, the VMM. This is the '0/1/3' model.
The other model is called the '0/3' model. It puts the VMM in ring 0 and both the OS and the programs in ring 3, but otherwise it does much the same things as the 0/1/3 model. The deprivileged OS in ring 3 can be walked on by user programs with much greater ease, but since there are fewer ring transitions, an expensive operation, it can run a little faster. Speed traded for security.
Another slightly saner way to do 0/3 is to have the CPU maintain two sets of page tables for what used to run in rings 0 and 3: one set for the OS, the other for the old ring 3 programs. This way you have a fairly robust set of memory protections to keep user programs out of OS space, and vice versa. Of course this once again costs performance, just in a different way, landing you at the other end of the speed versus security tradeoff.
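The dual page table idea can be sketched as a toy model. The page numbers and names below are invented for illustration; the point is that the user-mode view simply omits the OS's pages, so a wild user pointer has nothing to land on:

```python
# Toy sketch of the dual-page-table trick in the 0/3 model.
# Page numbers and frame names are invented for illustration.

kernel_view = {0: "os-code", 1: "os-data", 2: "app-code", 3: "app-data"}

# The user-mode view omits the OS pages entirely.
user_view = {page: frame for page, frame in kernel_view.items()
             if not frame.startswith("os-")}

def translate(page_table, page):
    """Return the mapped frame, or fault if the page is absent."""
    if page not in page_table:
        return "page-fault"
    return page_table[page]

print(translate(kernel_view, 0))  # os-code: the OS sees everything
print(translate(user_view, 0))    # page-fault: OS pages invisible to apps
print(translate(user_view, 2))    # app-code: apps still see their own pages
```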
To sum it all up: in 0/1/3 you get security, but take a hit when changing from 3 to 1, 3 to 0, or 1 to 0, and back again. In 0/3 you have only the 0 to 3 transition, so it could potentially run faster than a non-hosted OS. If you have a problem, though, the 0/3 model is much more likely to come down around your ears in a blue screen than the 0/1/3 model. The future is 0/3 regardless, mainly because, as I said earlier, the 64-bit extensions do away with rings 1 and 2 for practical purposes, so you are forced into the 0/3 model. That is, in computer terms, called progress, much the same way devastating crashes are called 'odd'.
On paper this seems like the perfect thing to do, if you can live with a little more instability, or in the case of 0/1/3, a little speed loss. There are drawbacks though, and they broadly fall into four categories of 'odd'. The first is instructions that check which ring they are in, followed by instructions that do not save CPU state correctly when run in the wrong ring. The last two are dead opposites: instructions that do not cause a fault when they should, and others that fault when they should not, and fault a lot. None of these make a VM writer's life easier, nor do they speed anything up.
The first one is the most obvious: an instruction that checks the ring it is in. If you deprivilege an OS to ring 1 and it checks where it is running, it will see 1, not 0. If the code expects to be in ring 0, it will probably take the 1 as an error, probably a severe one. This leads to the user seeing blue, a core dump, or some other form of sub-optimal user experience. Binary translation can catch this and fake a 0, but that means tens or hundreds of instructions in place of one, and the obvious speed hit.
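What binary translation does to a ring check can be sketched as a toy rewriter. The instruction names here are invented, not real x86; the point is that one guest instruction becomes several, faking the answer the guest kernel expects:

```python
# Toy binary translator. "READ_RING" stands in for a real instruction
# that reports the current privilege level; everything here is invented
# for illustration, not a real x86 decoder.

def translate_block(instructions):
    out = []
    for insn in instructions:
        if insn == "READ_RING":
            # The guest kernel actually runs in ring 1, but it expects
            # to see 0, so emit a longer sequence that fakes it.
            out += ["SAVE_FLAGS", "LOAD_CONST 0", "RESTORE_FLAGS"]
        else:
            out.append(insn)  # innocuous instructions pass through
    return out

guest = ["ADD", "READ_RING", "CMP 0"]
translated = translate_block(guest)
print(translated)
print(len(guest), "->", len(translated))  # one instruction became three
```

A real translator does far more (basic-block caching, branch patching), which is where the speed hit comes from.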
Saving state is a potentially worse problem. Some things in a CPU are not easily saved on a context switch. A good example is the 'hidden' segment register state. Once a segment register is loaded from its descriptor in memory, the hidden portion cannot be read back out, leading to unexpected discontinuities between the memory-resident descriptor tables and the actual values in the CPU. There are workarounds of course, but they are tricky and expensive performance-wise.
Instructions that do not fault when they should pose an obvious problem. If you are expecting an instruction to cause a fault that you later trap, and it doesn't, hilarity ensues. Hilarity for the person writing code in the cubicle next to you at the very least. If you are trying to make things work under these conditions, it isn't all that much fun.
The opposite case is things that fault when they should not, and these tend to be very common. Writes to CR0 and CR4 will fault if not running in the correct ring, leading to crashes or, if you are really lucky, lots and lots of overhead. Both of these fault behaviors, or lack thereof, are eminently correctable on the fly, but they cost performance, lots of performance.
What the entire art of virtualization comes down to is moving the OS to a place where it should not be, then running around like your head is on fire fixing all the problems that come up. There are a lot of problems, and they happen quite often, so the performance loss is nothing specific, more a death by a thousand cuts.
This is Part 2 of 4.
Part 1 can be found here.