Nested Page Tables (NPT) take this a step further and fully implement a virtualised CR3 in hardware. NPT gives each guest its own CR3, which is loaded and saved on each VM entry and exit. Since each guest has its own CR3 that is separate from the other guests' CR3s, it can do whatever it wants to the register, and the MMU works out the details in hardware.
This works by adding more recursive lookups to the MMU. Instead of just the program thinking it has a space, and the OS juggling the numbers to make things work, you also add in the VMM via the MMU. This means the programs manage their space, the guest OS manages a group of programs, and the VMM manages a group of guest OSes and the programs running inside them. Instead of two layers of lookups, you have three. Since it is all done in hardware, and the lookups are cacheable, the speed hit is very mild, especially compared to SPT or complete software management.
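To make the layering concrete, here is a minimal sketch in Python of the extra lookup. It is not how the hardware is built: the single-level dictionaries, the 4KB page size and the names translate and nested_translate are all assumptions for illustration, and real page tables have several levels.

```python
PAGE_SIZE = 4096  # assumed 4KB pages, for illustration only


def translate(page_table, address):
    """Map an address through one table of page number -> frame number."""
    page, offset = divmod(address, PAGE_SIZE)
    if page not in page_table:
        raise KeyError(f"page fault at {address:#x}")
    return page_table[page] * PAGE_SIZE + offset


def nested_translate(guest_table, host_table, guest_virtual):
    """Guest virtual -> guest physical -> host physical."""
    # First lookup: the guest OS's own table, which it manages freely.
    guest_physical = translate(guest_table, guest_virtual)
    # Second lookup: the VMM's table, which the guest never sees.
    return translate(host_table, guest_physical)


# Hypothetical tables: the guest maps its page 2 to what it believes
# is frame 5, and the VMM backs guest frame 5 with real frame 9.
guest_table = {2: 5}
host_table = {5: 9}

host_physical = nested_translate(guest_table, host_table, 2 * PAGE_SIZE + 0x10)
print(hex(host_physical))
```

The guest only ever edits guest_table, and the VMM only ever edits host_table, which is why neither needs to intercept the other.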
The magic is in the CR3 register, or in this case, the multiple CR3 registers. Nothing needs to be intercepted. Each guest has its own world that looks like a non-virtualised world, and it goes about its merry way.
The new memory modes on Pacifica take a task that is done often and repeatedly, and move it from software emulation into hardware. This has the potential to make memory-heavy guest OSes and applications vastly faster than they would be under other virtualisation mechanisms, hardware or software, that lack NPT.
Another minor new mode is called Paged Real Mode, and it does just about exactly what it sounds like. Normally, in Real Mode, there is no paging; you get what you get, hence the name real. If you are trying to virtualise a Real Mode OS, and the rest of the world is in a paged mode, you have a problem. With Paged Real Mode, you can virtualise a program that requires Real Mode and it will work correctly.
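As a rough illustration, the sketch below in Python computes an address the Real Mode way (segment times 16 plus offset) and then pushes the result through a page table, which is the essence of the idea. The function names, the dictionary page table and the frame number are hypothetical, and the 4KB page size is an assumption.

```python
PAGE_SIZE = 4096  # assumed 4KB pages


def real_mode_linear(segment, offset):
    """Classic Real Mode address: 16-bit segment * 16 + 16-bit offset."""
    return ((segment & 0xFFFF) << 4) + (offset & 0xFFFF)


def paged_real_mode(page_table, segment, offset):
    """Real Mode address arithmetic, followed by a page lookup."""
    linear = real_mode_linear(segment, offset)
    page, page_offset = divmod(linear, PAGE_SIZE)
    # The guest believes 'linear' is the real address; the table
    # quietly relocates it to wherever the VMM put that page.
    return page_table[page] * PAGE_SIZE + page_offset


# Hypothetical mapping: the guest's page 0xB8 lives in real frame 0x1234.
page_table = {0xB8: 0x1234}
print(hex(paged_real_mode(page_table, 0xB800, 0x10)))
```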
The last bit of note for memory is the Tagged Translation Look-Aside Buffer, or Tagged TLB. A TLB is a cache that stores the addresses of recently looked up pages. If you go to a memory address, and that request gets passed to the MMU for several recursive lookups, it can take time. The TLB caches the result of that lookup, so if you ask for the results at 2MB, and it passes you to the real address of 4MB, the TLB caches that 2MB is at 4MB.
Next time you look up 2MB, rather than the MMU figuring out that it is at 4MB all over again, taking many cycles, it just reads it from the cache, improving performance. Now, if you have a VMM sitting over the top of the OS, and many OSes running, you can have many things looking up 2MB and thinking they should be at the 'real' 4MB, but none of them actually should be there.
Tagged TLBs take care of this by assigning an ID to the TLB entries. This allows the hardware to know that an entry belongs to a specific guest OS, and avoid the conflicts that would otherwise arise.
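A small sketch in Python shows why the tag matters. The class TaggedTLB, its method names and the guest IDs are made up for illustration, and a real TLB is a fixed-size hardware structure rather than a dictionary. With 4KB pages assumed, 2MB is page 512, 4MB is frame 1024 and 8MB is frame 2048.

```python
class TaggedTLB:
    """Toy TLB whose entries are keyed by (guest ID, page number)."""

    def __init__(self):
        self._entries = {}

    def insert(self, guest_id, page, frame):
        self._entries[(guest_id, page)] = frame

    def lookup(self, guest_id, page):
        # Returns None on a miss; a hit only ever matches the same guest.
        return self._entries.get((guest_id, page))

    def flush_guest(self, guest_id):
        # Drop one guest's entries and leave everyone else's cached.
        for key in [k for k in self._entries if k[0] == guest_id]:
            del self._entries[key]


tlb = TaggedTLB()
tlb.insert(guest_id=1, page=512, frame=1024)  # guest 1: 2MB is really at 4MB
tlb.insert(guest_id=2, page=512, frame=2048)  # guest 2: 2MB is really at 8MB

print(tlb.lookup(1, 512), tlb.lookup(2, 512))
```

Without the guest ID in the key, the second insert would overwrite the first, and the cache would have to be thrown away on every switch between guests.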
The other main feature that Pacifica adds is called the Device Exclusion Vector (DEV), and it does what it sounds like: it excludes devices from accessing memory. To speed up a computer, and lessen the load on a CPU, modern x86 computers use a mechanism called DMA or Direct Memory Access. The old pre-DMA method was to have the CPU copy the data off of a device like a disk drive, and then save it to memory. It was slow and tied up the CPU for a long time. DMA allows the CPU to set a few addresses, like the start and end on the device and where the data should go in memory, and set up a channel between the two. The device can then do what it needs without CPU intervention, and just signal when it is done. The CPU can do other things until the process is complete.
Now, if you understood the whole concept of page tables, and how virtualisation can change things on the fly, you probably can see how a device acting directly on memory can be a very dangerous thing. If a program running on a guest tells a disk drive to load a 100K file into memory starting at 3MB, it can quickly turn into a mess. Is that 3MB as seen by the program, guest OS, or the 'real' 3MB as seen by the VMM?
To make matters worse, what if a single guest OS decides that it completely owns a particular disk drive, and another guest does also? How do you keep them from either fighting or worse yet overwriting each other's data like the pre-MMU programs of old? To be honest, the DEV will not solve all of these problems, but it will give a VMM programmer the tools necessary to enforce the things he or she needs to, and make sense of it all.
The DEV is a chunk of memory that tells a device if it is allowed to access a page of memory or not. Since a page is owned by a particular guest exclusively, this mechanism can effectively lock a device to a specific guest OS. From that point on, the guest OS would manage a disk drive, video card or whatever in the same old way it usually does. As far as the guest is concerned, it has complete access to the device, just like it would if it were not virtualised. If you set the bit to exclude the device, it does just that, and the guest in question cannot see the device in question.
This is all fine and dandy until you consider one of the fundamental properties of the K8 based chips: they have an interconnect mechanism called HyperTransport that allows devices to be directly connected to CPUs. The Nvidia nForce4 chipset will allow you to have multiple north bridges, each of which can have its own PCIe slots. This means CPU0 can talk to a drive connected to CPU3, and a video card on CPU2 can send data to CPU0 at the same time.
Now, if you have Linux running as a guest on CPU0, and Windows XP as a guest on CPU3, this can get ugly really quickly. How do you manage it all, and stop the devices from reading and writing wherever they want? DEV to the rescue. It is a lookup table that says yes or no each time a device tries to access a page. This check is made on the boundary between HyperTransport, the bus where the devices are attached, and Cache Coherent HyperTransport, a similar bus that connects the CPUs together.
This test at the boundary allows any given request to be checked once, and then passed between CPUs where it can be acted upon. Once it is marked safe, anything that should act upon it can. If the request is denied, it never gets far enough to do any harm.
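The yes or no lookup can be sketched as a bitmap with one bit per page, where a set bit means the device is excluded. This Python version is only an illustration: the class name, the single flat bitmap and the 4KB page size are assumptions, not the layout from the Pacifica spec.

```python
PAGE_SIZE = 4096  # assumed 4KB pages


class DeviceExclusionVector:
    """Toy DEV: one bit per physical page, set bit = device access denied."""

    def __init__(self, num_pages):
        self._num_pages = num_pages
        self._bits = bytearray((num_pages + 7) // 8)

    def exclude(self, page):
        self._bits[page // 8] |= 1 << (page % 8)

    def allow(self, page):
        self._bits[page // 8] &= ~(1 << (page % 8)) & 0xFF

    def is_allowed(self, address):
        """The check made once, at the boundary, for a device request."""
        page = address // PAGE_SIZE
        if page >= self._num_pages:
            return False  # outside the table: refuse rather than guess
        bit = (self._bits[page // 8] >> (page % 8)) & 1
        return bit == 0


MB = 1024 * 1024
dev = DeviceExclusionVector(num_pages=1024)  # covers 4MB of memory
dev.exclude(3 * MB // PAGE_SIZE)             # protect the page at 3MB

print(dev.is_allowed(3 * MB), dev.is_allowed(2 * MB))
```

A denied request stops at is_allowed and never reaches memory, which mirrors the idea that a request is checked once and then either passed on or dropped.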
How it is managed, and whether there is a full copy of the DEV on each CPU or only the data for devices connected to that specific CPU, is actually not up to the hardware. The DEV is only a mechanism; how you use it is a question for the VMM writers. Some may want things done in one way, others another, and some might not want to use it at all, shouldering the burden in software. Whatever the case, the DEV is just a tool, and it allows you to write a VMM in the fashion you choose.
There is one more goody in the Pacifica spec that will be of interest, an instruction called SKINIT. It is the entryway into Presidio, the upcoming security technology. While the whole Secure Virtual Machine, Trusted Platform Module, and hardware security area is out of the scope of this article, it is worth knowing that it will be done through virtualisation. Intel's LaGrande (LT) will be done in the same way. Basically, you spawn a hardware secured OS instance, and the other OSes can't see or interact with it.
Looking back over the Pacifica spec, it is clear that it is indeed a bigger body of water than a Vanderpool. The basic architecture of the K8 gives AMD more toys to play with: the memory controller and directly connected devices. AMD can virtualise both of these items directly while Intel has to do so indirectly, if it can do so at all.
This should allow an otherwise identical VMM to do more things in hardware and have lower overhead than VT. AMD appears to have used the added capability wisely, giving it a faster and, as far as memory goes, more secure virtualisation platform. µ