In a previous post I introduced the VMware vStorage APIs. It's time to explore them a bit further.

As I mentioned, the vStorage APIs come in two flavors: VADP and VAAI. Both are interesting, but for this discussion let's focus on the basics of VAAI. This set of APIs is designed so that the VMware kernel and storage arrays can integrate better with each other. The gist is that many of the tasks that used to be performed by the ESX kernel can now be offloaded to the storage arrays. This is good for several reasons:

  • The VMware kernel/physical server is freed from doing tedious but resource-intensive tasks.
  • The arrays have better "visibility" into what's being stored on them and can better optimize their functions.
  • Time-consuming tasks can be minimized and optimized.
  • Scalability is improved.

How's that done? Well, in vSphere 4.1 there are three "primitives" (it was supposed to be four, but we'll discuss that later). These primitives are specific functions that array vendors can choose to implement, and each one delegates specific functionality from the ESX host down to the array. They are:

  • __Full Copy__: Offloads the copying of data from the ESX host down to the array.
  • __Block Zeroing__: The array does the work of zeroing out large chunks of space on disk.
  • __Hardware-assisted Locking__: Extends the mechanisms VMFS uses to protect its critical metadata.

Let's look at Full Copy and Block Zeroing in this post. I'll follow up with another post on Hardware-assisted Locking.

Full Copy

One of the most common tasks in VMware is the creation of virtual machines (VMs), typically from a template. Without VAAI, the ESX host tasked with creating the copy does all of the I/O: it literally reads each block from the template VM and writes it to the destination VM. This is very time consuming and can be quite a burden on the ESX host, not to mention the HBA, the SAN, and so on. With __Full Copy__ enabled, the ESX host is still involved in the copy operation, but mostly as a controller; the vast majority of the work is done by the array itself. This not only frees the host from doing the I/O, it can dramatically improve the performance of the operation. We have seen performance improve with some arrays by as much as 10 times, and as arrays get smarter about this, there's no reason not to expect even higher numbers.
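If it helps to see the difference spelled out, here's a rough Python sketch. It's purely conceptual, not VMware code: FakeLun and FakeArray are toy in-memory stand-ins, and extended_copy is just my shorthand for the SCSI EXTENDED COPY (XCOPY) exchange that Full Copy is built on. The point is simply where the data moves.

```python
# Conceptual sketch only, not VMware code: it contrasts the host-driven copy
# path with the path that Full Copy offloads to the array via XCOPY.
# FakeLun / FakeArray are toy in-memory stand-ins for real LUNs and a real array.

BLOCK = 1 << 20  # 1 MiB copy granularity, just for this illustration


class FakeLun:
    def __init__(self, size):
        self.data = bytearray(size)

    def read(self, offset, length):
        return self.data[offset:offset + length]

    def write(self, offset, chunk):
        self.data[offset:offset + len(chunk)] = chunk


class FakeArray:
    """Stands in for the array-side handling of a copy descriptor."""

    def extended_copy(self, src, dst, offset, length):
        # The data movement happens entirely inside the array;
        # nothing crosses the SAN to the ESX host.
        dst.data[offset:offset + length] = src.data[offset:offset + length]


def host_driven_copy(src, dst, length):
    """Without VAAI: every block is read by the host and written back out."""
    for pos in range(0, length, BLOCK):
        dst.write(pos, src.read(pos, BLOCK))


def offloaded_copy(array, src, dst, length):
    """With Full Copy: the host sends one small command per extent."""
    array.extended_copy(src, dst, 0, length)


template, clone = FakeLun(8 * BLOCK), FakeLun(8 * BLOCK)
offloaded_copy(FakeArray(), template, clone, 8 * BLOCK)
```

In the host-driven version every byte of the template makes a round trip through the ESX host; in the offloaded version the host sends a handful of small copy descriptors and the array shuffles the blocks internally.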

Another place you end up moving lots of bits is Storage vMotion. Arrays that implement __Full Copy__ can offload this function from the ESX host as well. As with VM provisioning, the ESX host is still involved in the process, but only as a controlling mechanism; the bulk of the I/O is handled by the array. This also brings a significant performance increase, though in our experience not as dramatic as during provisioning. I fully expect this to get better over time as well.

Block Zeroing

When you create a VM on block storage, you have three options for how to allocate the space: thin provisioned, zeroedthick, and eagerzeroedthick. The first basically tells VMware not to pre-allocate any space on disk for the VM: the VM thinks it has a disk of size x, but space is only consumed when the guest OS actually writes. Very cool, but it has several performance implications, which we'll discuss some other time.

With the zeroedthick format, VMware pre-allocates all of the space for the VMDK disk image but defers the zeroing (blanking) of that space until the first time the guest OS writes to each block. Your VMDKs are created very quickly, but you pay a performance penalty the first time each block is touched, because it has to be zeroed before your actual data is written.
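A toy model makes the penalty easy to see. This has nothing to do with the real VMFS code; it just counts I/Os, so you can see that the first write to each block carries an extra zeroing operation while later writes don't.

```python
# Toy illustration (not VMware code) of the zeroedthick first-write penalty:
# the first write to each block costs an extra, one-time zeroing operation.

BLOCK = 1 << 20


class ZeroedThickDisk:
    def __init__(self, blocks):
        self.backing = bytearray(blocks * BLOCK)   # space is pre-allocated...
        self.zeroed = [False] * blocks             # ...but not yet zeroed
        self.ios = 0

    def write(self, block_no, payload):
        if not self.zeroed[block_no]:
            # First touch: the block must be zeroed before the guest's data lands.
            self.backing[block_no * BLOCK:(block_no + 1) * BLOCK] = bytes(BLOCK)
            self.zeroed[block_no] = True
            self.ios += 1                          # the extra, one-time I/O
        start = block_no * BLOCK
        self.backing[start:start + len(payload)] = payload
        self.ios += 1


disk = ZeroedThickDisk(blocks=4)
disk.write(0, b"guest data")   # costs 2 I/Os (zero, then write)
disk.write(0, b"guest data")   # costs 1 I/O (already zeroed)
print(disk.ios)                # -> 3
```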

In contrast, the eagerzeroedthick format pre-allocates all of the space for the VMDK disk image and zeroes out all of the blocks at the same time. As you might imagine, this takes a long time, and the ESX host has to write all of those zeros. The good news is that afterwards the VM pays no penalty when it starts using its disk image, and for many VM administrators that is enough to justify the deployment cost. __Block Zeroing__ alleviates this trade-off, giving you something close to the deployment convenience of zeroedthick with the runtime speed of eagerzeroedthick. Naturally, the API does this by offloading the zero-writing down to the array. It still takes time, but it is faster, and the ESX host doesn't have to generate all of those I/Os.
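Here's the same idea in sketch form, again purely illustrative: write_same stands in for the SCSI WRITE SAME command that Block Zeroing is based on, and the counters only exist to show how much zero traffic the host does or doesn't have to push.

```python
# Conceptual sketch, not VMware code: the host writing zeros itself versus
# offloading the job to the array with a WRITE SAME-style command.

BLOCK = 1 << 20
ZEROS = bytes(BLOCK)


def host_zeroes(lun_write, length):
    """Without VAAI: the host pushes 'length' bytes of zeros over the SAN."""
    bytes_sent = 0
    for pos in range(0, length, BLOCK):
        lun_write(pos, ZEROS)
        bytes_sent += BLOCK
    return bytes_sent


def offloaded_zeroes(write_same, length):
    """With Block Zeroing: one small command per extent; the array writes the zeros."""
    write_same(offset=0, length=length, pattern=b"\x00")
    return 0  # essentially no payload leaves the host


# Minimal in-memory stand-ins so the sketch actually runs.
backing = bytearray(64 * BLOCK)

def lun_write(pos, chunk):
    backing[pos:pos + len(chunk)] = chunk

def write_same(offset, length, pattern):
    backing[offset:offset + length] = pattern * length

print(host_zeroes(lun_write, 64 * BLOCK))        # 67108864 bytes pushed by the host
print(offloaded_zeroes(write_same, 64 * BLOCK))  # 0 bytes pushed by the host
```

The end state on disk is the same; the difference is that in the second case the host sent one small command per extent instead of megabytes of zeros.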

One of the neat things that comes from this is that an array that supports both thin provisioning at the hardware level and VAAI can start doing some nifty tricks. If the LUN is thin provisioned, the array doesn't actually have to write zeros to each block. VAAI tells the array to zero them, and the array responds immediately that it did, when in reality it simply threw those operations away. You see, blocks that have never been used don't need to be zeroed out (they are all zeros already). The array is smart enough to know that, and it doesn't allocate the space until you really, really need it.
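A tiny model shows why the array can get away with this. It isn't any vendor's implementation; it just captures the fact that an unallocated page already reads back as zeros, so a zero-write aimed at it can be acknowledged and quietly dropped.

```python
# Toy model (no particular vendor's implementation) of a thin-provisioned,
# VAAI-aware LUN: zero-writes to unallocated pages are acknowledged and dropped.

PAGE = 1 << 20


class ThinLun:
    def __init__(self):
        self.pages = {}                              # only pages with real data use space

    def write_same_zero(self, page_no):
        if page_no not in self.pages:
            return "acknowledged, nothing written"   # the "throw the operation away" case
        self.pages[page_no] = bytes(PAGE)            # allocated page: actually zero it
        return "zeroed in place"

    def read(self, page_no):
        return self.pages.get(page_no, bytes(PAGE))  # unallocated pages read as zeros


lun = ThinLun()
print(lun.write_same_zero(7))    # -> acknowledged, nothing written
```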

Also, arrays that support thin provisioning often provide the ability to reclaim "zeroed pages." With thin provisioning, LUNs start off small but grow over time, even if the guest OS has deleted files. The OS and the array don't communicate, so the array typically doesn't know that a file has been deleted and that its blocks can be reclaimed. There are some neat ways of solving this with physical hosts, but they don't map all that well to VMware VMDK files. That was a big problem, until VAAI and __Block Zeroing__. With this combo, smart arrays can now give you space efficiency and ... well, space efficiency. Trust me. It's very nice.
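If a picture helps, here's a generic sketch of the "reclaim zeroed pages" idea (not any specific vendor's feature): once the deleted blocks have been overwritten with zeros, any allocated page that is all zeros can simply go back to the free pool.

```python
# Generic sketch of zero-page reclamation on a thin-provisioned LUN:
# pages that contain nothing but zeros are released back to the free pool.

PAGE = 1 << 20


def reclaim_zero_pages(allocated_pages):
    """allocated_pages: dict of page_no -> bytes. Returns the reclaimed page numbers."""
    reclaimed = [n for n, data in allocated_pages.items() if data == bytes(PAGE)]
    for n in reclaimed:
        del allocated_pages[n]      # the logical page now reads as zeros from the pool
    return reclaimed


pages = {0: b"live data".ljust(PAGE, b"\x00"), 1: bytes(PAGE)}
print(reclaim_zero_pages(pages))    # -> [1]  (only the all-zero page is reclaimed)
```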