I ran into a slightly confusing problem today - our SQL servers are all created with 4 disks on 4 separate LUNs (System, Swap, SQL Data and SQL Logs). When viewing the server through Virtual Center I couldn’t see all of the LUNs, just the System LUN. It’s not a major problem as the VM can see the storage, but a little annoying when you have to remember what LUN the disks are on.
Slightly more distressing was the fact that the System-LUN was running out of space - fast. A LUN that should have had about 150GB free was running dangerously low. On investigation I found various snapshot files being stored on the System-LUN, which is where the VM’s VMX, vswap etc. are situated. These were the snapshot delta files of the additional disks, which were on other storage! This isn’t apparent at first, as the disk snapshots are named sequentially by ESX, so a VM with 4 disks on separate LUNs will in fact create 4 snapshot files on the SYSTEM-LUN named VM01-00001.vmdk, VM01-00002.vmdk, VM01-00003.vmdk and VM01-00004.vmdk. 00001 is for the System disk, 00002 is for the Swap disk, and so on. This means that the IO on that LUN has been multiplied, and the storage space is shrinking very rapidly.
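To get a feel for how much space the snapshot files are eating, something like the following rough Python sketch can total them up for one VM folder. It assumes the naming shown above (a VM name, a dash, a 5 or 6 digit sequence number, optionally followed by -delta); the function names and the pattern are my own illustration, not anything VMware ships, and the script only reads file names and sizes.

```python
import os
import re

# Assumed naming pattern, based on the file names described above
# (e.g. VM01-00001.vmdk). Adjust it if your files are named differently.
SNAPSHOT_NAME = re.compile(r"^(?P<vm>.+)-(?P<seq>\d{5,6})(?:-delta)?\.vmdk$")


def find_snapshot_files(directory):
    """Return (filename, size_in_bytes) for files that look like snapshot disks."""
    found = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and SNAPSHOT_NAME.match(name):
            found.append((name, os.path.getsize(path)))
    return found


def total_snapshot_bytes(directory):
    """Sum the sizes of all snapshot-looking files in one directory."""
    return sum(size for _, size in find_snapshot_files(directory))
```

Pointing total_snapshot_bytes at the VM’s folder on the System LUN gives a single number to keep an eye on while snapshots are open.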
A little more digging and it seems that this is by design - snapshots are not meant to be kept for very long, and I think VMware made a deliberate decision to make it difficult to do so. Any virtual disk created for a VM, let’s call it VM01, is named VM01.vmdk. When additional virtual disks are created through vCenter on a different LUN, they are still named VM01.vmdk - there’s no conflict because they’re in different locations. However, when vCenter takes a snapshot it places the delta files with the original disk, and because each one would otherwise get the same name as the existing disk, it starts to enumerate them.
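Since the enumerated names alone don’t tell you which base disk a delta belongs to, one way to check is to read the small text descriptor of the snapshot disk. The sketch below assumes the descriptor contains a parentFileNameHint line pointing at the parent disk, which is what I would expect for snapshot disks; the sample descriptor text is made up for illustration and the function name is my own.

```python
import re

# Matches a line such as: parentFileNameHint="/vmfs/volumes/SWAP-LUN/VM01.vmdk"
PARENT_HINT = re.compile(
    r'^\s*parentFileNameHint\s*=\s*"(?P<parent>[^"]*)"', re.MULTILINE
)


def parent_of(descriptor_text):
    """Return the parent disk recorded in a descriptor, or None if there is none."""
    match = PARENT_HINT.search(descriptor_text)
    return match.group("parent") if match else None


# Illustrative descriptor content only, not copied from a real VM.
example = '''# Disk DescriptorFile
version=1
createType="vmfsSparse"
parentFileNameHint="/vmfs/volumes/SWAP-LUN/VM01.vmdk"
'''

print(parent_of(example))
```

Run against each enumerated descriptor in turn, this gives a quick map from VM01-00001.vmdk and friends back to the LUN the real disk lives on.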
This is bad for a number of reasons - the most prominent of which is that if the snapshot file grows large, vCenter does not handle the commit well. In fact, neither does ESX, but I’ll get to that. vCenter will time out on any operation that takes more than 15 minutes, so a commit of a 10GB snapshot will, to all intents and purposes, look in vCenter like it has failed. On top of that, the enumeration of snapshot delta files can cause confusion as to which disk each one actually belongs to, and if that happens, committing
We all know snapshots are performance killers, but the functionality they provide is not insignificant, and as with most things a balance has to be struck between the functionality and the performance.
VMs created with disks on multiple LUNs in vCenter use the SAME DISK NAME (e.g. for VM01 the disks were created in /vmfs/volumes/SYSTEM-LUN/VM01.vmdk, /vmfs/volumes/SWAP-LUN/VM01.vmdk etc.).
Snapshots put ALL disk delta files onto the “system” LUN (i.e. where your VMX file is stored). This is bad because a) it multiplies your I/O on that LUN and b) you negate the benefits of storing disks on multiple LUNs.
Committing large snapshots takes time - LOTS of time - and can have a big performance hit on your server.
vCenter has a hard-coded 15 minute timeout.
When I say there’s no other way, I mean there’s no other practical way. There are methods to move the snapshot files to another LUN, but they bring some serious problems with them.