This is the first part of a post that I have been promising for quite some time. We have large amounts of data that we need to replicate to many locations. For the 2013 Hands-on Labs “season,” we used two methods to move our data. One of them is generally applicable while the other is more specialized and tailored to our use case.
In this post, I will cover the more general use case because that process is used when we must “seed” a new cloud. I think the more specific use case is the more interesting of the two cases, but we will need to save that for the next post.
I am regularly asked about why we do not use storage array replication for this process. The short answer is that we are true cloud consumers and, as such, do not have access to the back-end storage at that level. Additionally, we are not guaranteed that the same storage platform will be available in each of our clouds, and that makes array replication a challenge. Case in point, we had EMC VNX arrays in our development environment, but used EMC XtremIO to host most of the labs at VMworld. Making no assumptions about the storage platform requires us to build a bit of flexibility into our solution and work at a higher level. The benefit is that the concepts can be applied to most environments, possibly even yours.
In previous posts, we have covered the vCD Export and Import processes. So, for the sake of this post, let us assume that there is an exported vPod on the disk of our “catalog host” in the source cloud and we need to get it onto the “catalog host” in our target cloud so it can be imported to its local vCD. What are these “catalog hosts” and why are they present?
Seeding the Clouds
When standing up an entirely new cloud to host our labs, we need to push a large amount of data into that cloud. The entire HOL catalog represents around 3 TB on disk. Typically, it is not practical to import the vPods directly into vCD over a long distance link. Unreliability of WAN links, high latency, and low bandwidth all play into the challenges with that process. My thought was that the export has to live somewhere, so why not as close as possible to the source… and target?
Yes, we could export all of the vPods, back them up to tape or put them on an external hard drive, ship it, and import it on the target side. As I mentioned, we are true cloud consumers, so that has its own difficulties since we don’t typically have access to the hardware “behind the curtain.” We have a great relationship with our provider — VMware’s internal OneCloud team — so we could probably get it done somehow: never underestimate the bandwidth of a box of backup tapes… but the latency is a killer! If we only had to do this once, and we knew that all of the vPods would be 100% complete by a certain date, that might work. As it is, we generally have to be more flexible than that when dealing with software release schedules and incremental updates to our labs.
So, part of our (non-FedEx-net) solution is to leverage local “catalog hosts.” These machines are well-connected (LAN) to a vCD instance and receive the vCD export files (OVFs) from their local “source” clouds and store them for transmission to their peers in the “target” clouds. For 2013, these were simple Windows machines with Cygwin installed. We use the base Cygwin with only two packages added: lftp and rsync. These are the main tools in our replication toolbox, but we also install PowerCLI with the vCloud Director option and have created a small library of functions to help us with processing the exports. Many of those have been outlined in previous posts, and a few new ones will be covered in future posts.
In our simplest use case, we have a vPod called BASE-VPOD-v1 exported to E:\HOL-Lib\BASE-VPOD-v1 on the source cloud’s catalog host. We need to get those files to the E:\HOL-Lib directory on the target cloud’s catalog host. In Cygwin, these paths translate to /cygdrive/e/HOL-Lib/BASE-VPOD-v1 and /cygdrive/e/HOL-Lib, respectively.
This isn’t as simple as mapping a drive or shooting the file to a UNC path. Typically, these are WAN connections and those file copies would time out long before they got close to being finished. Because most of the time we are using the Internet for our transfers (no VPN in place), our stuff usually uses SSH with public keys for authentication and rides along an encrypted connection. We’re generally not moving sensitive information, but we like to follow good security practices wherever we can.
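The key-based setup is a one-time task per pair of catalog hosts. A minimal sketch of it, assuming an OpenSSH client under Cygwin (the `catalog` account and `source-catalog` hostname are examples, not from our environment):

```shell
# Generate a dedicated key pair for the replication account. -N '' gives an
# empty passphrase, which suits unattended transfer jobs. We write to a temp
# directory here for illustration; in practice you would use ~/.ssh.
KEYDIR="$(mktemp -d)"
ssh-keygen -t rsa -b 2048 -N '' -q -f "$KEYDIR/catalog_sync"

# One-time, interactive step: install the public key on the peer catalog host:
#   ssh-copy-id -i "$KEYDIR/catalog_sync.pub" catalog@source-catalog
# After that, SFTP-based tools can authenticate without embedding a password
# in the URL.
ls "$KEYDIR"
```

Once the public key is on the far side, the password portion of an sftp:// locator can be omitted entirely.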
We could just open an SFTP session between the catalog hosts and push or pull the files across the link. That works, but isn’t terribly efficient. One limitation we discovered with basic SFTP and SCP clients was that they are very serial: files are transferred one at a time. Sure, we can spawn multiple sessions and some tools handle multiple simultaneous files, but each file is generally able to use only a single stream, which can limit the amount of bandwidth any given transfer can consume.
This can be problematic for us since we may have a bunch of small files (less than 1 GB) in a vPod with a few big files (30+ GB). In order for us to begin the import of the vPod, we need the entire set of files replicated. We needed a way to replicate the big files quickly, too. Oh, and the issue of resuming where we left off in the event of a network disconnect — that would be nice, too. There is nothing like waiting half a day for a 25 GB file to move, only to have something happen during the last 5 minutes of the transfer that requires you to start over! After trying a few options, many of which required a high degree of manual interaction, we settled on LFTP due to both its functionality and ease of use.
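For the single-big-file case, LFTP’s pget command covers both pain points at once: segmented downloads and resume. A hedged sketch (the hostname, paths, and `big-disk-0.vmdk` filename are placeholders; the command is printed here rather than executed):

```shell
# pget fetches one file in multiple segments (-n 5 = five streams); -c resumes
# a partial transfer instead of starting the 25 GB file over after a drop.
BIG='sftp://catalog@source-catalog:/cygdrive/e/HOL-Lib/BASE-VPOD-v1/big-disk-0.vmdk'
PGET_CMD="lftp -c \"pget -n 5 -c $BIG -o /cygdrive/e/HOL-Lib/BASE-VPOD-v1/big-disk-0.vmdk\""

# Shown with echo for illustration; drop the echo to actually run it.
echo "$PGET_CMD"
```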
This is an example of an LFTP command line, run from the target-catalog catalog host. It replicates (pulls) BASE-VPOD-v1 from the source-catalog catalog host:
lftp -c "mirror --use-pget-n=5 -p --verbose --parallel=4 sftp://username:xxx@source-catalog:/cygdrive/e/HOL-Lib/BASE-VPOD-v1 /cygdrive/e/HOL-Lib/"
This command uses the mirror feature of LFTP and additionally leverages multi-file transfer (--parallel=4), multiple streams per file (--use-pget-n=5), and public key authentication via the SFTP locator. In this case, “xxx” is just a placeholder for the password if you wanted to use a password instead of keys. The -p option tells LFTP not to monkey with the file permissions, and --verbose is there because I like to see what is going on with the transfers. The following real example illustrates a run of this command to replicate the HOL-PRT-1309-v3 template.
catalog@target-catalog ~ $ lftp -c "mirror --use-pget-n=5 --parallel=4 -p --verbose sftp://catalog:xxx@source-catalog:/cygdrive/e/HOL-Lib/HOL-PRT-1309-v3 /cygdrive/e/HOL-Lib/"
Transferring file `HOL-PRT-1309-v3.ovf'
Transferring file `vm-2a953971-ef0c-4c3b-b4ea-16ec183d6dfd-disk-0.vmdk'
Transferring file `vm-50c9fc1b-e64e-4a33-923a-e7f4dd99120b-disk-0.vmdk'
Transferring file `vm-70e426ed-c808-4784-aba8-46041285a21a-disk-0.vmdk'
Transferring file `vm-70e426ed-c808-4784-aba8-46041285a21a-disk-1.vmdk'
Transferring file `vm-7afc2822-4e71-4f0c-ab48-9b4706a7591e-disk-0.vmdk'
Transferring file `vm-b3aa7185-30ec-4a07-852a-ad5c6f1d8c7a-disk-0.vmdk'
Transferring file `vm-e0fbf367-ab4a-497e-b7b9-c8723fe0e175-disk-0.vmdk'
Transferring file `vm-edc1c0f9-9c16-4d2d-8235-e4ae7b9a61e7-disk-0.vmdk'
Transferring file `vm-edc1c0f9-9c16-4d2d-8235-e4ae7b9a61e7-disk-1.vmdk'
Transferring file `vm-f61cd448-da90-4cc8-ac8c-94c19d96a40f-disk-0.vmdk'
catalog@target-catalog ~ $
Once this job has completed, we have the initial copy of our vPod on the catalog host that is local to the target cloud. We then use our Import-VPod PowerCLI function to import it into the target cloud for further processing.
You can vary the number of simultaneous files and number of streams per file to achieve the best results in your environment and for your use case. For example, you want to consider the available bandwidth and the profile of your data: are all of your files of similar size? If not, you may want to allocate more streams per file and handle fewer files at once.
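As a rough rule of thumb, the total number of concurrent streams on the link is about --parallel times --use-pget-n. One hedged way to think about tuning is to hold that total steady and split it according to your data profile; the numbers and placeholder source/destination below are illustrative, not prescriptions:

```shell
# Hypothetical tuning helper: keep the total stream count fixed and trade
# files-in-flight against streams-per-file based on the data profile.
STREAM_BUDGET=16

# Profile 1: many similar-sized files -> more files at once, few streams each.
PARALLEL=8
PGET_N=$((STREAM_BUDGET / PARALLEL))   # 2 streams per file

# Profile 2: a few 30+ GB VMDKs among small files -> fewer files at once,
# more streams per file so the big transfers can fill the pipe:
#   PARALLEL=2; PGET_N=8

echo "lftp -c \"mirror --use-pget-n=$PGET_N --parallel=$PARALLEL -p <source> <dest>\""
```

Either split uses the same sixteen streams; the second finishes the huge disks sooner at the cost of keeping more of the small files waiting in the queue.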
Depending on the kind of bandwidth we are dealing with, we may transfer each of our vPods using LFTP. If the bandwidth is high enough, this is pretty much a no-brainer operation, especially since LFTP can mirror an entire directory tree with a single call. There are even some advanced options (--only-missing and --only-newer) available to assist with refreshing the trees and keeping them in sync. If bandwidth is on the lower end, or if we are refreshing a version of an existing vPod, our second replication mechanism using rsync can save a lot of time.
If you have been following along with this series, or you have used more than one of our labs, you probably understand that we leverage the same base content to create most of our labs. This means that there should be a large percentage of commonality among the vPods (vCD vApps), and that translates to commonality of the data on disk when we export those vPods from vCD for replication to other clouds. We can actually use that to speed up transfers between sites.
And that, my friends, is the topic of my next post – Hands-on Labs: Differential Replication.