By Tom Scanlan, Senior Consultant, VMware

Linux containers have been great for stateless workloads. While stateful workloads can also run in containers, a limiting factor has been that most methods of providing storage for the state have been confined to the host serving the container. And if that host fails, the storage becomes inaccessible. Not so with vSphere Integrated Containers (VIC): VIC leverages vSphere’s advanced persistence capabilities to allow access to data even in the event of a host failure.

Let’s dive into some of the storage concerns with standard container solutions and see how they are addressed in vSphere Integrated Containers.

Docker image layers

Container images are different from running containers. The images are static artifacts that are built and stored in Docker registries for use when running a new container. Images are just a set of files that make up the filesystem available to a running container.

Running containers are composed of layers of images applied in a stack. The underlying layers remain unaltered. While running, any changes to the filesystem will be persisted to an extra layer called the container layer. The container layer is removed when the container is removed.

Here is an example to illustrate what’s going on in image layers. This Dockerfile builds on top of the alpine:3.6 image layer.
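For illustration, a minimal Dockerfile along these lines would do (the specific packages and commands here are just placeholders):

    FROM alpine:3.6
    # each instruction below adds a new read-only layer on top of the alpine base
    RUN apk add --no-cache curl
    RUN echo "demo" > /etc/demo-version
    CMD ["/bin/sh"]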

Building an image using this Dockerfile results in an image with several layers that can be seen using docker history.
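For example, building and tagging the image and then inspecting its layers might look like this (the tag matches the demo:0.1 reference below):

    docker build -t demo:0.1 .
    docker history demo:0.1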

The alpine layer is there at the bottom, and our additional commands have generated a few more layers that get stacked on top to form the image we want. The final image that has all of the files we need is referenced by the ID 31af83e49686 or by the tag demo:0.1. Each of those layers can be stored in a registry and reused by future images.

When we run the container, an additional container layer is created that allows modification of the filesystem by the running system. If no changes are made to the filesystem, this layer remains empty. Let’s run the image as a container and modify its filesystem:
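A minimal sketch of that, using the demo:0.1 image from above (the file name demo-state matches the description below):

    # start a container from the image and get a shell inside it
    docker run -it --name demo demo:0.1
    # inside the container, write a file into the container layer
    / # echo "some important state" > /demo-state
    / # cat /demo-state
    some important state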

As long as the container runs, the file demo-state will exist and have the same contents. Stopping and starting the container has no effect on the container layer, so demo-state will still exist.

If we stop and remove the container, the container layer will be removed as well as our hold on the demo-state file. Running a new instance of the container will have a new empty container layer.
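Continuing the sketch above, removing the container discards the container layer along with demo-state:

    # remove the container; its container layer (and demo-state) goes with it
    docker stop demo && docker rm demo
    # a brand-new container starts from an empty container layer
    docker run --rm demo:0.1 ls /demo-state    # fails: No such file or directory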

For more details on the structure of images, see https://docs.docker.com/engine/userguide/storagedriver/imagesandcontainers/

I want to call out a distinction here: the container layer is ephemeral storage. It’s around only as long as the container, and is lost when the container goes away. This is in contrast to data that needs to remain after the container is removed. I’ll talk about persistent storage shortly.

Why aren’t containers persistent?

The lack of persistence in the image layers is by design. By choosing to only allow ephemeral storage, we can ensure the application we put into a container image is always the application being run. Images are versioned so that we can be sure that two systems are running exactly the same code. Re-running the same image will always produce the same running conditions.

The immutability of the images results in better debugging, smoother deployments and the ability to quickly replace running applications that appear to be in a bad state.

Let’s flip it around—if container images were able to change, how could I be sure running a specific image today and running it tomorrow would have the same results? How could I debug an image on my laptop and be sure I am seeing the same code that is having a problem in QA? If an application has persisted state in its local image, how do other instances of the application container get access to that data?

OK, lack of persistence is good for the container… how do I save data?

At some point, most of our applications need to leverage some data. How do we keep state between runs of an image? There are at least a few patterns:

  1. replication
  2. recreate data or replay transactions
  3. filesystem persistence

Replication

If you can design your application to replicate data to other containers and ensure at least one copy is always running, then you’re using this pattern.

An example of this pattern is running a Cassandra database cluster, where replication enables the dynamic addition or removal of nodes. If you’re running Cassandra in containers and being good about bootstrapping and removing nodes, then you could run a stable database cluster with nothing more than a basic docker run, as sketched below. The persistence is handled by storing data in the container layer. As long as enough containers are up, persistence is maintained.
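A rough sketch of that, assuming the official cassandra image and its CASSANDRA_SEEDS environment variable (neither is specific to this post):

    # a user-defined network gives the containers name resolution for each other
    docker network create cassandra-net
    docker run -d --name cassandra-1 --network cassandra-net cassandra:3.11
    # new nodes join by pointing at an existing node as a seed;
    # data is replicated between nodes, so losing one container does not lose the data
    docker run -d --name cassandra-2 --network cassandra-net \
        -e CASSANDRA_SEEDS=cassandra-1 cassandra:3.11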

Re-create or replay data on loss

If you can design your application to be able to recreate any needed data, you’re using this pattern.

An example of this might be a prime number finder tasked with finding a set of prime numbers in a broader range of numbers by counting up from the low end of the range and testing each number for primality. If the primes are stored for future use, but the data is lost for any reason, a new instance of the process can scan the same range and would find the same numbers that the original process found. In this case, the data is inherent to the requirements of the process, so the data can be recreated.

A more efficient variant of this process would store each prime number found and the last number tested in Apache Kafka. Given a consistent initial range and the transaction log, you can quickly get back to a known state without retesting each number for primality, and continue processing from there.

Persistent filesystem

We can expose a persistent filesystem that lives on the Docker host into the container. This is a pattern most of us are familiar with, as it has been the way to handle data persistence since tape drives were invented.

Docker has two ways of handling a persistent filesystem in containers: bind mounts and volumes. Both of these expose a filesystem into the container from the running host. They are very similar, but bind mounts are a bit more limited than volumes.

Bind mount

This is simply mounting a host filesystem file or directory into the container. This is not very different from mounting a CD-ROM onto a virtual machine (VM). The host path may look like /srv/dir-to-mount, and inside the container you may be able to access the directory at /mnt/dir-to-mount.
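Using the example paths above, a bind mount is requested with the -v flag on docker run:

    # mount the host directory /srv/dir-to-mount into the container at /mnt/dir-to-mount
    docker run --rm -it -v /srv/dir-to-mount:/mnt/dir-to-mount alpine:3.6 ls /mnt/dir-to-mount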

Bind mounting is used all the time in development, but should never be used in production. It ties the container to the specific host at runtime, and if the host is lost, so is the data. Volumes are the answer for production requirements.

Volumes

Volumes are the preferred way to use persistent storage in Docker.

This is slightly different from a simple bind mount. Here, Docker creates a directory that is the volume, and mounts it just like a bind mount. In contrast to bind mounts, Docker manages the lifecycle of this volume. By doing so, it provides the ability to use storage drivers that enable the backing storage to exist outside of the host running the container.
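With native Docker, that looks something like this (the volume name is arbitrary):

    docker volume create demo-vol
    # the volume is mounted at /data; its contents outlive either container
    docker run --rm -v demo-vol:/data alpine:3.6 sh -c 'echo hello > /data/hello.txt'
    docker run --rm -v demo-vol:/data alpine:3.6 cat /data/hello.txt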

vSphere Integrated Containers leverages this to use vSphere storage types like vSAN, iSCSI and NFS to back the volume. Doing this means you can handle failures of any host running the container, and ensure access to the data in the volume can resume when the container is started on a different host.

Another example of leveraging the storage drivers of Docker volumes is shown in vSphere Docker Volume Service. This driver enables the use of vSphere-backed storage when using native Docker hosts, not vSphere Integrated Containers.
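Assuming the plugin is installed on a native Docker host and registers a volume driver named vsphere, as its documentation describes, usage is a small variation on the commands above:

    # create a volume backed by vSphere storage via the vSphere Docker Volume Service
    docker volume create --driver=vsphere -o size=1gb vsphere-vol
    docker run --rm -v vsphere-vol:/data alpine:3.6 ls /data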

For deeper coverage on volumes, see Docker’s volumes documentation. Now, let’s take a closer look at using volumes to persist data in vSphere Integrated Containers.

VIC volumes

Command line use of volumes in vSphere Integrated Containers is the same as standard Docker, with the added benefit of the storage being backed by vSphere Storage.

In vSphere Integrated Containers, if you want to use volumes that are private to a container, you can use the iSCSI or vSAN storage in vSphere. If you have data that should be shared with more than one container, you can use an NFS-backed datastore from vSphere.

When setting up a container host in vSphere Integrated Containers, you specify the datastores that will be available for use by any containers running against that host. These are specified using the --volume-store argument to vic-machine. These backing volume stores can be set or updated using vic-machine configure. Volume stores, once added, can only be removed by removing the container host, but that usually isn’t a problem.

Here is an example showing the command that would create the container host and enable it to present volumes with various backing stores.
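A sketch of such a command follows; the target, datastore names, and all labels except backed-up-encrypted and default are placeholders:

    vic-machine create \
        --target vcenter.example.com/Datacenter \
        --user administrator@vsphere.local \
        --thumbprint <vcenter-certificate-thumbprint> \
        --name demo-vch \
        --bridge-network vic-bridge \
        --image-store vsanDatastore \
        --volume-store vsanDatastore/volumes:backed-up-encrypted \
        --volume-store freenas-iscsi-datastore/volumes:default \
        --volume-store nfsDatastore/volumes:vsphere-nfs \
        --volume-store 'nfs://nfs.example.com/exports/vic-volumes?uid=1000&gid=1000:shared-nfs'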

The first volume store is on a vSAN datastore and uses the label backed-up-encrypted, so a client can type docker volume create --opt VolumeStore=backed-up-encrypted myData to create a volume in that store. The second uses cheaper storage backed by a FreeNAS server mounted using iSCSI, and is used for storing log data. Note that it has the label “default,” which means that any volume created without a volume store specified is created here. The third and fourth are two types of NFS exports: the third is an NFS datastore presented by vSphere, and the fourth mounts a standard NFS host directly (useful if you want to share data between containers).


A note regarding an NFS gotcha: NFS mounts in a container can be tricky. If you notice that you cannot read or write files on an NFS share from inside a container, you have probably hit this gotcha. Note that the final volume store above has uid and gid arguments. There are two competing concerns. First, Docker will generally run processes as uid and gid 0, that is, as root. You can change that behavior by specifying a USER in the Dockerfile or on the command line; see the Docker documentation on USER for details on how to set it. Second, NFS has many ways of applying permissions to the mounted filesystem based on uid and gid. You must ensure that the user of the running container matches the uid and gid permissions on the files exported by NFS. Finally, note that the syntax for native Docker NFS volumes and VIC NFS volumes is different, so if you’re trying to apply this to native Docker, you’ll want to start with Docker’s own documentation for NFS volumes.
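For example, a Dockerfile can create a user whose uid and gid match the export and switch to it (the 1000:1000 values here are placeholders that match the hypothetical NFS volume store in the sketch above):

    FROM alpine:3.6
    # create a user whose uid/gid match the permissions on the NFS export
    RUN addgroup -g 1000 app && adduser -D -u 1000 -G app app
    USER 1000:1000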


Once you’ve installed the VCH, you’ll notice that empty folders have been created on the respective datastores, ready to hold volume data.

Create and use volumes

Let’s go ahead and create volumes using the Docker client. Note the implied use of the default volume store in the second example.
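A sketch of those two commands, reusing the labels from the VCH sketch above (myData comes from the earlier description; the other names are placeholders):

    # assumes DOCKER_HOST points at the VCH endpoint
    # explicitly target the backed-up-encrypted volume store
    docker volume create --opt VolumeStore=backed-up-encrypted myData
    # no volume store specified, so this lands in the store labeled "default"
    docker volume create logData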

After volume creation, you’ll see that backing files for each volume have been created in the corresponding datastores.

To show the most basic level of persistence, here we run a container that drops some data on each of the datastores and check that it exists from another container. In production, this could be a database workload hosted in a container and operating on the persistent external storage.
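A sketch of that check, reusing the volumes created above:

    # drop some data into each volume from one container...
    docker run --rm -v myData:/data -v logData:/logs alpine:3.6 \
        sh -c 'echo "persist me" > /data/demo-state && echo "a log line" > /logs/app.log'
    # ...then verify from a completely separate container that the data is still there
    docker run --rm -v myData:/data -v logData:/logs alpine:3.6 \
        sh -c 'cat /data/demo-state && cat /logs/app.log'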

Right now, only native NFS volumes can be used to share data between more than one container. Here is an example of sharing some storage between containers using native NFS.
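A sketch of that, using the hypothetical shared-nfs volume store label from the VCH sketch above:

    # create a volume backed by the native NFS volume store
    docker volume create --opt VolumeStore=shared-nfs sharedData
    # one container writes to the shared volume and stays running...
    docker run -d --name writer -v sharedData:/shared alpine:3.6 \
        sh -c 'echo "hello from writer" > /shared/message && sleep 3600'
    # ...while a second container reads the same data at the same time
    docker run --rm -v sharedData:/shared alpine:3.6 cat /shared/message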

As a final note, if you have a stateful process that can handle restart, VMware HA will enable restarting the container on a new ESXi host if the original ESXi host fails. If your process can’t implement a replay or replication pattern to recover state on failure, then VMware Fault Tolerance enables transparent continuation of processing during an ESXi host failure. In this case the container VM continues running on the new ESXi host as though there were no failure of the original host. We’ll see if we can make a blog entry demonstrating the Fault Tolerance feature.

Here is an example of VMware HA helping a container resume running on a new host after failure of the initial ESXi host. This is the picture before failure:

[Image: the container VM running on its original ESXi host, before the failure]

And after causing an ESXi host failure, the container is moved to and started on a different ESXi host:

[Image: the container VM restarted on a different ESXi host, after the failure]

So, there you have it: vSphere Integrated Containers can provide resilient storage and cope with host failures. That resilience isn’t mandatory during development, but it’s definitely a boon in production.

Stay tuned to the Cloud-Native Apps blog for more posts around vSphere Integrated Containers, and be sure to follow us on Twitter (@cloudnativeapps).