VMware Cloud Provider

Using Container Service Extension (CSE) 4.1 with the Terraform VMware Cloud Director Provider v3.11.0

The Terraform VMware Cloud Director Provider v3.11.0 now supports installing and managing Container Service Extension (CSE) 4.1, with a new set of improvements, the new vcd_rde_behavior_invocation data source and updated guides for VMware Cloud
Director users to deploy the required components.

In this blog post, we will be installing CSE 4.1 in an existing VCD and creating and managing a TKGm cluster.

Preparing the installation

First of all, we must make sure that all the prerequisites listed in the Terraform VCD Provider documentation are met. CSE 4.1 requires at least VCD 10.4.2, we can check our VCD version in the popup that shows up by clicking the About option inside the help “(?)” button next to our username in the top right corner:

Check that you also have ALB controllers available to be consumed from VMware Cloud Director, as the created clusters require them for load-balancing purposes.

Step 1: Installing the prerequisites

The first step of the installation mimics the UI wizard step in which prerequisites are created:

We will do this exact step programmatically with Terraform. To do that, let’s clone the terraform-provider-vcd repository so we can download the required schemas, entities, and examples:

If we open 3.11-cse-install-2-cse-server-prerequisites.tf we can see that these configuration files create all the RDE framework components that CSE uses to work, consuming the schemas that are hosted in the GitHub repository, plus all the rights and roles that are needed. We won’t customize anything inside these files, as they create the same items as the UI wizard step shown in the above screenshot, which doesn’t allow customization either.

Now we open 3.11-cse-install-3-cse-server-settings.tf, this one is equivalent to the following UI wizard step:

We can observe that the UI wizard allows us to set some configuration parameters, and if we look to terraform.tfvars.example we will observe that the requested configuration values match.

Before applying all the Terraform configuration files that are available in this folder, we will rename terraform.tfvars.example to terraform.tfvars, and we will set the variables with correct values. The defaults that we can see in variables.tf and terraform.tfvars.example match with those of the UI wizard, which should be good for CSE 4.1. In our case, our VMware Cloud Director has complete Internet access, so we are not setting any custom Docker registry or certificate here.

We should also take into account that the terraform.tfvars.example is asking for a username and password to create a user that will be used to provision API tokens for the CSE Server to run. We also leave these as they are, as we like the "cse_admin" username.

Once we review the configuration, we can safely complete this step by running:

The plan should display all the elements that are going to be created. We complete the operation (by writing yes to the prompt) so the first step of the installation is finished. This can be easily checked in the UI as now the wizard doesn’t ask us to complete this step, instead, it shows the CSE Server configuration we just applied:

Step 2: Configuring VMware Cloud Director and running the CSE Server

We move to the next step, which is located at examples/container-service-extension/v4.1/install/step2 of our cloned repository.

This step is the most customizable one, as it depends on our specific needs. Ideally, as the CSE documentation implies, there should be two Organizations: Solutions Organization and Tenant Organization, with Internet access so all the required Docker images and packages can be downloaded (or with access to an internal Docker registry if we had chosen a custom registry in the previous step).

We can investigate the different files available and change everything that doesn’t fit with our needs. For example, if we already had the Organization VDCs created, we could change from using resources to using data sources instead.

In our case, the VMware Cloud Director appliance where we are installing CSE 4.1 is empty, so we need to create everything from scratch. This is what the files in this folder do, they create a basic and minimal set of components to make CSE 4.1 work.

Same as before, we rename terraform.tfvars.example to terraform.tfvars and inspect the file contents so we can set the correct configuration. As we mentioned, setting up the variables of this step depends on our needs and how we want to set up the networking, the NSX ALB, and which TKGm OVAs we want to provide to our tenants. We should also be aware that some constraints need to be met, like the VM Sizing Policies that are required for CSE to work being published to the VDCs, so let’s read and understand the installation guide for that purpose.

Once we review the configuration, we can complete this step by running:

Now we should review that the plan is correct and matches to what we want to achieve. It should create the two required Organizations, our VDCs, and most importantly, the networking configuration should allow Internet traffic to retrieve the required packages for the TKGm clusters to be provisioned without issues (remember that in the previous step, we didn’t set any internal registry nor certificates). We complete the operation (by writing yes to the prompt) so the second step of the installation is finished.

We can also double-check that everything is correct in the UI, or do a connectivity test by deploying a VM and using the console to ping an outside-world website.

Cluster creation with Terraform

Given that we have finished the installation process and we still have the cloned repository from the previous steps, we move to examples/container-service-extension/v4.1/cluster.

cd examples/container-service-extension/v4.1/cluster

The cluster is created by the configuration file 3.11-cluster-creation.tf, by also using the RDE framework. We encourage the readers to check both the vcd_rde documentation and the cluster management guide before proceeding, as it’s important to know how this resource works in Terraform, and most importantly, how CSE 4.1 uses it.

We will open 3.11-cluster-creation.tf and inspect it, to immediately see that it uses the JSON template located at examples/container-service-extension/v4.1/entities/tkgmcluster.json.template. This is the payload that the CSE 4.1 RDE requires to initialize a TKGm cluster. We can customize this JSON to our needs, for example, we will remove the defaultStorageClassOptions block from it as we won’t use storage in our clusters.

The initial JSON template tkgmcluster.json.template looks like this now:

There’s nothing else that we can customize there, so we leave it like that.

The next thing that we notice is that we need a valid CAPVCD YAML, we can download it from here. We’ll deploy a v1.25.7 Tanzu cluster, so we download this one to start preparing it.

We open it with our editor and add the required snippets as stated in the documentation. We start with the kind: Cluster blocks that are required by the CSE Server to provision clusters:

We added the two labels and annotations blocks, with the required placeholders TKR_VERSION, CLUSTER_NAME, and TKGVERSION. These placeholders are used to set the values via Terraform configuration.

Now we add the Machine Health Check block, which will allow to use one of the new powerful features of CSE 4.1, that remediates nodes in failed status by replacing them, enabling cluster self-healing:

Notice that the timeouts have an s as the values introduced during installation were in seconds. If we hadn’t put the value in seconds, or we put the value like 15m, we can remove the s suffix from these block options.

Let’s add the last parts, which are most relevant when specifying custom certificates during the installation process. In kind: KubeadmConfigTemplate we must add the preKubeadmCommands and useExperimentalRetryJoin blocks under the spec > users section:

In kind: KubeadmControlPlane we must add the preKubeadmCommands and controllerManager blocks inside the kubeadmConfigSpec section:

Once it is completed, the resulting YAML should be similar to the one already provided in the examples/cluster folder, cluster-template-v1.25.7.yaml, as it uses the same version of Tanzu and has all of these additions already introduced. This is a good exercise to check whether our YAML is correct before proceeding further.

After we review the crafted YAML, let’s create a tenant user with the Kubernetes Cluster Author role. This user will be required to provision clusters:

Now, we can complete the customization of the configuration file 3.11-cluster-creation.tf by renaming terraform.tfvars.example to terraform.tfvars and configuring the parameters of our cluster. Let’s check ours:

We can notice that control_plane_placement_policy = "\"\"", this is to avoid errors when we don’t want to use a VM Placement Policy. We can check that the downloaded CAPVCD YAML forces us to place double quotes on this value when it is not used.

The tkr_version and tkg_version values were obtained from the already provided in the documentation.

Once we’re happy with the different options, we apply the configuration:

Now we should review the plan as much as possible to prevent mistakes. It should create the vcd_rde resource with the elements we provided.
We complete the operation (by writing yes to the prompt) so the cluster should start getting created. We can monitor the process either in UI or with the two outputs provided as an example:

Then we can do terraform refresh as many times as we want, to monitor the events with:

Once computed_k8s_cluster_status states provisioned, this step will be finished and the cluster will be ready to use. Let’s retrieve the Kubeconfig, which in CSE 4.1 is done completely differently than in 4.0, as we are required to invoke a Behavior to get it. In 3.11-cluster-creation.tf we can see a commented section that has a vcd_rde_behavior_invocation data source. If we uncomment these and do another terraform apply, we should be able to get the Kubeconfig by running

We can save it to a file to start interacting with our cluster and kubectl.

Cluster update

Example use case: we realized that our cluster is too small, so we need to scale it up. We’ll set up 3 worker nodes.

To update it, we need to be sure that it’s in provisioned status. For that, we can use the same mechanism that we used when the cluster creation started:

This should display provisioned. If that is the case, we can proceed with the update.

As with the cluster creation, we first need to understand how the vcd_rde resource works to avoid mistakes, so it is encouraged to check both the vcd_rde documentation and the cluster management guide before proceeding. The important idea is that we must update the input_entity argument with the information that CSE saves in the computed_entity attribute, otherwise, we could break the cluster.

To do that, we can use the following output that will return the computed_entity attribute:

Then we run this command to save it to a file for a better reading:

Let’s open computed.json for inspection. We can easily see that it looks pretty much the same as tkgmcluster.json.template but with the addition of a big "status" object that contains vital information about the cluster. This must be sent back on updates, so we copy the whole "status" object as it is and we place it in the original tkgmcluster.json.template.

After that, we can change worker_machine_count = 1 to worker_machine_count = 3 in the existing terraform.tfvars, to complete the update process with:

Now it is crucial to verify and be sure that the output plan shows that the "status" is being added to the input_entity payload. If that is not the case, we should stop the operation immediately and check what went wrong. If "status" is visible in the plan as being added, you can complete the update operation by writing yes to the prompt.

Cluster deletion

The main idea of deleting a TKGm cluster is that we should not use terraform destroy for that, even if that is the first idea we have in mind. The reason is that the CSE Server creates a lot of elements (VMs, Virtual Services, etc) that would be in an “orphan” state if we just delete the cluster RDE. We need to let the CSE Server do the cleanup for us.

For that matter, the vcd_rde present in 3.11-cluster-creation.tf contains two special arguments, that mimic the deletion option from UI:

To trigger an asynchronous deletion process we should change them to true and execute terraform apply to perform an update. We must also introduce the most recent "status" object to the tkgmcluster.json.template when applying, pretty much like in the update scenario described in the previous section.

Final thoughts

We hope you enjoyed the process of installing CSE 4.1 in your VMware Cloud Director appliance. For a better understanding of the process, please read the existing installation and cluster management guides.