
First Steps with TKGm Guest Clusters in VMware Cloud Director 10.3

In the previous article, I explained how to deploy Container Service Extension 3.1 with TKGm Support in VMware Cloud Director 10.3. In this article, I take a look at how the Tanzu Kubernetes Grid Cluster is integrated into the Organization VDC and how the Tenant can access and work with the Kubernetes Cluster.

 

Prerequisites

This guide is from the Tenant's perspective and all actions can be done by the Tenant itself in the VMware Cloud Director Self Service Portal. However, it is required that the Service Provider has enabled "Tanzu Kubernetes Grid" support, as described in the previous article.

To deploy TKGm Clusters, the Tenant needs:

  • An Organization VDC
  • NSX-T backed Edge Gateway
  • TKGm Entitlement Rights Bundle
  • Routed Org Network (Self Service)
  • SNAT for the Routed Org Network to access Cloud Director (see the quick check below)
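
To verify that the SNAT works and VMs in the Routed Org Network can reach Cloud Director, a simple check is to query the public API versions endpoint from a VM in that network (vcloud.virten.lab is the lab endpoint used throughout this article):

# curl -k https://vcloud.virten.lab/api/versions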

Deploy TKGm Kubernetes Cluster

The TKG Cluster is deployed by the Tenant. Log in to Cloud Director, navigate to More > Kubernetes Container Clusters, press NEW, and follow the wizard.

You can add an SSH Key in Step 2. This allows you to log in to Kubernetes Nodes as root.

Currently, only Single Control Plane Clusters are supported.

Do NOT disable Allow external traffic to be routed to this cluster! This feature is required by TKGm. Please be aware that the master node is fully exposed to the external network (Internet).

When you finish the wizard, Cloud Director and CSE deploy and configure the Kubernetes master and worker VMs as a vApp. CSE also creates a DNAT rule using a free IP address from the Edge's IP Address Pool to fully expose the master node:

Please keep in mind that with this NAT configuration, every service of type NodePort you deploy in Kubernetes is available on the Internet.

Get Kubeconfig and Access the Kubernetes Cluster with kubectl

When the cluster has been deployed you can download the kubeconfig from the UI:

You can also get the kubeconfig using the API with the following call:

GET https://[URL]/api/cse/3.0/cluster/[Cluster ID]/config
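
If you want to do this manually, the following curl commands are a rough sketch of what my script does: obtain a bearer token from the Cloud Director session API and use it to call the CSE endpoint. URL, organization, and credentials are the lab examples used below; replace [Cluster ID] with the ID of your cluster.

# curl -sk -X POST -u 'admin@org-fgr:VMware1!' -D - -o /dev/null https://vcloud.virten.lab/cloudapi/1.0.0/sessions | grep -i x-vmware-vcloud-access-token
# curl -sk -H "Authorization: Bearer <token>" "https://vcloud.virten.lab/api/cse/3.0/cluster/[Cluster ID]/config"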

I made a small bash script that automatically pulls the configuration. You can get the script here.

# wget https://raw.githubusercontent.com/fgrehl/virten-scripts/master/bash/vcd_get_kubeconfig.sh
# chmod +x vcd_get_kubeconfig.sh
# ./vcd_get_kubeconfig.sh -u admin@org-fgr -p 'VMware1!' -v vcloud.virten.lab -c k8s-fgr > ~/k8s/k8s-fgr
# export KUBECONFIG=~/k8s/k8s-fgr

As mentioned earlier, the Kubernetes apiserver is exposed to the Internet and the public NAT address is automatically added to the downloaded kubeconfig. You should now be able to access the Cluster with kubectl.

# kubectl get nodes
NAME        STATUS   ROLES                 AGE   VERSION
mstr-sqpd   Ready    control-plane,master  45h   v1.20.5+vmware.2
node-knp3   Ready    <none>                45h   v1.20.5+vmware.2
node-zrce   Ready    <none>                45h   v1.20.5+vmware.2

The first thing to check is if all pods are up and running. All pods should be in Running state!

# kubectl get pods -A
NAMESPACE     NAME                                   READY   STATUS    RESTARTS   AGE
kube-system   antrea-agent-47j2t                     2/2     Running   0          94m
kube-system   antrea-agent-d7dms                     2/2     Running   0          91m
kube-system   antrea-agent-hslbg                     2/2     Running   0          89m
kube-system   antrea-controller-5cd95c574d-5z2np     1/1     Running   0          94m
kube-system   coredns-6598d898cd-9z55d               1/1     Running   0          94m
kube-system   coredns-6598d898cd-gf5qc               1/1     Running   0          94m
kube-system   csi-vcd-controllerplugin-0             3/3     Running   0          94m
kube-system   csi-vcd-nodeplugin-fwfz2               2/2     Running   0          90m
kube-system   csi-vcd-nodeplugin-xjz9l               2/2     Running   0          89m
kube-system   etcd-mstr-tfgn                         1/1     Running   0          94m
kube-system   kube-apiserver-mstr-tfgn               1/1     Running   0          94m
kube-system   kube-controller-manager-mstr-tfgn      1/1     Running   0          94m
kube-system   kube-proxy-8hbvz                       1/1     Running   0          89m
kube-system   kube-proxy-lsqxv                       1/1     Running   0          94m
kube-system   kube-proxy-r4twn                       1/1     Running   0          91m
kube-system   kube-scheduler-mstr-tfgn               1/1     Running   0          94m
kube-system   vmware-cloud-director-ccm-5489b6788c   1/1     Running   0          94m

The vmware-cloud-director-ccm pod is responsible for the communication with Cloud Director. If antrea-controller, coredns, or csi-vcd-controllerplugin-0 are stuck in Pending state, you probably have a communication problem with Cloud Director. At this point, no workload pods can be started because, without the Cloud Provider integration, all nodes are tainted:

# kubectl get events
LAST SEEN    TYPE     REASON           OBJECT                          MESSAGE
20s          Warning  FailedScheduling pod/webserver-559b886555-p4sjc  0/3 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate.
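
To confirm the taints, you can also check the nodes directly with standard kubectl:

# kubectl describe nodes | grep Taints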

If that's the case, you can run a shell on the ccm pod to verify VCD communication:

# kubectl exec -it vmware-cloud-director-ccm-5489b6788c-fc86k -n kube-system -- sh

Or get Logs from the pod:

# kubectl logs vmware-cloud-director-ccm-5489b6788c-fc86k -n kube-system
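
The ccm reads its Cloud Director endpoint and credentials from a ConfigMap and a Secret in the kube-system namespace. The exact object names depend on the CSE/CPI version, so a simple way to locate them is to filter the namespace:

# kubectl -n kube-system get configmaps,secrets | grep -iE "vcloud|ccm"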

Deploy and expose the first Container

As an example, I'm going to deploy an nginx webserver and expose it to the internet using the ALB integration. This is a very simple task in Kubernetes:

# kubectl create deployment webserver --image nginx
# kubectl expose deployment webserver --port=80 --type=LoadBalancer

It should download the nginx image, launch a pod and create a service of type LoadBalancer:

# kubectl get deployments
NAME         READY   UP-TO-DATE   AVAILABLE   AGE
webserver    1/1     1            1           47m

# kubectl get service
NAME         TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)        AGE
kubernetes   ClusterIP      100.64.0.1       <none>         443/TCP        2d1h
webserver    LoadBalancer   100.67.6.30      203.0.113.13   80:32515/TCP   47m

From the output, you can see that the service has automatically been assigned the address 203.0.113.13 from the tenant's IP pool. This address is not configured directly on the ALB load balancer. Every service of type LoadBalancer you deploy uses an internal address from 192.168.8.2-192.168.8.100 to configure the NSX-ALB load balancer and then creates a NAT rule that forwards a public address to the load balancer.
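
Since the public address is simply forwarded to the ALB via NAT, a quick test from any machine on the Internet should return the nginx welcome page (using the EXTERNAL-IP from the output above):

# curl -I http://203.0.113.13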

For more information about the VCD cloud provider, which also includes the NSX-ALB integration, refer to cloud-provider-for-cloud-director.
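
If you prefer a declarative approach, the same webserver can be described in a single manifest. This is a minimal sketch equivalent to the two imperative commands above; the filename is arbitrary.

webserver.yaml

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webserver
spec:
  replicas: 1
  selector:
    matchLabels:
      app: webserver
  template:
    metadata:
      labels:
        app: webserver
    spec:
      containers:
        - name: nginx
          image: nginx
---
apiVersion: v1
kind: Service
metadata:
  name: webserver
spec:
  type: LoadBalancer
  selector:
    app: webserver
  ports:
    - port: 80
      targetPort: 80

Apply it with kubectl apply -f webserver.yaml.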

Deploy Internal Services

Currently, every service you deploy is automatically exposed to the Internet. You do not have an option to influence which IP address is assigned when you deploy a LoadBalancer service. If you want to make a service available to your Org Network only, you have to deploy it as NodePort:

# kubectl expose deployment webserver --port=80 --type=NodePort --name=webserver2

Check the Service to learn the exposed port:

# kubectl get service
NAME         TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)        AGE
webserver2   NodePort       100.71.223.221   <none>         80:30871/TCP   9h

The webserver can now be accessed on any Kubernetes node on port 30871. If you want, you can configure an internal load balancer using the VCD UI; just create a pool with all Kubernetes nodes as pool members. You can quickly get a list of all node addresses with the following command:

# kubectl get nodes -o jsonpath='{ $.items[*].status.addresses[?(@.type=="InternalIP")].address }'
192.168.0.13
192.168.0.15
192.168.0.14
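
From a VM inside the Org Network, the service should now answer on any of these node addresses combined with the NodePort (values taken from the outputs above):

# curl -I http://192.168.0.13:30871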

Persistent Storage with Named Disks

Persistent storage in TKGm uses the Named Disks feature in Cloud Director. The CSI driver documentation is available here. Communication with Cloud Director is already prepared by the cloud provider, so you just have to define a StorageClass, for example with the following YAML file. Please change storageProfile to fit your environment or just use "*" to use the default storage profile.

sc_gold.yaml

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "false"
  name: gold
provisioner: named-disk.csi.cloud-director.vmware.com
reclaimPolicy: Delete
parameters:
  storageProfile: "StorageGold"
  filesystem: "ext4"

Then apply the yaml:

# kubectl apply -f sc_gold.yaml
storageclass.storage.k8s.io/gold created
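
Verify that the new StorageClass has been registered with the expected provisioner:

# kubectl get storageclass gold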

Now create and apply a PVC and a debugger pod that mounts the PVC to /data:
pvc.yaml

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: testpvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: "gold"
---
kind: Pod
apiVersion: v1
metadata:
  name: debugger
spec:
  volumes:
    - name: testpvc
      persistentVolumeClaim:
        claimName: testpvc
  containers:
    - name: debugger
      image: busybox
      command: ['sleep', '3600']
      volumeMounts:
        - mountPath: "/data"
          name: testpvc

You can find the PVC in Cloud Director > Data Centers > Datacenter > Storage > Named Disks

Run a shell in the pod to verify that the PVC has been mounted successfully:

# kubectl exec -it debugger -- sh
/ # df -h /data
Filesystem  Size    Used  Available  Use%  Mounted on
/dev/sdb    975.9M  2.5M  906.2M     0%    /data
/ # touch /data/hello
/ # ls -l /data
total 16
-rw-r--r--  1 root root     0 Nov 21 20:56 hello
drwx------  2 root root 16384 Nov 21 20:51 lost+found
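
When you are done testing, delete the pod and the claim. Because the StorageClass uses reclaimPolicy Delete, this also removes the Named Disk from Cloud Director:

# kubectl delete pod debugger
# kubectl delete pvc testpvc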


 
