Provision RKE Cluster with Rancher (Custom Provider) + Disk with VMware vSphere

Dounpct · 12 min read · Dec 3, 2022

In this article I will show how to create an RKE cluster with Rancher in Custom mode. We will create the VM nodes with Terraform (you can create them by any other method as well), and each node will register itself with Rancher to build the RKE cluster.

Objective

  • provision 4 VMs for the RKE cluster with Terraform (2 masters and 2 workers)
  • set up the RKE cluster with Rancher
  • access RKE from a local machine
  • provision disks with VMware vSphere
  • test with a StatefulSet
  • extend a disk
  • create an additional StorageClass for disk type xfs

Prerequisite

Provision 4 VMs for the RKE cluster with Terraform (2 masters and 2 workers)

git clone https://github.com/dounpct/terraform-vm-rke-custom.git

Fill in the four VM names and IPs for the RKE cluster:

virtual_machines = {
  tt2d-jiw-test-rke-master-c01 = {
    ip = ""
  }
  tt2d-jiw-test-rke-master-c02 = {
    ip = ""
  }
  tt2d-jiw-test-rke-worker-c01 = {
    ip = ""
  }
  tt2d-jiw-test-rke-worker-c02 = {
    ip = ""
  }
}
terraform init
terraform plan
terraform apply
  • we can keep a separate external tfvars file containing secrets (such as vsphere_password, vm_password, and so on) out of version control, then run Terraform with an override var file
  • Example of terraform-rke-custom.tfvars:
vsphere_server      = "10.10.10.1"
vsphere_user = "jiw@vsphere.local"

vsphere_datacenter = "DC-01"
vsphere_datastore = "DS01_PROD"
vsphere_cluster = "D3P-01"
vsphere_pool = "POOL-PROD-01"
vsphere_network = "DVS_PROD_APP_VL001_10.100.100.0"

virtual_template = "tt2d-jiw-test-ubuntu-template-02"
vm_cpu = "4"
vm_memory = "8192"

network_gateway = "10.100.100.1"
network_netmask = "23"
host_domain = "domain.local"

vm_user = "jiw"

vsphere_password = "12345678"
vm_password = "12345678"

virtual_machines = {
  tt2d-jiw-test-rke-master-c01 = {
    ip = "10.100.100.21"
  }
  tt2d-jiw-test-rke-master-c02 = {
    ip = "10.100.100.22"
  }
  tt2d-jiw-test-rke-worker-c01 = {
    ip = "10.100.100.23"
  }
  tt2d-jiw-test-rke-worker-c02 = {
    ip = "10.100.100.24"
  }
}

dns_server_list = ["10.100.100.1","10.100.100.2"]
terraform apply -var-file="/mnt/d/work-github/tfvar-secret/terraform-rke-custom.tfvars"
  • wait for provisioning to finish
  • you can SSH from the Rancher manager to the 4 RKE VMs (no password required)
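A quick way to confirm passwordless SSH from the Rancher manager host is a small loop over the node IPs. This is a minimal sketch using the user and IPs from the example tfvars above; the docker check is an assumption that the VM template already ships Docker, which the custom registration command needs:

for ip in 10.100.100.21 10.100.100.22 10.100.100.23 10.100.100.24; do
  # BatchMode makes ssh fail instead of prompting if key-based auth is not set up
  ssh -o BatchMode=yes jiw@"$ip" 'hostname && docker --version'
done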

Set up RKE cluster with Rancher

  • login to rancher
  • click Cluster Management
  • click Create
  • Custom
  • Cluster Name : jiw-test-rke-uc-01
  • Cloud Provider : External (Out-of-tree)

this prepares the cluster for provisioning disks with VMware vSphere later

if you don’t need vSphere disks on the cluster, set Cloud Provider: None

  • *** change the Kubernetes version to 1.23 *** to support more charts, such as vSphere CSI
  • Next
  • check etcd and Control Plane, then copy the registration command and run it on tt2d-jiw-test-rke-master-c01 and tt2d-jiw-test-rke-master-c02
  • check only Worker, then copy the registration command and run it on tt2d-jiw-test-rke-worker-c01 and tt2d-jiw-test-rke-worker-c02
  • Done
  • wait for the RKE cluster to provision
  • we can watch the logs
  • now you can explore your RKE cluster in the Rancher GUI

Access RKE with a local machine

  • download the KubeConfig
  • copy the file to your working directory
  • access your cluster
export KUBECONFIG=jiw-test-rke-uc-01.yaml
kubectl get node
kubectl get ns
  • when you have many clusters, you may create a script to connect to each cluster
cat > connect-cluster-jiw-test-rke-uc-01.bash <<EOF
SERVER=https://rancher-mgmt01.domain.local/k8s/clusters/c-dscbq
TOKEN=kubeconfig-u-657dnm3zop7d99z:585xxz2mbx6pt75j57vp45hzwprlkblbgtwc2blrzn9fb964wb5s4k
CLUSTER=jiw-test-rke-uc-01

kubectl config set-cluster \$CLUSTER --server=\$SERVER --insecure-skip-tls-verify
kubectl config set-credentials \$CLUSTER --token=\$TOKEN
kubectl config set-context \$CLUSTER --cluster=\$CLUSTER --user=\$CLUSTER
kubectl config use-context \$CLUSTER

EOF
  • SERVER, TOKEN, and CLUSTER can be found in the jiw-test-rke-uc-01.yaml file we have already downloaded (see the snippet below for a quick way to pull them out)
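A minimal sketch for extracting the two values from the downloaded kubeconfig; it assumes the usual single-context file Rancher generates, so adjust the patterns if yours differs:

SERVER=$(awk '/server:/ {print $2; exit}' jiw-test-rke-uc-01.yaml | tr -d '"')
TOKEN=$(awk '/token:/ {print $2; exit}' jiw-test-rke-uc-01.yaml | tr -d '"')
echo "$SERVER"
echo "$TOKEN"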
chmod +x connect-cluster-jiw-test-rke-uc-01.bash
./connect-cluster-jiw-test-rke-uc-01.bash

Provision Disk with VMware vSphere

  • install vSphere CPI → Apps & Marketplace → Charts
  • install
  • check the CPI
kubectl describe nodes | grep "ProviderID"
  • if you don’t get a ProviderID on every node, check the logs; it is usually a connection problem between the CPI and your vSphere vCenter
  • install vSphere CSI → Apps & Marketplace → Charts
  • there is no vSphere CSI chart in Rancher 2.6.9, but there is one in Rancher 2.6.3
  • so I changed the charts repository from release-v2.6 to release-v2.6.2
  • then the vSphere CSI chart appears
  • install CSI chart version 100.0.0 (app version 2.2.0)
  • we can resize disks only on vSphere 7+
  • Storage Policy Name: leave it empty to auto-generate
  • Data Store URL: you can leave it empty and restrict the datastore in the StorageClass instead, or set the restriction here
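If you choose to restrict the datastore at the StorageClass level, a minimal sketch of such a class might look like the following. The class name is hypothetical and the datastore URL is a placeholder; copy the real URL from the datastore summary page in vCenter:

kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: vsphere-csi-sc-datastore-limited   # hypothetical name
provisioner: csi.vsphere.vmware.com
parameters:
  datastoreurl: "ds:///vmfs/volumes/<datastore-id>/"   # placeholder, copy from vCenter
EOF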

Note

  • I have tested CSI chart version 100.2.0; there is a checkbox for Allow Volume Expansion
  • why don’t we see CSI chart version 100.2.0 or above here?

because we may have installed RKE with a newer Kubernetes version (1.24 or later) while Rancher is installed at the stable version 2.6.9

solution: install RKE version v1.23.12-rancher1-1, because the Rancher stable chart is still at v2.6.9 for now and some charts cannot be used when the Kubernetes version is 1.24 or later

refer to: Rancher Cluster (Set up rke cluster)

this is for version: rancher-vsphere-csi:100.3.0+up2.5.1-rancher1

Test with a StatefulSet

  • create a test namespace
kubectl create ns jiw-test
kubectl apply -n jiw-test -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: "nginx"
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi
EOF
  • check that the PVC was created
kubectl get pvc -n jiw-test
  • check the resources and see which worker node the disk is attached to
kubectl get all -n jiw-test -o wide
  • note: this shows the disk attached to tt2d-jiw-test-rke-worker-c02
  • the VM in vSphere will show an additional disk
  • show the disk type (ext4)
kubectl get pv
kubectl get pv/pvc-cea73728-7969-4f4d-be10-c10ce05fb28a -o yaml
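If you only want the filesystem type, a jsonpath query keeps the output short; on CSI-provisioned PVs the field sits under spec.csi.fsType:

kubectl get pv -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.csi.fsType}{"\n"}{end}'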
  • check the current disk size
kubectl exec -it -n jiw-test pod/web-0 -- df -h /usr/share/nginx/html

Extend disk

  • stop the pod that is using the PVC
kubectl scale --replicas=0 statefulset.apps/web -n jiw-test
  • increase the PVC size
kubectl patch pvc/www-web-0 -n jiw-test -p '{"spec": {"resources": {"requests": {"storage": "2Gi"}}}}'
  • if there is no error, it means the PVC can be resized
  • scale the StatefulSet back up
kubectl scale --replicas=1 statefulset.apps/web -n jiw-test
  • check the PVC again; it will show CAPACITY 2Gi
kubectl get pvc -n jiw-test
kubectl exec -it -n jiw-test pod/web-0 -- df -h /usr/share/nginx/html
  • if you get this error:
    Error from server (Forbidden): persistentvolumeclaims "www-web-0" is forbidden: only dynamically provisioned pvc can be resized and the storageclass that provisions the pvc must support resize

it means resizing is not supported by your CSI version, vSphere version, or StorageClass

  • test a new StorageClass that allows volume expansion (a sketch of such a class follows this list)
  • check it from the YAML
  • set the new StorageClass as the default StorageClass
  • create a new StatefulSet with StorageClass: vsphere-csi-sc-extend
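For reference, a minimal sketch of what the expandable class looks like if created from YAML instead of cloning it in the Rancher UI; any fields beyond allowVolumeExpansion should be matched to the class the CSI chart created:

kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: vsphere-csi-sc-extend
provisioner: csi.vsphere.vmware.com   # vSphere CSI provisioner
allowVolumeExpansion: true            # this is what makes the PVC resizable
EOF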
kubectl delete statefulset.apps/web -n jiw-test
kubectl delete pvc/www-web-0 -n jiw-test
kubectl apply -n jiw-test -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: "nginx"
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "vsphere-csi-sc-extend"
      resources:
        requests:
          storage: 1Gi
EOF
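With the expandable StorageClass in place, the offline resize test from the previous section can be repeated and should now go through:

kubectl scale --replicas=0 statefulset.apps/web -n jiw-test
kubectl patch pvc/www-web-0 -n jiw-test -p '{"spec": {"resources": {"requests": {"storage": "2Gi"}}}}'
kubectl scale --replicas=1 statefulset.apps/web -n jiw-test
kubectl get pvc -n jiw-test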

Create StorageClass for disk type xfs

  • clone the StorageClass and add one more parameter (a sketch of the resulting class is shown below):

csi.storage.k8s.io/fstype: xfs
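A minimal sketch of the xfs StorageClass created from YAML instead of the Rancher clone; only the fstype parameter comes from above, the rest mirrors the expandable class and should be matched to your setup:

kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: vsphere-csi-sc-extend-xfs
provisioner: csi.vsphere.vmware.com
allowVolumeExpansion: true
parameters:
  csi.storage.k8s.io/fstype: xfs   # format new volumes as xfs instead of the default ext4
EOF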

  • remove the default flag from this StorageClass if it was copied over (the StatefulSet references it by name)
  • create statefulset
kubectl apply -n jiw-test -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: nginx-xfs
  labels:
    app: nginx-xfs
spec:
  ports:
  - port: 80
    name: web
  selector:
    app: nginx-xfs
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web-xfs
spec:
  serviceName: "nginx-xfs"
  replicas: 1
  selector:
    matchLabels:
      app: nginx-xfs
  template:
    metadata:
      labels:
        app: nginx-xfs
    spec:
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "vsphere-csi-sc-extend-xfs"
      resources:
        requests:
          storage: 1Gi
EOF
  • note: if your PVC cannot create a disk in vCenter

check the StorageClass and try removing the datastoreURL, because the datastore URL we set may not be able to provision volumes as xfs
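To see why provisioning fails, describing the pending PVC and listing recent events usually surfaces the CSI error message; the PVC name follows the claim-statefulset-ordinal pattern:

kubectl describe pvc/www-web-xfs-0 -n jiw-test
kubectl get events -n jiw-test --sort-by=.lastTimestamp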

Extend disk with online extension

Above I showed how to extend a disk with offline expansion, which vSphere 7+ supports. If your version is 7.0 U2 or later, you may be able to use online expansion, which means you don’t need to stop the pod or scale your StatefulSet to 0 before growing the disk.

  • in the CSI install step, check Enable Online Volume Extend
kubectl get pvc -n jiw-test
kubectl get pv
kubectl exec -it -n jiw-test pod/web-0 -- df -h /usr/share/nginx/html

kubectl patch pvc/www-web-0 -n jiw-test -p '{"spec": {"resources": {"requests": {"storage": "5Gi"}}}}'

# wait for the CSI driver to extend the PVC
kubectl get pvc -n jiw-test
kubectl get pv
kubectl exec -it -n jiw-test pod/web-0 -- df -h /usr/share/nginx/html

kubectl describe pvc/www-web-0 -n jiw-test

Troubleshooting

Cannot get ProviderID in the vSphere CPI install step

if you did not set cloud_provider to external when the cluster was first installed and only edit the cluster config later

cloud_provider:
  name: external

you may need to taint all nodes with node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule

#!/bin/bash
# add the uninitialized taint to every node so that the vSphere CPI
# re-initializes them; the CPI removes this taint once it has set the ProviderID
export KUBECONFIG=$1
for node in $(kubectl get nodes | awk '{print $1}' | tail -n +2)
do
  kubectl taint node $node node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
done

after all nodes are tainted with the cloudprovider taint, try uninstalling and installing vSphere CPI again; the taint will then be removed from every node automatically
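To confirm the CPI finished its work, check that the taint is gone and that every node now has a ProviderID (the same check as in the install section):

kubectl describe nodes | grep Taints
kubectl describe nodes | grep "ProviderID"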

Can create PVC/PV but the Pod cannot attach the disk

  • check that the VMs have disk.EnableUUID enabled

if you create the VMs from Terraform, the vsphere_virtual_machine resource needs to have

enable_disk_uuid = true

Credit: TrueDigitalGroup
