Best reliability designs for Google Kubernetes Engine (GKE)
While designing a highly available and resilient GKE application, I came across several configurations that meaningfully improve the reliability of a GKE cluster. For each one, I detail the setting, how to audit its current status, and how to remediate it.
Utilizes Regional Clusters with multiple control plane nodes
My Rationale:
GKE provides two options for control plane (master) management. A zonal cluster has a single control plane in one zone, so it cannot provide zone-level high availability. A regional cluster runs multiple control plane replicas across the zones of a region, which provides higher availability.
For production, it is recommended to use a regional cluster.
Audit:
gcloud container clusters list
gcloud container clusters describe <cluster-name> --region <gcp-region>
Remediation:
gcloud container clusters create my-regional-cluster \
  --region compute-region
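To confirm that the control plane is regional, one option is to pull only the location fields of the cluster; the cluster name and region below are placeholders:
gcloud container clusters describe my-regional-cluster --region compute-region \
  --format="value(location,locations)"
The location value should be a region (no zone suffix), and locations should list the zones used by the cluster.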
Provisions GKE cluster nodes in multiple zones
My Rationale:
When you use a regional GKE cluster, cluster nodes (through node pools) also run by default in each zone where a control plane replica runs. Either accept this default or explicitly specify multiple zones when creating a regional cluster. If you specify zones manually, select at least three.
Audit:
gcloud container clusters list
Check the “LOCATION” field to ensure that multiple zones are available.
Remediation:
gcloud container clusters create my-regional-cluster \
  --region compute-region --node-locations compute-zone,compute-zone,...
Separates workloads into different node pools
My Rationale:
To avoid poor resource utilization, and to keep node upgrades and autoscaling isolated per pool, we recommend separating workloads into different node pools based on their usage patterns or requirements for special hardware (such as TPUs or GPUs). For example, place batch workloads and serving workloads into two different node pools with different node configurations. This separation is typically managed through Kubernetes node taints (with tolerations) and affinity (or anti-affinity); see the sketch after the remediation command below. To prevent tight coupling between workloads and the underlying infrastructure, do not use the node pool name as a node selector. Instead, use selectors that reference labels describing the capabilities of the node pool; these can be GCP-managed labels or custom labels. For more information, see the Choose the right machine type documentation.
Audit:
gcloud container node-pools list --region us-central1 --cluster cluster-name
Remediation:
gcloud container node-pools create pool-name --cluster cluster-name
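A minimal sketch of the taint-and-label approach described above; the pool name, label key, and values here are hypothetical:
gcloud container node-pools create batch-pool --cluster cluster-name \
  --region us-central1 \
  --node-labels=workload-type=batch \
  --node-taints=workload-type=batch:NoSchedule
Workloads then opt in with a matching toleration and a capability-based node selector in the Pod spec:
tolerations:
- key: workload-type
  operator: Equal
  value: batch
  effect: NoSchedule
nodeSelector:
  workload-type: batch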
Utilizes Cluster Nodes Autoscaler
My Rationale:
GKE’s cluster autoscaler automatically resizes node pools based on the demands of the workloads you run. With autoscaling enabled, GKE automatically adds nodes to a node pool when you create Pods that cannot be scheduled for lack of capacity; conversely, if a node is underutilized and its Pods can run on other nodes, GKE can delete that node.
This provides flexibility in the system design. It is recommended to use the Kubernetes Horizontal Pod Autoscaler (HPA) together with GKE cluster autoscaling so that application Pods and cluster nodes scale in concert; see the HPA sketch after the remediation commands below.
If resources are deleted or moved when autoscaling your cluster, your workloads might experience transient disruption. To increase your workload’s tolerance to interruption, deploy it with a controller that manages multiple replicas, such as a Deployment. See details in the Application section.
Audit:
gcloud container node-pools describe node-pool-name \
--zone us-central1-a --cluster cluster-name
Check that the “autoscaling” field shows enabled: true.
Remediation:
Create a new cluster (e.g.):
gcloud container clusters create cluster-name --num-nodes 30 \
  --enable-autoscaling --min-nodes 15 --max-nodes 50 [--zone compute-zone]
Update an existing node pool:
gcloud container clusters update cluster-name --enable-autoscaling \
  --min-nodes 1 --max-nodes 10 --zone compute-zone --node-pool default-pool
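To pair cluster autoscaling with Pod-level scaling, here is a minimal HorizontalPodAutoscaler sketch for a hypothetical Deployment named my-app; the thresholds are illustrative, and older clusters may need the autoscaling/v1 or autoscaling/v2beta2 API instead:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70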
Utilizes Cluster Nodes auto-repair
My Rationale:
GKE’s node auto-repair feature helps you keep the nodes in your cluster in a healthy, running state. When enabled, GKE makes periodic checks on the health state of each node in your cluster. If a node fails consecutive health checks over an extended time period, GKE initiates a repair process for that node.
With GKE auto-repair, a node is recreated and may be attached to a new external IP address (for non-private clusters). This might cause the node to fall off a third-party service’s IP whitelist, so please review your IP whitelisting requirements.
Audit:
gcloud container node-pools describe node-pool-name --zone us-central1-a \
--cluster cluster-name
Check for the “autoRepair” field (true/false).
You can check log entries for automated repair events with:
gcloud container operations list
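To narrow the listing to automated repair events, a filter sketch (the operation type value here is an assumption based on the GKE operations API):
gcloud container operations list --filter="operationType=AUTO_REPAIR_NODES"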
Remediation:
When creating a new cluster:
gcloud container clusters create cluster-name --zone compute-zone \
  --enable-autorepair
When creating a new node pool:
gcloud container node-pools create pool-name --cluster cluster-name \
  --zone compute-zone \
  --enable-autorepair
When updating an existing node pool:
gcloud container node-pools update pool-name --cluster cluster-name \
  --zone compute-zone \
  --enable-autorepair
Utilizes Regional Persistent Disks
My Rationale:
I recommend building highly available applications by using regional persistent disks on GKE. Regional persistent disks provide synchronous replication between two zones, which keeps your application up and running if there is an outage or failure in a single zone.
Audit:
kubectl describe storageclass storageclass-name
Then check that the Parameters field contains “replication-type=regional-pd”.
Remediation:
Create a StorageClass that uses regional persistent disks, for example:
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: repd-west1-a-b-c
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-standard
  replication-type: regional-pd
  zones: us-west1-a, us-west1-b, us-west1-c
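A minimal sketch of a PersistentVolumeClaim that consumes this StorageClass; the claim name and size are illustrative (note that regional standard persistent disks have a larger minimum size than zonal disks):
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: regional-pd-claim
spec:
  storageClassName: repd-west1-a-b-c
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 200Gi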
GKE Networking Utilizes VPC-native cluster
My Rationale:
A GKE cluster that uses alias IP address ranges is called a VPC-native cluster. In a VPC-native cluster, Pod IP address ranges do not depend on custom static routes and do not consume the system-generated or custom static route quota, which reduces the risk of an application outage caused by route quota exhaustion.
Audit:
Run the following command:
gcloud container clusters describe tier-1-cluster --zone us-central1-a | grep useIpAliases
Ensure that the useIpAliases value is true.
Remediation:
When creating a new cluster with user-managed subnet secondary ranges:
gcloud container clusters create cluster-name \
  --region=region \
  --enable-ip-alias \
  --subnetwork=subnet-name \
  --cluster-ipv4-cidr=pod-ip-range \
  --services-ipv4-cidr=services-ip-range
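If you manage the subnet yourself, you can confirm its secondary ranges before creating the cluster; the subnet name below is a placeholder:
gcloud compute networks subnets describe subnet-name --region region
Then check the secondaryIpRanges field for the Pod and Services ranges.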
Applications don’t use naked Pods
My Rationale:
Don’t use naked Pods (that is, Pods not bound to a ReplicaSet or Deployment) if you can avoid it. Naked Pods will not be rescheduled in the event of a node failure.
Audit:
kubectl get pods -n target-namespace -o json | jq '.items[].metadata'
Look for the Pods’ metadata without the “ownerReferences” field.
Or, spot potential naked Pods with:
kubectl get pods -n target-namespace -o json | jq '.items[].metadata.ownerReferences'
Then look for null entries, which indicate naked Pods.
Note: Don’t run this against the GKE system namespaces such as kube-system because static pods may run there.
Remediation:
Advise the development team to use Deployment, ReplicaSet, or StatefulSet controllers instead.
A CI/CD pipeline for GKE components can check for naked Pod definitions by looking for the following in the Kubernetes manifests:
apiVersion: v1
kind: Pod
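A minimal sketch of the preferred alternative: the same container wrapped in a Deployment with multiple replicas; the names and image are illustrative:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: gcr.io/my-project/my-app:1.0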
Application Utilizes Pod anti-affinity
My Rationale:
To avoid a single point of failure, use Pod anti-affinity to instruct Kubernetes NOT to co-locate Pods on the same node. For a stateful application, this can be a crucial configuration, especially if it requires a minimum number of replicas (i.e., a quorum) to run properly.
Audit:
Pod anti-affinity is specified in the podAntiAffinity field under affinity in the PodSpec.
Specifically, check that the anti-affinity topologyKey field is set to either:
topologyKey: topology.kubernetes.io/zone
topologyKey: topology.kubernetes.io/region
Remediation:
Advise the development team to define Pod anti-affinity for critical workloads.
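A minimal sketch of a Pod template fragment that uses Pod anti-affinity to keep replicas out of the same zone; the app label is illustrative:
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: my-stateful-app
      topologyKey: topology.kubernetes.io/zone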
Additional design considerations
For more general recommendations, take a look at the Google Cloud blog post GKE best practices: Designing and building highly available clusters.