Best reliability designs for Google Kubernetes Engine (GKE)
While designing a highly available and resilient GKE application, I came across several configurations that meaningfully improve the reliability of a GKE cluster. For each one, I detail the setting, how to audit its current status, and how to remediate it.
Utilizes Regional Clusters with multiple control plane nodes
My Rationale:
GKE provides two options for control plane (master) management. A zonal cluster has a single control plane in one zone, so it cannot provide zone-level high availability. A regional cluster runs multiple control plane replicas across the zones of a region, which provides higher availability.
For production, it is recommended to use a regional cluster.
Audit:
gcloud container clusters list
gcloud container clusters describe <cluster-name> --region <gcp-region>
Remediation:
gcloud container clusters create my-regional-cluster \
  --region compute-region
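To confirm that the control plane is regional, one option is to pull only the location fields of the cluster; the cluster name and region below are placeholders:
gcloud container clusters describe my-regional-cluster --region compute-region \
  --format="value(location,locations)"
The location value should be a region (no zone suffix), and locations should list the zones used by the cluster.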
Provisions GKE cluster nodes in multiple zones
My Rationale:
When you use a regional GKE cluster, cluster nodes (through node pools) also run by default in each zone where a control plane replica runs. Either accept this default or explicitly specify multiple zones when creating a regional cluster. If you specify zones manually, select at least three.
Audit:
gcloud container clusters list
Check the “LOCATION” field to ensure that multiple zones are available.
Remediation:
gcloud container clusters create my-regional-cluster \
  --region compute-region --node-locations compute-zone,compute-zone,...
Separates workloads into different node pools
My Rationale:
To avoid poor resource utilization, and to keep node upgrades and autoscaling isolated per pool, we recommend separating workloads into different node pools based on their usage patterns or requirements for special hardware (such as TPUs or GPUs). For example, place batch workloads and serving workloads into two different node pools with different node configurations. This separation is typically managed through Kubernetes node taints (with tolerations) and affinity (or anti-affinity); see the sketch after the remediation command below. To prevent tight coupling between workloads and the underlying infrastructure, do not use the node pool name as a node selector. Instead, use selectors that reference labels describing the capabilities of the node pool; these can be GCP-managed labels or custom labels. For more information, see the Choose the right machine type documentation.
Audit:
gcloud container node-pools list --region us-central1 --cluster cluster-name
Remediation:
gcloud container node-pools create pool-name --cluster cluster-name
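A minimal sketch of the taint-and-label approach described above; the pool name, label key, and values here are hypothetical:
gcloud container node-pools create batch-pool --cluster cluster-name \
  --region us-central1 \
  --node-labels=workload-type=batch \
  --node-taints=workload-type=batch:NoSchedule
Workloads then opt in with a matching toleration and a capability-based node selector in the Pod spec:
tolerations:
- key: workload-type
  operator: Equal
  value: batch
  effect: NoSchedule
nodeSelector:
  workload-type: batch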
Utilizes Cluster Nodes Autoscaler
My Rationale:
GKE’s cluster autoscaler automatically resizes node pools based on the demands of the workloads you run. With autoscaling enabled, GKE automatically adds nodes to a node pool when you create Pods that cannot be scheduled for lack of capacity; conversely, if a node is underutilized and its Pods can run on other nodes, GKE can delete that node.
This provides flexibility in the system design. It is recommended to use the Kubernetes Horizontal Pod Autoscaler (HPA) together with GKE cluster autoscaling so that application Pods and cluster nodes scale in concert; see the HPA sketch after the remediation commands below.
If resources are deleted or moved when autoscaling your cluster, your workloads might experience transient disruption. To increase your workload’s tolerance to interruption, deploy it with a controller that manages multiple replicas, such as a Deployment. See details in the Application section.
Audit:
gcloud container node-pools describe node-pool-name \
--zone us-central1-a --cluster cluster-name
Check that the “autoscaling” field shows enabled: true.
Remediation:
Create a new cluster (e.g.):
gcloud container clusters create cluster-name --num-nodes 30 \
  --enable-autoscaling --min-nodes 15 --max-nodes 50 [--zone compute-zone]
Update an existing node pool:
gcloud container clusters update cluster-name --enable-autoscaling \
  --min-nodes 1 --max-nodes 10 --zone compute-zone --node-pool default-pool
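To pair cluster autoscaling with Pod-level scaling, here is a minimal HorizontalPodAutoscaler sketch for a hypothetical Deployment named my-app; the thresholds are illustrative, and older clusters may need the autoscaling/v1 or autoscaling/v2beta2 API instead:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70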
Utilizes Cluster Nodes auto-repair
My Rationale:
GKE’s node auto-repair feature helps you keep the nodes in your cluster in a healthy, running state. When enabled, GKE makes periodic checks on the health state of each node in your cluster. If a node fails consecutive health checks over an extended time period, GKE initiates a repair process for that node.
With GKE auto-repair, a node is recreated and may be attached to a new external IP address (for non-private clusters). This might cause the node to fall off a third-party service’s IP whitelist, so please review your IP whitelisting requirements.
Audit:
gcloud container node-pools describe node-pool-name --zone us-central1-a \
--cluster cluster-name
Check for the “autoRepair” field (true/false).
You can check log entries for automated repair events with:
gcloud container operations list
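To narrow the listing to automated repair events, a filter sketch (the operation type value here is an assumption based on the GKE operations API):
gcloud container operations list --filter="operationType=AUTO_REPAIR_NODES"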
Remediation:
When creating a new cluster:
gcloud container clusters create cluster-name --zone compute-zone \
  --enable-autorepair
When creating a new node pool:
gcloud container node-pools create pool-name --cluster cluster-name \
  --zone compute-zone \
  --enable-autorepair
When updating an existing node pool:
gcloud container node-pools update pool-name --cluster cluster-name \
  --zone compute-zone \
  --enable-autorepair
Utilizes Regional Persistent Disks
My Rationale:
I recommend building highly available applications by using regional persistent disks on GKE. Regional persistent disks provide synchronous replication between two zones, which keeps your application up and running if there is an outage or failure in a single zone.
Audit:
kubectl describe storageclass storageclass-name
Then check that the Parameters field contains “replication-type=regional-pd”.
Remediation:
Create a StorageClass that uses regional persistent disks, for example:
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: repd-west1-a-b-c
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-standard
  replication-type: regional-pd
  zones: us-west1-a, us-west1-b, us-west1-c
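A minimal sketch of a PersistentVolumeClaim that consumes this StorageClass; the claim name and size are illustrative (note that regional standard persistent disks have a larger minimum size than zonal disks):
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: regional-pd-claim
spec:
  storageClassName: repd-west1-a-b-c
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 200Gi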
GKE Networking Utilizes VPC-native cluster
My Rationale:
A GKE cluster that uses alias IP address ranges is called a VPC-native cluster. In a VPC-native cluster, Pod IP address ranges do not depend on custom static routes and do not consume the system-generated or custom static route quota, which reduces the risk of an application outage caused by route quota exhaustion.
Audit:
Run the following command:
gcloud container clusters describe tier-1-cluster --zone us-central1-a | grep useIpAliases
Ensure that the useIpAliases value is true.
Remediation:
When creating a new cluster with user-managed subnet secondary ranges:
gcloud container clusters create cluster-name \
  --region=region \
  --enable-ip-alias \
  --subnetwork=subnet-name \
  --cluster-ipv4-cidr=pod-ip-range \
  --services-ipv4-cidr=services-ip-range
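If you manage the subnet yourself, you can confirm its secondary ranges before creating the cluster; the subnet name below is a placeholder:
gcloud compute networks subnets describe subnet-name --region region
Then check the secondaryIpRanges field for the Pod and Services ranges.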
Applications don’t use naked Pods
My Rationale:
Don’t use naked Pods (that is, Pods not bound to a ReplicaSet or Deployment) if you can avoid it. Naked Pods will not be rescheduled in the event of a node failure.
Audit:
kubectl get pods -n target-namespace -o json | jq '.items[].metadata'
Look for the Pods’ metadata without the “ownerReferences” field.
Or, spot potential naked Pods with:
kubectl get pods -n target-namespace -o json | jq '.items[].metadata.ownerReferences'
Then look for null entries, which indicate naked Pods.
Note: Don’t run this against the GKE system namespaces such as kube-system because static pods may run there.
Remediation:
Advise the development team to use Deployment, ReplicaSet, or StatefulSet controllers instead.
A CI/CD pipeline for GKE components can check for naked Pod definitions by looking for the following in the Kubernetes manifests:
apiVersion: v1
kind: Pod
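A minimal sketch of the preferred alternative: the same container wrapped in a Deployment with multiple replicas; the names and image are illustrative:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: gcr.io/my-project/my-app:1.0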
Application Utilizes Pod anti-affinity
My Rationale:
To avoid a single point of failure, use Pod anti-affinity to instruct Kubernetes NOT to co-locate Pods on the same node. For a stateful application, this can be a crucial configuration, especially if it requires a minimum number of replicas (i.e., a quorum) to run properly.
Audit:
Pod anti-affinity is specified in the podAntiAffinity field under affinity in the PodSpec.
Specifically, check that the anti-affinity topologyKey field is set to either:
topologyKey: topology.kubernetes.io/zone
topologyKey: topology.kubernetes.io/region
Remediation:
Advise the development team to define Pod anti-affinity for critical workloads.
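A minimal sketch of a Pod template fragment that uses Pod anti-affinity to keep replicas out of the same zone; the app label is illustrative:
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: my-stateful-app
      topologyKey: topology.kubernetes.io/zone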
Additional design considerations
For more general recommendations, take a look at the Google Cloud blog post GKE best practices: Designing and building highly available clusters.