Kubernetes Upgrade Checklist (Minor Version): the runbook I wish I had
I’ve watched “small” Kubernetes minor upgrades take down perfectly healthy apps. It usually starts with one removed API, one too-old CNI, or one webhook nobody owns anymore.
Our Methodology
This guide is based on official Kubernetes documentation and release notes, analysis of relevant GitHub commits, hands-on community testing and incident postmortems, vendor compatibility matrices for CNIs/CSIs, and the author’s real-world upgrade experience. We cross-checked behaviors in both kubeadm and managed-service workflows to highlight common failure modes and reliable mitigations.
What a minor upgrade really changes
Minor upgrades change the middle number in x.y.z (for example, 1.33 to 1.34). This is where Kubernetes removes APIs, shifts defaults, and tightens compatibility rules, so treat it like a compatibility audit, not a version bump.
- Do not skip minor versions (especially for the API server): Plan 1.33 to 1.34 to 1.35, not 1.33 to 1.35, unless your provider gives you a supported exception and you can prove it in a canary.
- Respect version skew: Keep kubelet no newer than the API server (it may lag by up to three minors on current releases), keep controller-manager and scheduler aligned to the API server minor, and keep HA API servers within one minor of each other. If you violate skew mid-upgrade, you get the “works, but unstable” cluster that wastes weekends.
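Before you plan the jump, capture the current picture. A minimal sketch with kubectl; the only assumptions are read access to nodes and kube-system, and that control plane components run as kubeadm-managed static pods (which carry the tier=control-plane label):

```bash
# Client and API server versions
kubectl version

# Kubelet version per node (kube-proxy usually tracks it)
kubectl get nodes -o custom-columns='NODE:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion'

# Control plane component images on a kubeadm cluster (static pods carry tier=control-plane)
kubectl -n kube-system get pods -l tier=control-plane \
  -o custom-columns='POD:.metadata.name,IMAGE:.spec.containers[0].image'
```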
The 30-second checklist (print this)
Read this twice. Then start.
- Pick a target minor and its latest patch: Upgrade to the newest patch on your current minor first, then the newest patch on the target minor.
- Audit removed and deprecated APIs: Fix manifests and controllers before touching the control plane.
- Preflight add-on compatibility: Confirm CNI, CSI, CoreDNS, kube-proxy, ingress, and any service mesh support the target minor.
- Upgrade control plane first: Then upgrade nodes, while keeping skew valid.
- Drain before minor kubelet upgrades: Do not “wing it” with in-place node upgrades.
- Validate after each step: Stop early when you see a trend, not after everything breaks.
Step 0: pick rolling vs blue/green (this is a business decision)
Most teams default to rolling because it feels cheaper. That bit me once when an admission webhook started failing only after the API server restarted, and the cluster refused every create request. We ended up doing a blue/green migration anyway, just under pressure.
- Rolling upgrade: Upgrade control plane, then nodes one-by-one or pool-by-pool. Use this when you can tolerate slow, predictable disruption and you have capacity headroom.
- Blue/green upgrade: Bring up a new cluster on the target version, migrate workloads, then cut traffic. Use this when API removals loom, webhooks feel fragile, or your CNI/CSI upgrade needs careful timing.
Phase 1: pre-upgrade planning (the part that saves you)
So. This is where upgrades succeed or fail, and it is mostly boring work.
- Choose your target version while you still have patch support: Kubernetes maintains release branches for only the three most recent minor releases (roughly 14 months of patches each). If you run far behind, you stack risk and reduce your escape routes.
- Read release notes like you are looking for landmines: Scan for API removals, default changes, and “known issues” that mention your CNI, CSI, ingress, or autoscaler.
- Run a removed-API audit before the change window: Hunt in Git (Helm/Kustomize/YAML), then in the live cluster; a minimal audit sketch follows this list. Your first “no matches for kind” error should happen in CI, not at 2 a.m.
- Inventory versions to avoid skew traps: Write down API server, controller-manager, scheduler, kubelets, kube-proxy, and the kubectl version your automation uses.
- Validate add-ons early (CNI first): I do not trust “it probably supports it.” Get the vendor compatibility matrix for your exact CNI/CSI versions.
- Backups plus a real rollback trigger: Snapshot etcd or confirm what rollback even means in your managed service. Then write a stop condition you will obey, like “API 5xx above 1% for 10 minutes” or “new pods cannot get an IP for 5 minutes.”
If you cannot describe your rollback in two sentences, you do not have a rollback plan.
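On the removed-API audit point: here is a minimal sketch, assuming your rendered manifests live in a hypothetical ./rendered-manifests/ directory and using policy/v1beta1 and batch/v1beta1 as stand-ins for whatever the Deprecated API Migration Guide lists for your target minor:

```bash
# 1. Grep rendered manifests (Helm/Kustomize output, raw YAML) for old apiVersions
grep -rnE 'apiVersion: *(policy/v1beta1|batch/v1beta1)' ./rendered-manifests/

# 2. Quick signal (not a full audit): is the cluster still serving beta group/versions?
kubectl get --raw /apis | grep -c v1beta1

# 3. Check a specific resource explicitly (example: PodDisruptionBudgets on policy/v1beta1)
kubectl get poddisruptionbudgets.v1beta1.policy --all-namespaces
```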
Phase 2: execution (kubeadm path)
This path stays boring if you keep it boring. Upgrade one thing at a time and verify after each move.
- Upgrade the first control plane node: Upgrade kubeadm, run kubeadm upgrade plan, then run kubeadm upgrade apply v1.y.z on the primary control plane node.
- Drain, upgrade kubelet and kubectl, restart, uncordon: Run kubectl drain <node> --ignore-daemonsets, upgrade kubelet/kubectl to the target patch, restart kubelet, then kubectl uncordon <node>. The full per-node sequence is sketched after this list.
- Upgrade remaining control plane nodes one-by-one: Do not parallelize control plane upgrades unless you enjoy debugging split-brain behavior.
- Upgrade worker nodes in small batches: Cordon and drain, upgrade kubelet, restart, uncordon, then watch workloads reschedule cleanly before you touch the next batch.
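Here is that per-node sequence as a sketch for a Debian/Ubuntu kubeadm worker, assuming the pkgs.k8s.io apt repository is already configured; the node name and target patch are hypothetical, and apt-mark hold/unhold is omitted for brevity:

```bash
NODE=worker-1        # hypothetical node name
TARGET=1.34.1        # hypothetical target patch release

# From your workstation: move workloads off the node
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data

# On the node: upgrade kubeadm first, then the node config, then kubelet/kubectl
sudo apt-get update
sudo apt-get install -y kubeadm="${TARGET}-*"
sudo kubeadm upgrade node
sudo apt-get install -y kubelet="${TARGET}-*" kubectl="${TARGET}-*"
sudo systemctl daemon-reload && sudo systemctl restart kubelet

# Back on your workstation: let workloads return, then verify before the next node
kubectl uncordon "$NODE"
kubectl get node "$NODE" -o wide
```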
Phase 2: execution (managed Kubernetes path)
Managed does not mean hands-off. It means you outsource the control plane upgrade mechanics, not the blast radius.
- Upgrade the control plane first: Use your provider operation to move the control plane to v1.y.
- Upgrade node pools next: Keep kubelet skew within policy. If you run mixed pools, upgrade the least critical pool first as a canary.
- Upgrade provider add-ons at the right time: Your CNI/CSI/CoreDNS/kube-proxy equivalents often require explicit updates.
- Validate after every pool: Do not wait for “all pools done” to discover DNS broke.
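A minimal per-pool check, assuming a provider node pool label (GKE’s is shown; EKS and AKS use their own keys) and a hypothetical pool name:

```bash
# Confirm every node in the just-upgraded pool is Ready on the new kubelet
kubectl get nodes -l cloud.google.com/gke-nodepool=app-pool -o wide

# Quick DNS sanity check from a throwaway pod before moving to the next pool
kubectl run dns-check --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup kubernetes.default.svc.cluster.local
```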
Phase 3: post-upgrade validation (catch the quiet failures)
Quiet failures hurt more than loud ones. A broken admission webhook can let the cluster look fine until your next deployment.
- Nodes: All nodes show Ready, and versions stay inside the allowed skew boundaries.
- kube-system: DNS, networking, and metrics pods stay stable. Watch for restarts that line up with your upgrade steps.
- Workloads: Critical services pass readiness checks, and your SLO dashboard stays boring.
- API and admission: Create and update a few real objects. Webhook failures often hide until you do a write.
- Events and logs: Search for “no matches for kind” and similar errors that scream “removed API.”
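A few of those checks as commands, as a minimal sketch; the namespace name is a throwaway, used only to push a real write through admission:

```bash
# Nodes: Ready and inside the skew window
kubectl get nodes -o wide

# kube-system: restart counts that line up with your upgrade steps stand out here
kubectl -n kube-system get pods --sort-by='.status.containerStatuses[0].restartCount'

# API and admission: a real write, since webhook failures often hide until a mutation
kubectl create namespace upgrade-smoke && kubectl delete namespace upgrade-smoke

# Events: the classic removed-API symptom
kubectl get events -A | grep -i 'no matches for kind'
```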
What actually breaks (and what it looks like at 2 a.m.)
I don’t trust “known issues: none” from any project. I trust symptoms.
- Removed API usage: Helm upgrades fail, controllers crash-loop, kubectl apply returns “no matches for kind.” Fix the manifests and controller client libraries before you retry the upgrade.
- Version skew violations: Kubelets refuse to register, control plane components behave oddly, and the cluster feels unstable. Stop, step back, and bring versions back into a supported range.
- CNI/CSI mismatch: Pods stick in Pending, nodes show Ready but have no networking, or volumes refuse to attach. Upgrade the add-on to a compatible version, and verify daemonsets roll across every node.
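Some first-response commands for those symptoms, a sketch that assumes your CNI and CSI components run in kube-system (adjust the namespace and the API group to your setup):

```bash
# "No matches for kind": check whether the group/version is still served at all
kubectl api-resources --api-group=policy   # swap in the group the error names

# Pods stuck Pending or without networking: did the CNI daemonset roll on every node?
kubectl -n kube-system get daemonsets -o wide
kubectl get pods -A --field-selector=status.phase=Pending -o wide

# Volumes refusing to attach: CSI driver pods and recent attach failures
kubectl -n kube-system get pods | grep -i csi
kubectl get events -A --field-selector=reason=FailedAttachVolume
```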
FAQ
Can you skip minor versions when upgrading Kubernetes? For kubeadm upgrades, upstream docs say skipping minor versions is unsupported. In managed services, providers sometimes abstract this, but you still pay the compatibility cost, so test like you are skipping even if the UI lets you click through.
Do you upgrade the control plane or worker nodes first? Upgrade the control plane first, then workers. This keeps kubelet and control plane components inside supported skew rules.
Do you need to drain nodes for a minor version upgrade? Drain nodes before minor kubelet upgrades. I’ve seen teams skip drains “just this once,” then spend an hour untangling stuck pods and PDB blocks.
How do I know which APIs are removed in my target version? Use the Kubernetes Deprecated API Migration Guide for the target release, then audit both Git and live objects.
Should I upgrade to the latest patch of my current version first? Yes. It reduces surprise by picking up bug fixes before you change a minor boundary.