Kubernetes v1.35.0-rc.0: test it like you mean it
I do not trust “works in dev” upgrades.
I’ve watched teams “validate” a Kubernetes release candidate by booting a cluster, deploying nginx, and calling it done. Then they hit a real upgrade a week later and everything that matters breaks: GitOps drift, CSI attach timeouts, or a controller that cannot handle a subtle API behavior change.
What v1.35.0-rc.0 is for (and what it isn’t)
This RC exists so you can run your exact upgrade path in a safe place.
That means staging, a canary cluster, or a throwaway environment that looks like prod: same CNI, same CSI driver, same admission stack, same operators, same Pod Security settings, same GitOps tool. If you cannot mirror that, you can still test, but you should lower your expectations and focus on API and manifest compatibility.
- Use it to rehearse: your kubeadm or provider upgrade steps, your node drain strategy, and your rollback plan, including the part where you restore etcd and discover you forgot to test the restore (a kubeadm sketch follows this list).
- Do not use it for “just one tiny change”: RCs can still carry bugs. Treat it like a production change, but execute it in staging.
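If your path is kubeadm, the rehearsal can be as literal as the sketch below. This is a minimal sketch, not your runbook: the target version, node name, and flags are placeholders to adapt, and a managed provider replaces most of it with its own upgrade call.

```bash
# Install the v1.35.0-rc.0 kubeadm binary on the control-plane node first, then:
kubeadm upgrade plan --allow-release-candidate-upgrades
sudo kubeadm upgrade apply v1.35.0-rc.0 --allow-release-candidate-upgrades

# Rehearse the exact drain you plan to run in production, one node at a time.
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# ...upgrade the kubeadm/kubelet packages on that node and restart kubelet...
kubectl uncordon <node-name>
```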
The thing nobody mentions: your automation usually fails first
The scary failures I see during RC testing rarely come from “Kubernetes is broken.” They come from our stuff: a Helm chart that assumes a default API version, a policy webhook with a brittle match rule, a cluster-autoscaler setting that depended on behavior nobody documented, or a homegrown upgrade script that silently ignores a warning.
Ignore the GitHub commit count. It’s a vanity metric. Your risk depends on what changed in the subsystems you rely on.
A staging validation checklist I’d actually run
Start with a clock and a stop button.
Pick a two-hour window, decide what “abort” means before you start, and write it down. If your cluster runs customer traffic, test this twice. If this is a dev platform that you can rebuild from scratch, you can be more casual. Command sketches for most of the checks below follow the list, in the same order.
- Inventory what you run: list your CNI (Calico, Cilium, etc.), CSI driver, Ingress controller, service mesh (if any), and your top 10 operators and CRDs. Those components break upgrades more than core Pods do.
- Take a real backup: snapshot etcd (or follow your provider’s backup path), then do the annoying part: restore it into a scratch control plane to prove you can. This bit me when we “had backups” that nobody had restored in six months.
- Upgrade as a canary first: upgrade control plane, then one worker node pool, then run tests. Keep the rest untouched so you can compare behavior and metrics side-by-side.
- Run basic cluster health checks: confirm nodes go Ready, CoreDNS stays stable, and kube-system Pods do not flap. Watch events while you drain and uncordon.
- Test scheduling and churn: create and delete a few hundred short-lived Pods, scale a Deployment from 2 to 200 and back, and confirm your HPA and PDB behavior still matches what you expect.
- Test storage like you care: create a new PVC, mount it, write data, delete and re-create a Pod, then confirm the data survives. If you run StatefulSets, roll one and verify readiness gates and startup times.
- Test networking end-to-end: pod-to-pod, pod-to-service, pod-to-external, and ingress. If you use NetworkPolicies, verify one “deny” rule and one “allow” rule with real traffic.
- Test admission and auth: apply a manifest that should pass, then apply one that should fail, and confirm you get the failure you expect. Admission regressions feel like “kubectl is haunted” until you look at webhook logs.
- Run your real pipelines: run the GitOps sync, run Helm upgrades, run your operator reconciliation, and watch for drifts and endless reconcile loops.
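For the inventory item, a few read-only commands are enough to force the list into existence; nothing here is specific to this RC.

```bash
# What actually runs on this cluster?
kubectl get nodes -o wide                 # kubelet, OS, kernel, runtime versions
kubectl get pods -n kube-system -o wide   # CNI, CoreDNS, kube-proxy, CSI components
kubectl get daemonsets,deployments -A     # ingress controllers, meshes, agents
kubectl get crds                          # the operators you forgot you installed
```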
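For the backup item, this sketch assumes a kubeadm-style etcd with the default certificate paths; on a managed cluster, use the provider’s snapshot mechanism instead. The restore into a throwaway data directory is the part that matters.

```bash
# Snapshot etcd (paths assume kubeadm defaults; adjust for your topology).
ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-pre-1.35.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Now prove it restores. Newer etcd releases move these subcommands to etcdutl.
ETCDCTL_API=3 etcdctl snapshot status /var/backups/etcd-pre-1.35.db -w table
ETCDCTL_API=3 etcdctl snapshot restore /var/backups/etcd-pre-1.35.db \
  --data-dir=/tmp/etcd-restore-rehearsal
```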
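For the canary and health-check items, this is the loop I keep re-running while a pool drains; the label and node name are placeholders.

```bash
# Watch the canary pool while it drains and upgrades.
kubectl get nodes -l <canary-pool-label> -o wide
kubectl get pods -n kube-system --field-selector spec.nodeName=<canary-node> -o wide

# Events surface flapping and scheduling trouble faster than dashboards.
kubectl get events -A --sort-by=.lastTimestamp | tail -n 30
kubectl get pods -n kube-system --no-headers | grep -vE 'Running|Completed' || echo "kube-system looks clean"

# CoreDNS sanity check from inside the cluster.
kubectl run dns-check --rm -it --restart=Never --image=busybox:1.36 -- nslookup kubernetes.default
```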
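For the scheduling and churn item, a throwaway namespace and the pause image are plenty; 200 replicas is an arbitrary number, size it to your staging cluster.

```bash
kubectl create namespace churn-test
kubectl create deployment churn -n churn-test --image=registry.k8s.io/pause:3.9 --replicas=2
kubectl scale deployment churn -n churn-test --replicas=200
kubectl rollout status deployment/churn -n churn-test --timeout=5m
kubectl scale deployment churn -n churn-test --replicas=2

# While that runs, confirm HPAs and PDBs still describe the behavior you expect.
kubectl get hpa,pdb -A
kubectl delete namespace churn-test
```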
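For the storage item, the smallest test that proves anything: write a file, delete the pod, re-create it, read the file. The manifest below assumes a default StorageClass and uses throwaway names.

```bash
cat > /tmp/pvc-smoke.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rc-pvc-test
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: rc-pvc-writer
spec:
  containers:
  - name: writer
    image: busybox:1.36
    command: ["sleep", "3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: rc-pvc-test
EOF

kubectl apply -f /tmp/pvc-smoke.yaml
kubectl wait pod/rc-pvc-writer --for=condition=Ready --timeout=3m
kubectl exec rc-pvc-writer -- sh -c 'echo survived > /data/proof'
kubectl delete pod rc-pvc-writer
kubectl apply -f /tmp/pvc-smoke.yaml           # re-creates the pod; the PVC is untouched
kubectl wait pod/rc-pvc-writer --for=condition=Ready --timeout=3m
kubectl exec rc-pvc-writer -- cat /data/proof  # should print "survived"
```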
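For the networking item, one deliberately boring web Deployment covers pod-to-service traffic and gives you a target for a deny/allow check. The policy below is a hypothetical default-deny scoped to the test pods, not your real policy.

```bash
kubectl create deployment np-web --image=nginx:1.27
kubectl expose deployment np-web --port=80
kubectl run np-client --rm -it --restart=Never --image=busybox:1.36 -- \
  wget -q -O- -T 5 http://np-web.default.svc.cluster.local

# Deny test: with no ingress rules listed, this policy blocks all ingress to np-web pods.
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-np-web
spec:
  podSelector:
    matchLabels:
      app: np-web
  policyTypes: ["Ingress"]
EOF
# Re-run the wget above: it should now time out. Remove the policy and it should work again.
kubectl delete networkpolicy deny-np-web
kubectl delete deployment,service np-web
```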
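For the admission item, server-side dry run sends a manifest through the full admission chain without persisting anything. The “should fail” example assumes your policy stack rejects privileged containers; swap in whatever yours actually blocks.

```bash
# Goes through authentication, authorization, and admission, but nothing is persisted.
kubectl apply --dry-run=server -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: admission-should-pass
spec:
  containers:
  - name: app
    image: nginx:1.27
EOF

# This one should be rejected if you block privileged containers.
kubectl apply --dry-run=server -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: admission-should-fail
spec:
  containers:
  - name: app
    image: nginx:1.27
    securityContext:
      privileged: true
EOF
```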
Rollback: the part you will wish you practiced
Rollbacks look easy on slides.
In reality, you might roll back the control plane, then discover a node component or an addon already wrote state in a way the old version hates. Storage adds extra spice. Some folks skip rollback tests for patch releases. I don’t, but I get it if you have a rebuildable staging cluster and zero state.
- Define abort signals: repeated kube-apiserver restarts, a flood of admission failures, CSI attach/detach errors that do not clear, or nodes that cannot rejoin after drain (a sketch follows this list).
- Practice at least one rollback path: either restore from snapshot (etcd or provider backup) or rebuild the cluster and restore workloads from GitOps and backups. Pick one. Half-plans fail when you need them.
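Those abort signals are cheap to check from a terminal, so script them into the window instead of trusting yourself to notice. A rough sketch, assuming a kubeadm-style control plane where kube-apiserver runs as labeled static pods:

```bash
# Repeated kube-apiserver restarts?
kubectl get pods -n kube-system -l component=kube-apiserver \
  -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount

# CSI attach/detach errors that do not clear?
kubectl get volumeattachments
kubectl get events -A --field-selector reason=FailedAttachVolume --sort-by=.lastTimestamp | tail -n 20

# Nodes that cannot rejoin after drain?
kubectl get nodes --no-headers | grep NotReady || echo "no NotReady nodes"
```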
How to report something the Kubernetes community can fix
Be boring. Be specific.
If you file “upgrade broke stuff,” nobody can act on it. If you file “v1.35.0-rc.0 + Cilium X.Y + kubeadm upgrade from 1.34.Z causes webhook timeouts, repro in 6 commands,” maintainers can jump in.
- Include versions: control plane version, kubelet version, kubectl version, OS/kernel, CNI/CSI versions, and the exact upgrade path (for example, kubeadm, kOps, or a managed provider).
- Include artifacts: kube-apiserver logs, kubelet logs for one failing node, events from the affected namespace, and the smallest manifest that reproduces the issue (a collection sketch follows this list).
- Link the upstream release notes: start from the v1.35.0-rc.0 GitHub release page and reference any related PRs from CHANGELOG-1.35.md.
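Most of that bundle comes out of a handful of commands. A sketch, assuming a kubeadm-labeled control plane; the namespace is a placeholder and the kubelet line runs on the failing node.

```bash
# Versions and environment.
kubectl version
kubectl get nodes -o wide                 # kubelet, OS, kernel, runtime per node
kubeadm version 2>/dev/null || true       # if you upgrade with kubeadm

# Artifacts for the report.
kubectl logs -n kube-system -l component=kube-apiserver --tail=500 > apiserver.log
kubectl get events -n <affected-namespace> --sort-by=.lastTimestamp > events.txt
journalctl -u kubelet --since "1 hour ago" > kubelet.log   # run this on the failing node
```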
Official release notes
Read the upstream notes before you schedule the test.
Use the official GitHub release page for v1.35.0-rc.0 and the CHANGELOG-1.35.md file for the real list of changes, known issues, and links to PRs. Other stuff in this release: dependency bumps, some image updates, the usual.
If you cannot test your CNI in staging, you should not be running Kubernetes.