Kubernetes 1.35 release: the stuff that can break your cluster
I’ve watched “minor” Kubernetes upgrades knock perfectly healthy node pools offline. Kubernetes 1.35, released December 17, 2025, has that kind of edge because cgroup v1 nodes stop working under the default kubelet behavior.
What I’d check first (before you read the shiny features)
Do this now. You will thank yourself later when the upgrade window gets tight and someone asks “why are half the nodes NotReady?” A fleet-wide sketch of the first two checks follows the list.
- Check cgroup mode on every node: Run stat -fc %T /sys/fs/cgroup/. If you see tmpfs, that node uses cgroup v1. If you see cgroup2fs, you already run cgroup v2.
- Check your container runtime version: Run crictl version and record the containerd version. If you still run containerd 1.x, schedule time to move. Do not wait until your next Kubernetes upgrade forces it.
- List “weird” kubelet cert setups: If you ever hand-rolled kubelet serving certs, write them down. Future certificate validation hardening tends to break custom naming schemes first.
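If you want to run those first two checks across the whole fleet instead of one node at a time, here is a minimal sketch. It assumes kubectl access plus SSH to every node; node names, SSH user, and options are placeholders for your environment.

```bash
#!/usr/bin/env bash
# Fleet-wide pre-upgrade inventory (sketch, assumes kubectl + SSH access).

# Container runtime and OS image per node, straight from node status.
kubectl get nodes -o custom-columns=NAME:.metadata.name,RUNTIME:.status.nodeInfo.containerRuntimeVersion,OS:.status.nodeInfo.osImage

# cgroup mode per node: tmpfs means cgroup v1, cgroup2fs means cgroup v2.
for node in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
  printf '%s %s\n' "$node" "$(ssh "$node" 'stat -fc %T /sys/fs/cgroup/')"
done
```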
cgroup v1 in 1.35: the default changed, and it bites
Here’s the thing nobody mentions in the celebratory release blogs. Your cluster can look fine in staging because your staging AMI uses cgroup v2, then production falls over because you still have an old node image in one pool.
Kubernetes 1.35 changes kubelet’s default behavior around cgroup v1 via the failCgroupV1 setting. In practical terms, kubelet can refuse to start on cgroup v1 nodes under the default configuration. Treat that as a breaking change, because it is one.
If you are not 100% sure every node runs cgroup v2, do not “just upgrade.” Canary one node pool first.
Some folks try to power through by flipping an opt-out knob and moving on. I get it. But you will still need a real migration plan because the ecosystem keeps dropping v1 support, and Kubernetes follows the ecosystem, not the other way around.
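If you want to see what a given node’s kubelet is actually configured to do, the configz endpoint shows the effective config, and the opt-out knob lives in the kubelet config file. The field name below matches what is described above; confirm it against the 1.35 kubelet configuration reference for your exact version before depending on it.

```bash
# Inspect a node's effective kubelet configuration (assumes jq is installed).
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz" | jq . | grep -i cgroup

# Stopgap only: the opt-out knob in the kubelet config file
# (/var/lib/kubelet/config.yaml on kubeadm-built nodes; the path varies by distro).
# Field name per the 1.35 notes; verify before relying on it. This buys time on
# cgroup v1 nodes, it does not replace the migration to v2.
#
#   apiVersion: kubelet.config.k8s.io/v1beta1
#   kind: KubeletConfiguration
#   failCgroupV1: false
```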
A rollout pattern that usually keeps you out of trouble
- Build a fresh node pool: Use an OS image you know boots with cgroup v2, then join it to the cluster.
- Canary one node: Cordon and drain one old node (commands sketched after this list), move a low-stakes workload, then watch kubelet logs and node readiness for a full business day.
- Scale by pool, not by individual nodes: If the canary behaves, migrate pool-by-pool. If it fails, you want a clean blast radius and a clean rollback.
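Here is roughly what the canary step looks like in commands. Node names are placeholders, and the journalctl line assumes systemd-based nodes you can SSH into.

```bash
# Canary one old node before touching the rest of the pool (sketch).
kubectl cordon <old-node>
kubectl drain <old-node> --ignore-daemonsets --delete-emptydir-data

# Watch where the evicted workload lands and whether the new pool stays Ready.
kubectl get nodes -w

# Tail kubelet on the new cgroup v2 node while the low-stakes workload runs.
ssh <new-node> 'journalctl -u kubelet -f'
```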
In-place Pod resize is GA. Don’t treat that as “safe for everything.”
This feature solves a real pain. Stateful workloads hate restarts, and I’ve seen teams postpone right-sizing for months because every resize meant a bounce and a page to the on-call rotation.
Kubernetes 1.35 graduates in-place Pod resize to stable, which means you can adjust CPU and memory on a running Pod instead of doing the terminate-and-recreate shuffle. That helps, especially for StatefulSets that you do not want to restart during a busy hour.
- Where it shines: A Postgres Pod that hits memory pressure during end-of-month jobs. You bump memory, keep connections, and avoid a restart.
- Where it can hurt: Memory decreases and CPU decreases can expose sloppy limits tuning. If you shrink too far, the kernel will not negotiate. It will kill things.
So. Test it with a workload that screams when it gets tight. I usually start with Redis or Postgres under synthetic load and watch OOMKills and latency like a hawk.
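A minimal sketch of that test loop, assuming a Pod whose container is named db; the pod name, container name, and sizes are placeholders for your workload.

```bash
# Bump memory on a running Pod in place, no restart (resize subresource).
kubectl patch pod <pod> --subresource resize --patch \
  '{"spec":{"containers":[{"name":"db","resources":{"requests":{"memory":"6Gi"},"limits":{"memory":"8Gi"}}}]}}'

# Confirm the kubelet applied it: status shows the values actually in effect.
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].resources}{"\n"}'

# Make sure nothing got bounced or OOMKilled while you were shrinking or growing.
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].restartCount}{"\n"}'
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'
```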
Gang scheduling (alpha): useful, but you’ll probably babysit it
I like the direction here. Distributed training jobs waste money when only half the workers start and the rest sit Pending forever, especially on GPU nodes that cost real dollars per minute.
Kubernetes 1.35 introduces alpha gang scheduling via the new Workload API. You can express “all-or-nothing” placement with a quorum (the draft uses minCount language), so the scheduler holds the group until enough capacity exists.
- Good fit: MPI-style jobs, distributed training workers, anything where partial start equals wasted compute.
- Reality check: It’s alpha. Expect sharp edges, feature gates, and confusing scheduling events until you learn the failure modes.
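Before you build a training pipeline on it, at least confirm the alpha API is actually served in your cluster. A quick probe, assuming the new resource surfaces under a name containing “workload” (check the 1.35 docs and feature gates for the exact group and kind):

```bash
# Is the alpha Workload API even enabled here? (Alpha APIs are off by default.)
kubectl api-resources | grep -i workload

# If the resource is served, explain shows its schema; otherwise this errors.
kubectl explain workload 2>/dev/null || echo "Workload API not served on this cluster"
```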
Other stuff in this release: device scheduling polish, some auth tightening, the usual.
Pod certificates (beta): easier mTLS, fewer moving parts
This one matters for multi-tenant clusters. If you ever had to explain why “workload identity” required three extra controllers and a pile of YAML, you’ll appreciate this direction.
In 1.35, pod certificates move to beta around the PodCertificateRequest flow. The idea stays simple: kubelet generates a keypair, the control plane signs, and the Pod consumes the cert chain via a projected volume. Your app gets short-lived X.509 identity without dragging in a whole extra PKI stack.
I do not trust “known issues: none” from any project. Run this in staging before you bet your service mesh rollout on it.
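To make the shape concrete, here is a sketch of a Pod consuming a pod certificate through a projected volume, plus a look at the request objects. Field names follow the PodCertificateRequest enhancement as I read it and may differ in the 1.35 beta; the image and signer name are placeholders, so check the projected volume reference before copying any of this.

```bash
# Sketch only: field names may differ in the 1.35 beta API.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: mtls-demo
spec:
  containers:
    - name: app
      image: registry.example.com/app:latest      # placeholder image
      volumeMounts:
        - name: workload-identity
          mountPath: /var/run/identity
          readOnly: true
  volumes:
    - name: workload-identity
      projected:
        sources:
          - podCertificate:                        # projected source, per the enhancement
              signerName: example.com/mesh-ca      # placeholder signer
              keyType: ED25519
              credentialBundlePath: credentialbundle.pem
EOF

# The signing flow is observable if you need to debug it.
kubectl get podcertificaterequests -A
```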
What to do this week if you upgrade production Kubernetes
Be paranoid. If your cluster runs revenue traffic, test this twice.
- Inventory cgroup versions: Run the cgroup check across your fleet, not by spot-checking one node you happened to SSH into.
- Plan a node-image refresh: Treat “moving to cgroup v2 everywhere” as a node pool rebuild, not a toggle you flip on Friday afternoon.
- Schedule runtime work: If you still run containerd 1.x, book time to move to containerd 2.x before your next Kubernetes upgrade cycle.
- Pick one feature to test: Either in-place resize for one StatefulSet, or gang scheduling for one training job. Do not try to validate everything at once.
If you’re reading this during a holiday week, bookmark it. Then come back and do the cgroup audit with fresh eyes, because that’s the part that tends to hurt.
Upgrade testing playbook for 1.35
Do not treat minor version upgrades as routine. Every Kubernetes minor release changes at least one thing that surprises someone in production. Here is the minimum testing sequence for 1.35.
- Read the changelog, not the summary: The blog post highlights features. The changelog lists removals. The removals break your deploy. Focus on “Removed” and “Deprecated” sections in the CHANGELOG.md for v1.35.0.
- Stage the control plane first: Upgrade your staging cluster’s control plane to 1.35 while keeping nodes on 1.34. Run your full test suite. Kubelets can lag the control plane by up to three minor versions, so a one-version skew is safe and catches API-level breakage early.
- Upgrade one node pool: After control plane is stable, upgrade a single node pool. Watch for pod scheduling issues, resource request changes, and any new admission webhook rejections. Check kubelet logs for warnings about deprecated flags.
- Run a soak test: Leave the mixed-version cluster running for 48 hours under realistic load. Watch for memory leaks in kubelet, connection resets between nodes, and any increase in pod restart counts (spot checks sketched after this list). If nothing burns for 48 hours, upgrade remaining nodes.
- Verify monitoring and alerts: Kubernetes 1.35 may rename or remove metrics. Check your Grafana dashboards and Prometheus alerting rules. A dashboard that silently stops updating is worse than one that errors — you lose visibility without knowing it.
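A few spot checks worth running during that 48-hour window. The journalctl line assumes systemd nodes and SSH access; the node name is a placeholder.

```bash
# Pods sorted by restart count; a climbing number here is the first red flag.
kubectl get pods -A --sort-by='.status.containerStatuses[0].restartCount'

# Nodes whose status is anything other than plain Ready (includes cordoned ones).
kubectl get nodes --no-headers | awk '$2 != "Ready"'

# Kubelet complaints about deprecated flags or cgroup trouble on a canary node.
ssh <canary-node> 'journalctl -u kubelet --since "1 hour ago" | grep -iE "deprecat|cgroup"'
```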
What to skip in 1.35 and what to adopt immediately
Not every new feature deserves your attention on day one. Here is my take on what matters now versus what can wait.
- Adopt immediately: Any security-related defaults that tightened. If 1.35 makes a previously-optional security feature the default, do not opt out of it. The upstream team made it default because too many clusters were running without it.
- Test carefully: New storage or networking features. These touch the data path. Enable them in staging for a full release cycle before production. StorageClass and CNI changes have historically caused the most upgrade-related incidents.
- Ignore for now: Alpha features, unless you are explicitly testing them for a future workflow. Alpha features can be removed or redesigned without notice. Do not build production workflows on them.
Frequently Asked Questions
- What are the biggest breaking changes in Kubernetes 1.35? Three things will bite most clusters: (1) kubelet now expects cgroup v2 and refuses cgroup v1 nodes by default, (2) in-place pod resize is GA, which changes how resource limits are enforced, and (3) several feature gates that were previously opt-in are now on by default. Check every feature gate you rely on before upgrading.
- Is in-place pod resize in Kubernetes 1.35 safe to use? It’s GA, meaning it passed stability criteria, but “safe” depends on your workload. In-place resize lets you change CPU/memory limits without restarting the pod. It works well for vertical scaling during traffic spikes. Where it gets risky: stateful workloads where the application doesn’t handle resource changes gracefully, and clusters with tight bin-packing where resizing could trigger unexpected evictions.
- How do I prepare my cluster for the Kubernetes 1.35 upgrade? Three steps before you touch production: (1) Run the upgrade in a staging cluster and check kubelet logs for cgroup errors, (2) Audit your feature gates — run kubectl get cm -n kube-system kubelet-config -o yaml and compare against 1.35 defaults, (3) Test all workloads with the new image pull validation since IfNotPresent behavior changed. Give it 48 hours in staging before rolling to production.
- What Kubernetes version should I be running in production right now? As of early 2026, Kubernetes 1.34.x is the safe choice — it’s the latest fully-patched release with no surprise breaking changes. If you’re still on 1.33 or earlier, you’re within EOL window and should upgrade. Only move to 1.35.x if you’ve completed testing and specifically need its features (pod resize, gang scheduling, pod certificates).