
Kubelet Restarts in Kubernetes 1.35.1: Test It Like a P1
I’ve watched a “safe” kubelet restart knock 20% of backends out of a load balancer for 40 seconds.
If your last outage included Pods flipping NotReady during node maintenance, treat Kubernetes 1.35.1 as a time-boxed drill, not a casual patch you “get to” later.
Dates matter more than opinions
Jan 13, 2026 comes fast.
The Kubernetes patch train uses a cherry-pick deadline (Jan 9, 2026) and a target release date (Jan 13, 2026). That means the patch contents stop changing days before you see the bits. Plan as if your favorite fix will miss the train, because it often does.
- What I do in practice: I schedule a 60-minute staging canary window before the cherry-pick deadline, not after. That way, if we find a real regression, we still have time to pick a mitigation that does not involve panic.
- What I do not do: I do not assume “it’s a patch” means “it’s harmless.” Patch releases break things in boring ways, which makes them harder to spot.
The kubelet restart failure mode you actually feel
This bit me when we restarted kubelet during a kernel rollout.
Nothing “crashed,” but Pods flipped NotReady, EndpointSlices churned, and the external load balancer started failing health checks. Users saw 502s. The graphs looked like a saw blade for about a minute.
Kubernetes 1.35 introduced work to reduce readiness chaos during kubelet restarts (KEP-4781). You should not trust it blindly. You should prove it on your stack: your CNI, your ingress, your PDBs, your readiness probes, your autoscalers, and whatever glue you built at 2 AM six months ago.
A reproducible canary test (restart loop)
Do this once.
Pick one canary node that actually serves traffic. Keep load on it. Then restart kubelet repeatedly, because the “one restart on an empty node” test tells you almost nothing.
- Baseline snapshot: Record request error rate at the edge, apiserver latency, Pod readiness transitions, and EndpointSlice update rate for 5 minutes before you touch anything.
- Restart loop: Run sudo systemctl restart kubelet 10 times with a 30-second pause between restarts (scripted after this list). Watch whether Pods go NotReady without container restarts, and whether EndpointSlice updates spike.
- Pass or fail: Decide this up front. For most teams, “no user-visible error spike” beats any internal metric. For critical systems, I require a written go/no-go, because “looked fine” turns into a postmortem sentence.
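Here is a minimal sketch of that drill in bash. The node name, log paths, and the five-minute baseline window are placeholders, and it assumes SSH access with passwordless sudo; your real baseline should come from your edge metrics, not just cluster events.

```bash
#!/usr/bin/env bash
# Restart-loop drill against one canary node. Assumes a kubectl context
# pointed at the cluster and SSH access with passwordless sudo.
set -euo pipefail
NODE="canary-node-1"   # hypothetical canary node name - substitute yours

# Baseline: capture Pod events for 5 minutes before touching anything.
kubectl get events -A --field-selector involvedObject.kind=Pod \
  --watch-only > /tmp/pod-events-baseline.log &
BASELINE_PID=$!
sleep 300
kill "$BASELINE_PID"

# Restart loop: 10 kubelet restarts, 30 seconds apart.
for i in $(seq 1 10); do
  echo "restart $i at $(date -Is)"
  ssh "$NODE" 'sudo systemctl restart kubelet'
  sleep 30
  # Log Pods on the node; NotReady without RESTARTS incrementing is the smell.
  kubectl get pods -A --field-selector "spec.nodeName=$NODE" -o wide \
    | tee -a /tmp/restart-loop.log
done
```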
Do the meaner test (node reboot under load)
Reboot it.
Kubelet restart behavior matters, but node reboots still trigger a different chain of events. Reboot the canary node during steady traffic and watch how quickly your load balancer puts the node back in rotation. If you only test kubelet, you miss the ugly interactions with CNI startup ordering and readiness probes.
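A sketch of the reboot drill, under the same assumptions (placeholder node name, SSH access). It times the NotReady dip and the recovery; the load balancer gap you measure at your edge is the number that actually matters.

```bash
# Reboot drill: time the NotReady dip and the recovery on the canary.
NODE="canary-node-1"   # hypothetical canary node name
ready() {
  kubectl get node "$NODE" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}
ssh "$NODE" 'sudo systemctl reboot' || true   # ssh drops as the node goes down
until [ "$(ready)" != "True" ]; do sleep 5; done   # node status lags the reboot
START=$(date +%s)
until [ "$(ready)" = "True" ]; do sleep 5; done
echo "node Ready again after $(( $(date +%s) - START ))s of NotReady"
# The Ready transition is the floor. The number users feel is when your load
# balancer health checks pass again at the edge - measure that too.
```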
Some folks skip canaries for patch releases. I don’t. I get it if you run dev clusters, but production deserves paranoia.
Identity changes: Pod certificates can step on your existing machinery
The thing nobody mentions is the quiet double-rotation.
If you run a service mesh plus custom certificate tooling, you can end up with two controllers trying to “help” the same Pod. Kubernetes 1.35 includes Pod certificates via PodCertificateRequest (KEP-4317) in beta. Even if you never enable it, test your current flows next to it, because overlaps show up as permission errors, unexpected mounts, or cert files that change twice as fast as your clients expect.
- What to watch: Cert issuance failures, rotation spam, and silent fallbacks where the mesh "works" but uses an older cert than you think (a detection sketch follows this list).
- Quick mitigation: If you see doubled rotation, disable one system in staging and retest. Do not “ship and see” on a Monday.
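One way to catch doubled rotation, assuming your mesh mounts the serving cert at a path you know and the container image ships openssl. POD, NS, and CERT_PATH are hypothetical placeholders; the api-resources check only shows the resource if the beta API is actually served in your cluster.

```bash
# Sample the serial of the cert the Pod actually serves, once a minute.
POD="my-app-0"; NS="prod"; CERT_PATH="/etc/certs/tls.crt"   # placeholders
for i in $(seq 1 60); do
  kubectl exec -n "$NS" "$POD" -- \
    openssl x509 -noout -serial -in "$CERT_PATH" 2>/dev/null
  sleep 60
done | sort | uniq -c
# More distinct serials per hour than your rotation policy allows means two
# controllers are rotating the same identity.

# Also check whether the new API is served in your cluster at all:
kubectl api-resources | grep -i podcertificaterequest || echo "not served"
```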
Runtime reality: containerd v1.x will not stay comfortable
Stop punting this.
Kubernetes 1.35 marks a shift away from containerd v1.x support. If you combine a Kubernetes upgrade, a runtime major upgrade, and your vendor’s packaging quirks in the same week, you will write a long postmortem. I don’t trust “we’ll do it later” as a strategy.
Check what you run on every node before you touch 1.35.1. Then file the runtime migration ticket you have avoided. The rest of this release is routine: dependency bumps, some image updates, the usual.
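The runtime inventory is one standard kubectl query against node status, no assumptions needed:

```bash
# Inventory the container runtime on every node.
kubectl get nodes \
  -o custom-columns='NODE:.metadata.name,RUNTIME:.status.nodeInfo.containerRuntimeVersion'
# Every line that reads containerd://1.x is your migration ticket.
```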
Copy-paste workflow (generic)
Keep it boring.
This stays upstream-generic on purpose. Your distro may wrap these steps, and managed providers will do their own thing, so adapt it; a consolidated copy-paste version follows the list.
- Verify versions: Run kubectl version and kubelet --version and write the output into the change record. (Note: kubectl's old --short flag was removed; short output is now the default.)
- Stage upgrade (kubeadm): Run kubeadm upgrade plan, then kubeadm upgrade apply v1.35.1 in staging first.
- Canary kubelet restart: Restart kubelet on one traffic-serving node. Then do the restart loop and the reboot test.
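The same workflow as one block, assuming a kubeadm-managed cluster and SSH access to nodes; managed providers replace most of this with their own tooling.

```bash
# Generic upgrade drill - adapt for your distro or managed provider.
kubectl version                      # record client/server versions
kubelet --version                    # run on each node via your exec tooling

# Staging first, on kubeadm clusters:
kubeadm upgrade plan
kubeadm upgrade apply v1.35.1

# Then the canary on one traffic-serving node (placeholder name),
# followed by the restart loop and reboot drills from earlier:
ssh canary-node-1 'sudo systemctl restart kubelet'
```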
Red flags I grep for while testing
These show up fast.
I watch events, logs, and client errors in the same window (greps after this list). If you only watch Pods, you miss the customer impact.
- NotReady flaps tied to kubelet restarts: Especially when containers never restarted, because that points at readiness state handling and downstream endpoint churn.
- EndpointSlice spikes and controller retries: Traffic can drop while every Pod looks “Running.” That’s the worst kind of incident.
- CRI warnings about deprecation or losing support: Treat them as a migration ticket, not “log noise.”
- Cert issuance or rotation errors: If you run workload identity tooling, assume a subtle overlap until proven otherwise.
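The greps I keep open during the drill, as one rough starting point; adjust the patterns to the shape of your own logs and events.

```bash
# NotReady flaps and EndpointSlice churn, live:
kubectl get events -A --watch-only | grep -Ei 'notready|endpointslice'

# Deprecation and support warnings in kubelet logs (run on the node):
journalctl -u kubelet --since "1 hour ago" | grep -Ei 'deprecat|unsupport'

# Cert issuance or rotation errors, if you run workload identity tooling:
kubectl get events -A --field-selector type=Warning | grep -i certif
```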
Test it like you mean it.
If staging teaches you something uncomfortable, good. There’s probably a cleaner way to automate all of this, but the manual drill still catches the sharp edges.