Kubernetes Operators Explained: What They Are, How They Work, and How to Build One

February 16, 2026

What Is a Kubernetes Operator?

A Kubernetes Operator is a method of packaging, deploying, and managing a Kubernetes application using custom resources and custom controllers. In concrete terms, an operator is a controller that watches a Custom Resource Definition (CRD) and continuously works to make the actual state of the cluster match the desired state declared in that custom resource.

The idea is straightforward: take the operational knowledge that a human engineer would use to run a complex application — how to deploy it, how to scale it, how to recover from failures, how to handle upgrades — and encode that knowledge into software that runs inside the cluster. Instead of writing runbooks and hoping someone follows them at 3 AM, you write a controller that handles it automatically.

Operators extend the Kubernetes API itself. When you install an operator, you get new resource types (like EtcdCluster or PostgresCluster) that you can manage with kubectl just like built-in resources such as Deployments and Services. The operator watches these custom resources, reacts to changes, and takes whatever actions are necessary to maintain the desired state.

If you are running Kubernetes in production — and the adoption numbers suggest most organizations are — understanding operators is essential. They are the standard mechanism for running stateful and complex workloads on Kubernetes, from databases and message queues to monitoring stacks and certificate management.

Origin of the Operator Pattern

The Operator pattern originated at CoreOS in November 2016, when Brandon Philips and the CoreOS engineering team published a blog post titled “Introducing Operators” that laid out the concept: use the Kubernetes control loop mechanism to automate the management of complex, stateful applications.

The problem they were solving was real. Kubernetes was already good at running stateless applications — deploy a container, scale it horizontally, replace failed instances. But stateful applications like databases, distributed key-value stores, and monitoring systems required specialized knowledge to operate. Scaling an etcd cluster, for example, is not the same as scaling a web server. You need to add new members in a specific order, update peer URLs, handle quorum implications, and manage data migration. These are operational tasks that Kubernetes Deployments and StatefulSets alone cannot handle.

CoreOS shipped two operators alongside the announcement: the etcd Operator for managing etcd clusters and the Prometheus Operator for managing Prometheus monitoring deployments. Both demonstrated the core value proposition — instead of following a multi-step runbook to scale your etcd cluster, you edit a single field in a YAML file and the operator handles the rest.

CoreOS was acquired by Red Hat in 2018, and Red Hat was subsequently acquired by IBM. The operator concept survived these transitions and became a cornerstone of the OpenShift platform. More importantly, the broader Kubernetes community adopted operators as the standard pattern for running complex workloads. The original CoreOS blog posts that introduced the concept have since gone offline, but the pattern they described is now documented in the official Kubernetes documentation and used by hundreds of open source projects.

The Problem Operators Solve

To understand why operators exist, consider what it takes to run a PostgreSQL database cluster on Kubernetes without one.

You need to:

  1. Deploy the primary instance with the correct storage configuration and initialization scripts.
  2. Set up streaming replication to one or more standby instances.
  3. Configure connection pooling (PgBouncer or similar).
  4. Manage automated backups to object storage on a schedule.
  5. Handle failover when the primary goes down — promote a standby, reconfigure replication, update connection endpoints.
  6. Perform minor version upgrades with rolling restarts in the correct order (standbys first, then primary).
  7. Perform major version upgrades with pg_upgrade, which requires a completely different procedure.
  8. Monitor replication lag, connection counts, and query performance.
  9. Scale read replicas based on load.

Each of these tasks requires domain-specific knowledge about PostgreSQL. Kubernetes knows nothing about replication lag or WAL archiving. A StatefulSet can create pods with persistent storage and stable network identities, but it cannot orchestrate a failover or run a backup.

This is the gap that operators fill. A PostgreSQL operator (such as CloudNativePG, Crunchy PGO, or Zalando’s postgres-operator) encodes all of this operational knowledge into a controller. You declare what you want in a PostgresCluster custom resource, and the operator handles the how.

How Operators Work: Custom Resources + Custom Controllers

Every operator consists of two components working together: a Custom Resource Definition (CRD) that extends the Kubernetes API with new resource types, and a custom controller that watches those resources and acts on them.

Custom Resource Definitions (CRDs)

A CRD tells the Kubernetes API server about a new resource type. Once you apply a CRD to a cluster, you can create, read, update, and delete instances of that resource type just like any built-in resource.

Here is a simplified CRD that defines an EtcdCluster resource:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: etcdclusters.etcd.database.coreos.com
spec:
  group: etcd.database.coreos.com
  names:
    kind: EtcdCluster
    listKind: EtcdClusterList
    plural: etcdclusters
    singular: etcdcluster
    shortNames:
      - etcd
  scope: Namespaced
  versions:
    - name: v1beta2
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                size:
                  type: integer
                  minimum: 1
                version:
                  type: string

After applying this CRD, users can create EtcdCluster resources:

apiVersion: etcd.database.coreos.com/v1beta2
kind: EtcdCluster
metadata:
  name: my-etcd-cluster
  namespace: default
spec:
  size: 3
  version: "3.5.17"

The CRD alone does nothing. The Kubernetes API server will store this resource, but nothing happens until a controller is watching for it.

The Custom Controller (Reconciliation Loop)

The controller is the brain of the operator. It runs inside the cluster (typically as a Deployment) and implements a reconciliation loop — the same control loop pattern that drives every built-in Kubernetes controller.

The reconciliation loop works like this:

  1. Observe — Watch the API server for changes to the custom resource (and any related resources like Pods, Services, or ConfigMaps).
  2. Analyze — Compare the desired state (what the custom resource spec says) with the actual state (what currently exists in the cluster).
  3. Act — Take whatever actions are needed to make the actual state match the desired state. Create pods, update configurations, trigger failovers, run backups.

This is a level-triggered system, not edge-triggered. Watch events may trigger a reconciliation, but the controller’s logic never depends on the content of any individual event (“a pod was deleted”). Instead, it reacts to the current state as a whole (“the spec says 3 replicas but only 2 exist”). This makes operators inherently resilient to missed events, network partitions, and controller restarts — the next reconciliation will always pick up where things left off.

Here is a minimal reconciliation function in Go using the controller-runtime library:

func (r *EtcdClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    log := log.FromContext(ctx)

    // Fetch the EtcdCluster resource
    var cluster etcdv1beta2.EtcdCluster
    if err := r.Get(ctx, req.NamespacedName, &cluster); err != nil {
        // Resource was deleted; nothing to do
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // List existing pods for this cluster
    var pods corev1.PodList
    if err := r.List(ctx, &pods, client.InNamespace(req.Namespace),
        client.MatchingLabels{"app": "etcd", "cluster": cluster.Name}); err != nil {
        return ctrl.Result{}, err
    }

    currentSize := len(pods.Items)
    desiredSize := cluster.Spec.Size

    // Scale up: create new members
    if currentSize < desiredSize {
        log.Info("Scaling up", "current", currentSize, "desired", desiredSize)
        if err := r.addMember(ctx, &cluster, currentSize); err != nil {
            return ctrl.Result{}, err
        }
        // Requeue to continue scaling one member at a time
        return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
    }

    // Scale down: remove excess members
    if currentSize > desiredSize {
        log.Info("Scaling down", "current", currentSize, "desired", desiredSize)
        if err := r.removeMember(ctx, &cluster, pods.Items[currentSize-1]); err != nil {
            return ctrl.Result{}, err
        }
        return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
    }

    // Update status
    cluster.Status.ReadyMembers = countReadyPods(pods.Items)
    if err := r.Status().Update(ctx, &cluster); err != nil {
        return ctrl.Result{}, err
    }

    return ctrl.Result{}, nil
}

The key thing to notice: this function is idempotent. You can call it a hundred times and it will always converge toward the desired state without creating duplicates or causing conflicts. This is the fundamental design principle of Kubernetes controllers.

The Operator Framework

Building an operator from scratch requires significant boilerplate: setting up the project structure, generating CRD manifests, wiring up the controller to the manager, handling RBAC, and packaging everything for deployment. The Operator Framework, originally created by CoreOS and now maintained by Red Hat and the community, provides tools to streamline this process.

Operator SDK

The Operator SDK (currently at v1.42.x) is the primary tool for building operators. It is built on top of Kubebuilder and the controller-runtime library, and supports three approaches for building operators:

  • Go-based operators — Write the controller in Go using controller-runtime. This gives you full control and is the standard choice for production operators that need complex logic.
  • Ansible-based operators — Write reconciliation logic as Ansible playbooks and roles. The SDK provides a proxy that translates Kubernetes events into Ansible runs. Good for teams with existing Ansible expertise.
  • Helm-based operators — Wrap an existing Helm chart as an operator. The SDK handles rendering the chart with values from the custom resource spec. This is the fastest path from “I have a Helm chart” to “I have an operator,” though it limits what operational logic you can encode.

Scaffolding a new Go-based operator project takes a single command:

# Initialize a new operator project
operator-sdk init --domain example.com --repo github.com/example/my-operator

# Create a new API (CRD + controller)
operator-sdk create api --group cache --version v1alpha1 --kind Memcached --resource --controller

This generates the entire project structure: CRD definitions, controller skeleton, RBAC manifests, Dockerfile, Makefile, and test scaffolding. You fill in the reconciliation logic and the CRD spec, and the SDK handles the rest.

Operator Lifecycle Manager (OLM)

The Operator Lifecycle Manager (OLM) handles the installation, upgrade, and lifecycle management of operators themselves. Think of it as a package manager for operators.

OLM provides:

  • Dependency resolution — If your operator depends on cert-manager, OLM ensures cert-manager is installed first.
  • Update channels — Operators can publish to stable, fast, and candidate channels, similar to how Kubernetes itself manages releases (see the Kubernetes support and EOL policy for how version lifecycles work).
  • RBAC scoping — OLM ensures operators only get the permissions they declare.
  • CRD upgrade safety — OLM validates that CRD schema changes are backward compatible before applying them.

The community is currently transitioning to OLM v1, which simplifies the API surface with the new ClusterExtension resource, replaces CRD-based catalogs with a RESTful API for better performance, and adds support for Helm charts and plain Kubernetes manifests alongside traditional operator bundles. OLM v0 remains in maintenance mode.

OperatorHub

OperatorHub.io is the public registry for Kubernetes operators. It hosts hundreds of community and vendor operators, each with metadata about supported Kubernetes versions, maturity level, and capabilities. If you are looking for an operator for a specific workload, OperatorHub is the first place to check.

Real-World Operator Examples

The best way to understand the value of operators is to look at ones that are widely deployed in production.

Prometheus Operator (kube-prometheus-stack)

The Prometheus Operator was one of the two original operators shipped by CoreOS. It introduces CRDs like Prometheus, ServiceMonitor, PodMonitor, and AlertmanagerConfig that let you declaratively define your monitoring stack.

Instead of manually editing Prometheus configuration files and reloading the server, you create a ServiceMonitor resource that tells Prometheus which services to scrape:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics

The operator watches for ServiceMonitor resources, generates the Prometheus scrape configuration, and reloads Prometheus automatically. This is the standard way to configure monitoring in Kubernetes clusters today.

cert-manager

cert-manager (currently at v1.19.x) automates the management of TLS certificates in Kubernetes. It introduces CRDs like Certificate, Issuer, ClusterIssuer, and CertificateRequest. You declare what certificates you need, and cert-manager handles the entire lifecycle: issuing, renewing, and rotating certificates from sources like Let’s Encrypt (ACME), HashiCorp Vault, or internal CAs.

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-app-tls
  namespace: default
spec:
  secretName: my-app-tls-secret
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - app.example.com
    - www.example.com
  duration: 2160h    # 90 days
  renewBefore: 360h  # Renew 15 days before expiry

cert-manager is arguably the most widely deployed operator in the Kubernetes ecosystem. It solves a problem that every production cluster has (TLS certificate management) and does it in a way that is both declarative and automated.

CloudNativePG

CloudNativePG is a PostgreSQL operator that manages the full lifecycle of PostgreSQL clusters: provisioning, high availability with automated failover, continuous backup to object storage, point-in-time recovery, rolling updates, and monitoring integration. It demonstrates the full potential of the operator pattern for stateful workloads — managing a production PostgreSQL cluster becomes a matter of declaring a Cluster resource rather than writing shell scripts.

Strimzi (Apache Kafka Operator)

Strimzi manages Apache Kafka clusters on Kubernetes. Running Kafka on Kubernetes is notoriously complex — broker configuration, ZooKeeper management (or KRaft mode), topic management, user authentication, rolling upgrades with zero message loss. Strimzi encodes all of this into operators that manage Kafka, KafkaTopic, KafkaUser, and KafkaConnect resources.

Building a Basic Operator with Operator SDK

Here is a step-by-step walkthrough of building a simple operator that manages a Memcached deployment. This uses the Go-based approach with Operator SDK v1.42.x and targets Kubernetes 1.35 or later.

Prerequisites

  • Go 1.23 or later
  • Operator SDK CLI (v1.42+)
  • A Kubernetes cluster (kind, minikube, or a real cluster)
  • kubectl configured to talk to your cluster

Step 1: Scaffold the Project

mkdir memcached-operator && cd memcached-operator
operator-sdk init --domain example.com --repo github.com/example/memcached-operator
operator-sdk create api --group cache --version v1alpha1 --kind Memcached --resource --controller

This creates the full project layout:

memcached-operator/
├── api/
│   └── v1alpha1/
│       ├── memcached_types.go    # CRD type definitions
│       └── zz_generated.deepcopy.go
├── cmd/
│   └── main.go                   # Operator entry point
├── config/
│   ├── crd/                      # Generated CRD manifests
│   ├── manager/                  # Operator Deployment manifest
│   ├── rbac/                     # RBAC rules
│   └── samples/                  # Example CR
├── internal/
│   └── controller/
│       ├── memcached_controller.go       # Controller logic
│       └── memcached_controller_test.go  # Controller tests
├── Dockerfile
├── Makefile
└── go.mod

Step 2: Define the API

Edit api/v1alpha1/memcached_types.go to define the spec and status of your custom resource:

package v1alpha1

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// MemcachedSpec defines the desired state of Memcached
type MemcachedSpec struct {
    // Size is the number of Memcached pods to run
    // +kubebuilder:validation:Minimum=1
    // +kubebuilder:validation:Maximum=10
    Size int32 `json:"size"`

    // MemoryLimit is the maximum memory for each Memcached instance
    // +kubebuilder:default="64m"
    MemoryLimit string `json:"memoryLimit,omitempty"`
}

// MemcachedStatus defines the observed state of Memcached
type MemcachedStatus struct {
    // ReadyReplicas is the number of pods in Ready state
    ReadyReplicas int32 `json:"readyReplicas"`

    // Conditions represent the latest available observations
    Conditions []metav1.Condition `json:"conditions,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:printcolumn:name="Size",type=integer,JSONPath=`.spec.size`
// +kubebuilder:printcolumn:name="Ready",type=integer,JSONPath=`.status.readyReplicas`
type Memcached struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   MemcachedSpec   `json:"spec,omitempty"`
    Status MemcachedStatus `json:"status,omitempty"`
}

// +kubebuilder:object:root=true
type MemcachedList struct {
    metav1.TypeMeta `json:",inline"`
    metav1.ListMeta `json:"metadata,omitempty"`
    Items           []Memcached `json:"items"`
}

After editing the types, regenerate the CRD manifests and deep copy functions:

make manifests generate

Step 3: Implement the Controller

Edit internal/controller/memcached_controller.go with the reconciliation logic:

package controller

import (
    "context"
    "fmt"

    appsv1 "k8s.io/api/apps/v1"
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/apimachinery/pkg/types"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/log"

    cachev1alpha1 "github.com/example/memcached-operator/api/v1alpha1"
)

type MemcachedReconciler struct {
    client.Client
    Scheme *runtime.Scheme
}

// +kubebuilder:rbac:groups=cache.example.com,resources=memcacheds,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=cache.example.com,resources=memcacheds/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=core,resources=pods,verbs=get;list;watch

func (r *MemcachedReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    log := log.FromContext(ctx)

    // Fetch the Memcached instance
    memcached := &cachev1alpha1.Memcached{}
    err := r.Get(ctx, req.NamespacedName, memcached)
    if err != nil {
        if errors.IsNotFound(err) {
            log.Info("Memcached resource not found; ignoring since it must have been deleted")
            return ctrl.Result{}, nil
        }
        return ctrl.Result{}, err
    }

    // Check if a Deployment already exists; if not, create one
    found := &appsv1.Deployment{}
    err = r.Get(ctx, types.NamespacedName{Name: memcached.Name, Namespace: memcached.Namespace}, found)
    if err != nil && errors.IsNotFound(err) {
        dep := r.deploymentForMemcached(memcached)
        log.Info("Creating Deployment", "name", dep.Name)
        if err = r.Create(ctx, dep); err != nil {
            return ctrl.Result{}, err
        }
        return ctrl.Result{Requeue: true}, nil
    } else if err != nil {
        return ctrl.Result{}, err
    }

    // Ensure the Deployment replica count matches the spec
    size := memcached.Spec.Size
    if *found.Spec.Replicas != size {
        log.Info("Updating Deployment replicas", "from", *found.Spec.Replicas, "to", size)
        found.Spec.Replicas = &size
        if err = r.Update(ctx, found); err != nil {
            return ctrl.Result{}, err
        }
        return ctrl.Result{Requeue: true}, nil
    }

    // Update status with the number of ready replicas
    memcached.Status.ReadyReplicas = found.Status.ReadyReplicas
    if err := r.Status().Update(ctx, memcached); err != nil {
        return ctrl.Result{}, err
    }

    return ctrl.Result{}, nil
}

func (r *MemcachedReconciler) deploymentForMemcached(m *cachev1alpha1.Memcached) *appsv1.Deployment {
    labels := map[string]string{"app": "memcached", "memcached_cr": m.Name}
    replicas := m.Spec.Size

    dep := &appsv1.Deployment{
        ObjectMeta: metav1.ObjectMeta{
            Name:      m.Name,
            Namespace: m.Namespace,
        },
        Spec: appsv1.DeploymentSpec{
            Replicas: &replicas,
            Selector: &metav1.LabelSelector{
                MatchLabels: labels,
            },
            Template: corev1.PodTemplateSpec{
                ObjectMeta: metav1.ObjectMeta{
                    Labels: labels,
                },
                Spec: corev1.PodSpec{
                    Containers: []corev1.Container{{
                        Name:    "memcached",
                        Image:   "memcached:1.6-alpine",
                        Command: []string{"memcached", "-m", m.Spec.MemoryLimit},
                        Ports: []corev1.ContainerPort{{
                            ContainerPort: 11211,
                            Name:          "memcached",
                        }},
                    }},
                },
            },
        },
    }
    // Set the Memcached instance as the owner of the Deployment.
    // This ensures the Deployment is garbage collected when the CR is deleted.
    ctrl.SetControllerReference(m, dep, r.Scheme)
    return dep
}

func (r *MemcachedReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&cachev1alpha1.Memcached{}).
        Owns(&appsv1.Deployment{}).
        Complete(r)
}

Key patterns to note in this code:

  • Owner references — The operator sets the Memcached CR as the owner of the Deployment. When the CR is deleted, Kubernetes garbage collection automatically cleans up the Deployment and its Pods.
  • Idempotent reconciliation — Every call to Reconcile checks the current state and only takes action if there is drift. If you call it when everything is already aligned, it does nothing.
  • Requeue — After making a change, the controller returns Requeue: true to trigger another reconciliation. This allows the controller to verify its change took effect.
  • Status subresource — The controller updates the .status field separately from the spec, using the status subresource. This prevents conflicts between users editing the spec and the controller updating status. A sketch of maintaining status conditions follows this list.
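
The scaffolded types above also define a Conditions field that the walkthrough controller never populates. Here is a minimal sketch of how it could be maintained with meta.SetStatusCondition; the updateAvailableCondition helper is illustrative rather than part of the SDK scaffolding, and it assumes the controller's existing imports plus fmt and k8s.io/apimachinery/pkg/api/meta:

func (r *MemcachedReconciler) updateAvailableCondition(ctx context.Context, m *cachev1alpha1.Memcached, ready, desired int32) error {
    status := metav1.ConditionTrue
    reason := "ReplicasReady"
    if ready < desired {
        status = metav1.ConditionFalse
        reason = "ScalingInProgress"
    }

    // SetStatusCondition only bumps LastTransitionTime when the status actually changes.
    meta.SetStatusCondition(&m.Status.Conditions, metav1.Condition{
        Type:    "Available",
        Status:  status,
        Reason:  reason,
        Message: fmt.Sprintf("%d/%d replicas are ready", ready, desired),
    })
    m.Status.ReadyReplicas = ready

    // Writing through the status subresource keeps this separate from spec updates.
    return r.Status().Update(ctx, m)
}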

Step 4: Build, Deploy, and Test

# Build the operator image
make docker-build IMG=my-registry/memcached-operator:v0.1.0

# Push to a registry
make docker-push IMG=my-registry/memcached-operator:v0.1.0

# Install the CRD
make install

# Deploy the operator to the cluster
make deploy IMG=my-registry/memcached-operator:v0.1.0

# Create a sample Memcached resource
kubectl apply -f config/samples/cache_v1alpha1_memcached.yaml

# Watch the operator create the Deployment
kubectl get memcached
kubectl get deployments
kubectl get pods

You should see the operator create a Deployment with the number of replicas matching the size field in your Memcached resource. Edit the size field and the operator will scale the Deployment accordingly.

Operator Maturity: The Capability Model

Not all operators are equally sophisticated. The Operator Framework defines a five-level capability model that describes how much operational logic an operator encodes:

  1. Level 1 — Basic Install: Automated deployment and configuration. The operator can install the application with sensible defaults.
  2. Level 2 — Seamless Upgrades: The operator can handle version upgrades, including migrations and data format changes.
  3. Level 3 — Full Lifecycle: Backup, restore, and failure recovery. The operator can take backups on a schedule and restore from them.
  4. Level 4 — Deep Insights: Metrics, alerts, and log processing. The operator exposes application-specific telemetry and can integrate with the cluster’s monitoring stack.
  5. Level 5 — Auto Pilot: Automatic scaling, tuning, and self-healing based on application-specific signals. The operator adjusts configuration and resources proactively.

Most Helm-based operators top out at Level 2. Reaching Levels 3-5 requires a Go-based or Ansible-based operator with substantial domain logic. When evaluating whether to adopt an operator or build your own, the maturity level is a useful benchmark.

When to Use Operators (and When Not To)

Operators are powerful, but they are not the right tool for every workload. Here is a practical decision framework.

Use an Operator When:

  • You are running a stateful application that requires domain-specific operational procedures (databases, message queues, distributed storage).
  • Operations require multi-step orchestration — the sequence of actions matters, not just the end state. Failover, backup, rolling upgrades with specific ordering.
  • You need automated Day-2 operations — backups, certificate rotation, version upgrades, scaling decisions that require application-level awareness.
  • You want to provide a self-service API for a platform team to offer internal services. Teams can create a PostgresCluster resource without knowing the operational details.
  • A mature operator already exists for your workload. Do not build what you can adopt. Check OperatorHub.io, the CNCF landscape, and GitHub.

Do Not Use an Operator When:

  • A Deployment or StatefulSet is sufficient. If your application is stateless or only needs basic deployment semantics, a Deployment with an HPA (Horizontal Pod Autoscaler) is simpler and has zero additional moving parts.
  • A Helm chart solves the problem. If you only need to template configuration and do not need active reconciliation, a Helm chart is less operational overhead than an operator.
  • You do not have the capacity to maintain it. A custom operator is production software. It needs tests, CI/CD, dependency updates, and monitoring. If you build an operator and then do not maintain it, you have created a liability.
  • The operational logic is trivial. If the “operator” just creates a Deployment and a Service, you have added complexity for no benefit.

A useful rule of thumb: if the operational runbook for your application is more than a page long and involves conditional logic (“if replica lag exceeds X, then do Y”), it is a good candidate for an operator. If the runbook is “run helm upgrade,” it is not.

Operators and the Kubernetes Release Cycle

Operators interact deeply with the Kubernetes API, which means they are sensitive to Kubernetes version changes. When you plan a Kubernetes upgrade, verifying operator compatibility is a critical step that is often overlooked.

Here is what you need to watch for:

  • API version deprecations. Kubernetes regularly deprecates and removes API versions. If an operator creates resources using a removed API version (e.g., extensions/v1beta1 for Ingress), it will break after a Kubernetes upgrade even if the operator itself still starts.
  • CRD schema changes. Kubernetes has tightened CRD validation over time. A CRD that was accepted in Kubernetes 1.30 might fail validation in 1.35 if it uses deprecated schema features.
  • controller-runtime compatibility. The controller-runtime library that most operators depend on is tied to specific Kubernetes versions. Operator SDK v1.42.x uses controller-runtime v0.21.x, which targets Kubernetes 1.33+. Running an operator built against an older controller-runtime on a newer cluster usually works, but going the other direction can cause issues.
  • Webhook compatibility. If the operator uses admission webhooks, changes to the webhook API between Kubernetes versions can cause failures.

Before upgrading your cluster, check each installed operator’s compatibility matrix against your target Kubernetes version. Most mature operators document this explicitly. The Kubernetes support and EOL policy covers how long each version receives patches, which directly affects how long operator maintainers need to support each version.

Writing Production-Grade Operators: Best Practices

If you are building an operator for production use, the Memcached example above is a starting point but not an endpoint. Here are the patterns that separate toy operators from production-grade ones.

Handle Finalizers

If your operator creates resources outside the cluster (cloud resources, DNS records, external database schemas), use finalizers to ensure those resources are cleaned up when the custom resource is deleted. Without finalizers, Kubernetes will delete the CR immediately, and your external resources will be orphaned.

const finalizerName = "cache.example.com/finalizer"

func (r *MemcachedReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    memcached := &cachev1alpha1.Memcached{}
    if err := r.Get(ctx, req.NamespacedName, memcached); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // Handle deletion
    if !memcached.DeletionTimestamp.IsZero() {
        if controllerutil.ContainsFinalizer(memcached, finalizerName) {
            // Run cleanup logic (delete external resources)
            if err := r.cleanupExternalResources(ctx, memcached); err != nil {
                return ctrl.Result{}, err
            }
            // Remove the finalizer to allow deletion to proceed
            controllerutil.RemoveFinalizer(memcached, finalizerName)
            if err := r.Update(ctx, memcached); err != nil {
                return ctrl.Result{}, err
            }
        }
        return ctrl.Result{}, nil
    }

    // Add finalizer if not present
    if !controllerutil.ContainsFinalizer(memcached, finalizerName) {
        controllerutil.AddFinalizer(memcached, finalizerName)
        if err := r.Update(ctx, memcached); err != nil {
            return ctrl.Result{}, err
        }
    }

    // ... rest of reconciliation logic
}

Use Server-Side Apply

Instead of Get-Modify-Update patterns that are prone to conflict errors, use Server-Side Apply (SSA) for managing owned resources. SSA lets the controller declare the fields it manages, and the API server handles merge conflicts automatically. This was promoted to stable in Kubernetes 1.22 and is the recommended approach for new operators.
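Here is a minimal sketch of managing the owned Deployment with SSA instead of Get-Modify-Update, assuming the imports from the Memcached controller above; the applyDeployment helper name is illustrative, not part of the SDK scaffolding:

func (r *MemcachedReconciler) applyDeployment(ctx context.Context, m *cachev1alpha1.Memcached) error {
    labels := map[string]string{"app": "memcached", "memcached_cr": m.Name}
    replicas := m.Spec.Size

    dep := &appsv1.Deployment{
        // TypeMeta must be set explicitly: the apply patch is serialized directly
        // from this object, and the API server requires apiVersion and kind.
        TypeMeta: metav1.TypeMeta{APIVersion: "apps/v1", Kind: "Deployment"},
        ObjectMeta: metav1.ObjectMeta{
            Name:      m.Name,
            Namespace: m.Namespace,
        },
        Spec: appsv1.DeploymentSpec{
            Replicas: &replicas,
            Selector: &metav1.LabelSelector{MatchLabels: labels},
            Template: corev1.PodTemplateSpec{
                ObjectMeta: metav1.ObjectMeta{Labels: labels},
                Spec: corev1.PodSpec{
                    Containers: []corev1.Container{{
                        Name:  "memcached",
                        Image: "memcached:1.6-alpine",
                    }},
                },
            },
        },
    }
    if err := ctrl.SetControllerReference(m, dep, r.Scheme); err != nil {
        return err
    }

    // Declare only the fields this controller manages; the API server merges them
    // with fields owned by other managers and surfaces conflicts explicitly.
    return r.Patch(ctx, dep, client.Apply,
        client.FieldOwner("memcached-operator"),
        client.ForceOwnership)
}

Because the whole desired state is applied in one call, there is no read-modify-write cycle to race against, and repeated calls are naturally idempotent.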

Expose Metrics

Your operator should expose Prometheus metrics about its own health and the resources it manages. At minimum, expose:

  • Reconciliation duration (histogram)
  • Reconciliation errors (counter)
  • Number of managed resources by status
  • Queue depth and processing latency

The controller-runtime library provides a metrics server out of the box. Register your custom metrics and they will be available for scraping.
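As an example, a custom counter registered with controller-runtime's global registry might look like the following sketch (the metric name and labels are illustrative):

import (
    "github.com/prometheus/client_golang/prometheus"
    "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// memcachedReconcileErrors counts failed reconciliations per custom resource.
var memcachedReconcileErrors = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "memcached_operator_reconcile_errors_total",
        Help: "Total number of failed reconciliations, labeled by resource.",
    },
    []string{"name", "namespace"},
)

func init() {
    // Everything registered with metrics.Registry is served on the manager's
    // metrics endpoint alongside the built-in controller-runtime metrics.
    metrics.Registry.MustRegister(memcachedReconcileErrors)
}

Inside Reconcile, error paths would then increment it with memcachedReconcileErrors.WithLabelValues(req.Name, req.Namespace).Inc().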

Write Integration Tests

The envtest package (part of controller-runtime) lets you run the Kubernetes API server and etcd locally for integration testing. This is significantly more valuable than unit tests that mock the Kubernetes client, because it exercises the real API server behavior including validation, defaulting, and conflict detection.

var _ = Describe("Memcached controller", func() {
    Context("When creating a Memcached resource", func() {
        It("Should create a Deployment with matching replicas", func() {
            ctx := context.Background()
            memcached := &cachev1alpha1.Memcached{
                ObjectMeta: metav1.ObjectMeta{
                    Name:      "test-memcached",
                    Namespace: "default",
                },
                Spec: cachev1alpha1.MemcachedSpec{
                    Size:        3,
                    MemoryLimit: "128m",
                },
            }
            Expect(k8sClient.Create(ctx, memcached)).Should(Succeed())

            // Eventually the controller should create a Deployment
            deployment := &appsv1.Deployment{}
            Eventually(func() error {
                return k8sClient.Get(ctx, types.NamespacedName{
                    Name: "test-memcached", Namespace: "default",
                }, deployment)
            }, timeout, interval).Should(Succeed())

            Expect(*deployment.Spec.Replicas).Should(Equal(int32(3)))
        })
    })
})

Implement Structured Logging and Events

Use structured logging (the logr interface provided by controller-runtime) so your operator’s logs are machine-parseable. Additionally, emit Kubernetes Events for significant state changes — this integrates your operator with the standard kubectl describe workflow that engineers are already familiar with.
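A minimal sketch, assuming you add a Recorder field to the scaffolded reconciler and wire it up in cmd/main.go with mgr.GetEventRecorderFor("memcached-operator"); the reportScaling helper is illustrative:

type MemcachedReconciler struct {
    client.Client
    Scheme   *runtime.Scheme
    Recorder record.EventRecorder // from k8s.io/client-go/tools/record
}

// reportScaling emits a structured log line and a Kubernetes Event for a scaling action.
func (r *MemcachedReconciler) reportScaling(ctx context.Context, m *cachev1alpha1.Memcached, from, to int32) {
    // Structured logging: key-value pairs, machine-parseable.
    log.FromContext(ctx).Info("scaling deployment",
        "memcached", m.Name, "namespace", m.Namespace, "from", from, "to", to)

    // Event: surfaces in `kubectl describe memcached <name>` and `kubectl get events`.
    r.Recorder.Eventf(m, corev1.EventTypeNormal, "Scaling",
        "Scaling Deployment from %d to %d replicas", from, to)
}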

The Operator Ecosystem in 2026

The operator pattern has matured significantly since 2016. Here is where things stand today:

  • Operator SDK v1.42.x is the current stable release, built on Kubebuilder v4.6 and controller-runtime v0.21. It targets Kubernetes 1.33+ and includes scaffolding for TLS certificate management for webhooks and metrics endpoints.
  • OLM v1 is under active development, shifting to the ClusterExtension API and a RESTful catalog backend. OLM v0 is in maintenance mode.
  • OperatorHub.io hosts hundreds of operators across databases, monitoring, networking, security, and application platforms.
  • Kubernetes 1.35 shipped in December 2025 with in-place pod resize reaching stable and improvements to the control plane that benefit operator workloads. Kubernetes 1.36 is scheduled for April 2026.
  • CNCF operator projects like Prometheus Operator, cert-manager, Flux, and ArgoCD continue to push the boundaries of what operators can manage.

The pattern has also expanded beyond its original scope. Operators are no longer just for stateful applications. They are used for GitOps (Flux, ArgoCD), policy enforcement (Kyverno, OPA Gatekeeper), networking (Cilium, Istio), and platform engineering (Crossplane for provisioning cloud infrastructure via Kubernetes APIs).

Practical Advice for Getting Started

If you are new to operators, here is a prioritized path forward:

  1. Use existing operators before building your own. Install cert-manager, the Prometheus Operator, or CloudNativePG. Study how they work. Read their CRDs. Look at the events and status conditions they produce. This will teach you the patterns faster than any tutorial.
  2. Read the Kubernetes documentation on controllers. The official docs on custom resources, controllers, and the operator pattern are well-written and authoritative. Understanding how the built-in controllers (Deployment controller, ReplicaSet controller) work will make custom controllers intuitive.
  3. Start with the Operator SDK tutorial. The official Go operator tutorial walks you through building a Memcached operator similar to the one in this article. Work through it hands-on with a local kind or minikube cluster.
  4. Study production operators. Read the source code of mature operators like cert-manager or the Prometheus Operator. Pay attention to how they handle error cases, implement finalizers, manage status conditions, and structure their reconciliation logic.
  5. Build your first operator for a real (but low-risk) use case. An internal tool, a configuration manager, something where the blast radius of bugs is small. Get it running, write tests, iterate.

The operator pattern is one of the most important abstractions in the Kubernetes ecosystem. It bridges the gap between “Kubernetes can run containers” and “Kubernetes can run anything” by encoding the operational knowledge that complex applications require. Whether you are adopting existing operators or building your own, understanding how they work will make you a more effective Kubernetes engineer.
