Stateful workloads on Kubernetes have always forced a hard choice: lean on your cloud provider’s storage primitives and accept the cost and vendor lock-in, or deploy a distributed storage layer yourself and accept the operational complexity. That trade-off is worth revisiting now because the open-source options have matured considerably. Longhorn is at v1.11, Rook-Ceph is CNCF-graduated and shipping v1.19, and OpenEBS has consolidated its engine lineup around a high-performance SPDK core. The gap between “production-ready” and “still too rough for stateful workloads” has narrowed significantly.
This guide covers what you actually need to know to pick the right tool: architecture trade-offs, measured performance characteristics, resource requirements, and concrete deployment patterns. It is aimed at SREs and platform engineers running Kubernetes on bare metal, hybrid cloud, or edge environments where cloud-native storage isn’t a given.
Why You Need a Storage Layer at All
Kubernetes itself does not provide storage. It provides an API (the CSI spec) for storage systems to attach, mount, and manage persistent volumes. On managed clouds, that gap is filled by EBS, GCE PD, Azure Disk, and similar services. On bare metal, on-prem, or at the edge, you need something else.
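Whichever backend fills that gap, the consumption model is identical: a workload requests storage through a PersistentVolumeClaim against a StorageClass, and the CSI driver behind that class handles provisioning. A minimal sketch (the `fast-replicated` class name is a placeholder; substitute whatever class your chosen driver installs):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
spec:
  accessModes:
    - ReadWriteOnce            # single-node attach; typical for block storage
  storageClassName: fast-replicated   # hypothetical class name
  resources:
    requests:
      storage: 10Gi
```

Every tool in this comparison plugs into this same PVC/StorageClass interface, which is what makes them swappable at the manifest level even when their architectures differ radically.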
The options fall into two architectural camps:
- Shared-nothing, software-defined block storage: Each node contributes local disks to a distributed pool. Data is replicated across nodes. The system presents block devices (PVs) to pods. Examples: Rook-Ceph, Longhorn, OpenEBS Mayastor, LINSTOR.
- Local storage provisioners with topology awareness: Volumes are pinned to the node where their data lives. No replication, but excellent performance and simplicity. Examples: TopoLVM, OpenEBS Local PV, democratic-csi.
The replicated approach gives you HA; the local approach gives you raw speed. The right answer depends on your workload’s failure tolerance versus latency requirements.
Longhorn
Current version: 1.11.0 (v1.8.x also in active support) | CNCF status: Incubating | License: Apache 2.0
Longhorn was developed by Rancher Labs, donated to the CNCF in 2019, and reached incubating status in 2021. It is currently the easiest distributed block storage system to get running on Kubernetes, which is both its biggest strength and its main design constraint.
Architecture
Longhorn takes a microservices-within-Kubernetes approach: it runs a dedicated controller (longhorn-manager) plus a per-volume instance manager for each PV. Failure isolation is strong because a controller crash affects exactly one volume, not the whole cluster. The trade-off is resource consumption: Longhorn uses approximately 300 MB RAM per worker node at baseline before accounting for volume replicas.
What Works Well
- Single-command Helm install. The full stack deploys from one chart with sane defaults.
- Disaster recovery volumes: you can maintain a standby PV in a second cluster and fail over with a defined RTO. This is genuinely useful and uncommon in free tools.
- Web UI with volume management, snapshot scheduling, and backup configuration baked in.
- S3-compatible backup targets (any S3 endpoint, not just AWS).
- Non-disruptive live upgrades: you can upgrade the entire Longhorn stack without taking volumes offline.
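Beyond the chart defaults, replica count and rebuild behavior can be tuned per StorageClass. A sketch using Longhorn’s standard `driver.longhorn.io` provisioner (verify parameter names against your Longhorn version):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-3replica
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "3"       # replicas spread across distinct nodes
  staleReplicaTimeout: "30"   # minutes before a failed replica is rebuilt
allowVolumeExpansion: true
```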
Limitations
- Performance caps out around 19,000 IOPS in independent benchmarks, well below what Ceph or Mayastor can deliver on the same hardware.
- Sequential throughput measured at approximately 610 MB/s.
- Not well-suited for NVMe-heavy workloads where storage latency matters; the software path adds overhead.
- The dedicated per-volume process model means resource consumption scales with volume count.
Installation (Helm)
```shell
helm repo add longhorn https://charts.longhorn.io
helm repo update
helm install longhorn longhorn/longhorn \
  --namespace longhorn-system \
  --create-namespace \
  --set defaultSettings.defaultReplicaCount=3
```
Best for: Small to medium clusters (under 50 nodes), edge deployments, k3s environments, teams that need DR capabilities without enterprise contracts, Rancher-managed clusters.
Rook-Ceph
Current version: v1.19.2 | CNCF status: Graduated (since October 2020) | License: Apache 2.0 | Supported Kubernetes: v1.30-v1.35
Rook is not a storage system itself. It is an operator that deploys and manages Ceph inside Kubernetes, handling day-2 operations like OSD management, pool configuration, and failure recovery. Ceph is one of the most battle-tested distributed storage systems in existence; Rook makes it consumable from Kubernetes without running a separate cluster.
What Ceph Provides Through Rook
- Block storage (RBD): Standard PVs for stateful workloads.
- Filesystem (CephFS): ReadWriteMany volumes for shared access.
- Object storage (RGW): S3-compatible API, exposable as an S3 endpoint within or outside the cluster.
This breadth is Rook-Ceph’s defining characteristic. It is the only open-source option in this comparison that handles all three storage interfaces from a single pool.
Performance
Recent benchmarks show Ceph via Rook reaching approximately 32,000 IOPS and 890 MB/s sequential throughput with NVMe-backed OSDs. CPU usage is higher than alternatives because 3x replication and journaling add overhead, but the throughput ceiling is meaningfully higher than Longhorn or basic NFS solutions. The v1.19 release adds experimental NVMe over Fabrics (NVMe-oF) support, exposing RBD volumes via NVMe/TCP for both in-cluster pods and external hosts.
Minimum Production Requirements
| Resource | Minimum | Recommended |
|---|---|---|
| Nodes | 3 | 5+ |
| RAM per node | 8 GB | 16 GB |
| CPU per node | 4 cores | 8 cores |
| Disk per node | 1 raw, unformatted disk | Multiple NVMe |
| Network | 1 Gbps | 10 Gbps |
These requirements are not suggestions. Running Rook-Ceph on nodes with less than 8 GB RAM leads to OOM events during rebalancing operations. The 3-node minimum is a hard quorum requirement for Ceph Monitor stability.
Sample CephCluster CR
```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v19.2.0
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
    allowMultiplePerNode: false
  storage:
    useAllNodes: true
    useAllDevices: false
    devices:
      - name: sdb
```
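The CephCluster CR alone doesn’t provision volumes; you also need a pool and a StorageClass wired to the RBD CSI driver. A sketch based on Rook’s example manifests (secret references trimmed for brevity; see the Rook sample StorageClass for the full parameter set):

```yaml
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  failureDomain: host   # spread replicas across nodes, not just disks
  replicated:
    size: 3
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: replicapool
  csi.storage.k8s.io/fstype: ext4
  # The provisioner/node secret parameters are also required in a real
  # deployment; they are omitted here for brevity.
```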
Best for: Bare-metal clusters requiring all three storage interfaces (block, file, object); high-throughput database and analytics workloads; teams with existing Ceph expertise; clusters of 5+ nodes with dedicated storage hardware.
OpenEBS
Current version: v4.4 (Mayastor engine; v4.2.0 released February 2025) | CNCF status: Incubating | License: Apache 2.0
OpenEBS is architecturally the most flexible option in this list. It supports multiple storage engines, and the choice of engine determines almost everything about your deployment. The two engines worth focusing on in 2025-2026 are:
- Replicated PV Mayastor: High-performance engine built on SPDK (Storage Performance Development Kit), running in user space to bypass the kernel I/O stack.
- Local PV LVM / Local PV ZFS: Node-local storage with LVM or ZFS as the backend. No replication, but full snapshot and quota semantics.
Mayastor Performance
Mayastor’s SPDK-based architecture delivers approximately 28,000 IOPS in benchmarks, between Longhorn and Ceph. Sequential throughput reaches around 720 MB/s. The key advantage over Ceph is CPU efficiency: Mayastor achieves competitive throughput with lower CPU overhead because it bypasses the kernel block layer entirely.
Mayastor now supports RDMA transport for volume targets, enabling high-throughput, low-latency connections for application hosts where RDMA-capable hardware is available. This is useful for latency-sensitive databases.
Recent Notable Features (v4.2-v4.4)
- DiskPool expansion via the `maxExpansion` parameter (online capacity increases without disruption).
- Pool cordoning for maintenance windows (blocks new replica placement while draining).
- ZFS `zstd-fast` compression support on Local PV ZFS volumes.
- Thin-pool LV automatic cleanup on Local PV LVM when the last volume is deleted.
- OpenShift-compatible Helm deployment for both Mayastor and Local PV LVM from a single chart.
Installation (Helm, Mayastor engine)
```shell
helm repo add openebs https://openebs.github.io/openebs
helm repo update
helm install openebs openebs/openebs \
  --namespace openebs \
  --create-namespace \
  --set engines.replicated.mayastor.enabled=true
```
Note: Mayastor requires kernel 5.13 or later and hugepages configured on worker nodes:
```shell
# Required on each worker node
echo 1024 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
```
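With the prerequisites in place, Mayastor needs a DiskPool on each storage node before it can place replicas. A sketch (the node name and device path are placeholders, and the DiskPool `apiVersion` has changed across OpenEBS releases, so check the one your version expects):

```yaml
apiVersion: openebs.io/v1beta2
kind: DiskPool
metadata:
  name: pool-node-1              # hypothetical pool name
  namespace: openebs
spec:
  node: worker-1                 # placeholder: Kubernetes node name
  disks:
    - /dev/disk/by-id/nvme-example   # placeholder: stable device path
```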
Best for: Performance-sensitive stateful workloads (PostgreSQL, MySQL, Kafka); teams that want a mix of replicated and local storage from a single operator; OpenShift environments.
LINSTOR (via Piraeus Operator)
Maintained by: LINBIT | License: Apache 2.0 (open source core); LINBIT SDS commercial subscription available | Operator: Piraeus Datastore
LINSTOR is LINBIT’s software-defined storage system, built on DRBD (the same replication technology used in high-availability Linux clusters for over two decades). The Piraeus Operator deploys and manages LINSTOR clusters natively inside Kubernetes.
Architecture
LINSTOR uses DRBD for synchronous block replication between nodes, backed by LVM thin pools. It supports 1, 2, or 3 replica configurations with intelligent placement balancing across nodes. Encryption at rest uses a LUKS layer managed by LINSTOR; in-transit encryption uses DRBD’s SSL/TLS channel.
Deployment
```shell
kubectl apply --server-side -k \
  "https://github.com/piraeusdatastore/piraeus-operator//config/default?ref=v2"
```
This gives you an out-of-box storage pool and StorageClass immediately after applying. For teams already familiar with DRBD from traditional Linux HA clusters, the operational model will feel familiar.
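Replica count and pool selection are set per StorageClass through the LINSTOR CSI driver. A sketch assuming a storage pool named `pool1` exists on your satellites (parameter names per the Piraeus documentation; verify against your operator version):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: linstor-replicated
provisioner: linstor.csi.linbit.com
parameters:
  linstor.csi.linbit.com/storagePool: pool1   # assumes this pool exists
  linstor.csi.linbit.com/placementCount: "2"  # synchronous DRBD replicas
allowVolumeExpansion: true
```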
Pricing
The open source Piraeus Operator and LINSTOR core are free. LINBIT offers a commercial subscription (LINBIT SDS) that includes enterprise support, additional tooling, and access to certified builds. Pricing is contact-based.
Best for: On-premises bare-metal deployments where teams have existing DRBD expertise; environments requiring synchronous replication with well-understood failure semantics; clusters where Ceph’s resource requirements are prohibitive.
TopoLVM
Maintained by: Cybozu | License: Apache 2.0 | CSI spec: Full (resize, snapshots with thin provisioning)
TopoLVM is fundamentally different from the replicated systems above. It provisions local LVM volumes on each node and uses the CSI topology feature to ensure pods are scheduled only on nodes where their volume exists. There is no cross-node replication.
The key capability that makes TopoLVM useful in production is capacity-aware scheduling: it extends the Kubernetes scheduler to prioritize nodes with more available storage capacity. Without this, local volume provisioners can silently place pods on nodes that are nearly full, causing failures at runtime.
```yaml
# StorageClass for TopoLVM
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: topolvm-provisioner
provisioner: topolvm.io
parameters:
  csi.storage.k8s.io/fstype: ext4
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```
Best for: High-performance local storage for workloads that manage their own replication (Cassandra, Elasticsearch, some Kafka deployments); edge nodes with limited hardware where cluster-wide replication overhead is not justified.
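With `WaitForFirstConsumer`, the volume is not created until a pod is scheduled, so the scheduler can weigh node capacity first. A minimal claim against the class above:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: local-data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: topolvm-provisioner
  resources:
    requests:
      storage: 20Gi
```

The PVC stays Pending until a pod references it; at that point TopoLVM carves an LV on the chosen node and pins the pod there for its lifetime.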
Comparison Table
| Tool | Best For | Pricing | Open Source? | Key Strength |
|---|---|---|---|---|
| Longhorn | Edge, k3s, small clusters | Free | Yes (CNCF Incubating) | Simplest full-featured distributed block storage with built-in DR |
| Rook-Ceph | Large bare-metal, multi-protocol storage | Free | Yes (CNCF Graduated) | Block + file + object from one pool; highest throughput ceiling |
| OpenEBS Mayastor | Performance-critical stateful apps | Free | Yes (CNCF Incubating) | SPDK-based, kernel-bypass performance with flexible engine selection |
| LINSTOR (Piraeus) | On-prem with DRBD expertise | Free (OSS); paid support via LINBIT | Yes | DRBD-backed synchronous replication; lean resource profile |
| TopoLVM | High-IOPS local storage, self-replicating apps | Free | Yes | Capacity-aware scheduling; local NVMe performance without replication overhead |
| democratic-csi | TrueNAS/ZFS-backed shared storage | Free | Yes | Full CSI spec against existing ZFS infrastructure |
Performance Summary
Based on published benchmarks across comparable hardware configurations:
| Solution | Approx. IOPS | Sequential Throughput | CPU Overhead |
|---|---|---|---|
| Rook-Ceph (RBD, NVMe) | ~32,000 | ~890 MB/s | High (replication + journaling) |
| OpenEBS Mayastor | ~28,000 | ~720 MB/s | Low (user-space SPDK) |
| Longhorn | ~19,000 | ~610 MB/s | Moderate |
| TopoLVM (local) | Near-raw NVMe | Near-raw NVMe | Minimal |
Note: benchmarks vary significantly based on replica count, disk type, network bandwidth, and tuning. These figures represent community-published results on NVMe-backed nodes with 10 Gbps networking. Local storage (TopoLVM) delivers near-raw disk performance because it adds no replication overhead.
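Rather than trusting published figures, validate IOPS on your own hardware by running fio from inside a pod that mounts the PV under test. A typical 4k random-read profile (adjust `iodepth` and `numjobs` to match your workload’s queue depth and concurrency):

```shell
# Run inside a pod with the volume under test mounted at /data
fio --name=randread --filename=/data/testfile \
    --rw=randread --bs=4k --iodepth=64 --numjobs=4 \
    --size=4G --runtime=60 --time_based --direct=1 \
    --ioengine=libaio --group_reporting
```

Use `--direct=1` so results reflect the storage path rather than the page cache, and repeat with `--rw=randwrite` and a sequential profile to cover the cases in the table above.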
Recommendations by Use Case
Small clusters, edge, k3s, Rancher-managed: Use Longhorn. Install time is under 10 minutes, the UI makes operational tasks accessible, and the built-in DR volume feature covers the most important HA scenario without additional tooling.
Large bare-metal with diverse storage needs: Use Rook-Ceph. If you need block for stateful apps, CephFS for ReadWriteMany, and an S3-compatible object endpoint, Ceph gives you all three from one pool. Budget for at least 5 nodes and dedicated NVMe disks. Do not underestimate the operational learning curve.
Latency-sensitive databases on NVMe: Use OpenEBS Mayastor. The SPDK architecture removes kernel I/O overhead at the cost of a more complex installation (hugepages, kernel version requirements). For PostgreSQL or MySQL where storage latency directly translates to query latency, the performance gain is measurable.
On-premises with existing DRBD knowledge: Use LINSTOR via Piraeus. If your team already operates DRBD in traditional Linux HA contexts, LINSTOR extends that knowledge into Kubernetes rather than requiring a complete re-learning of a new storage system.
Self-replicating workloads needing raw speed: Use TopoLVM. Cassandra, Elasticsearch, and similar systems that handle their own data replication at the application layer get no benefit from a storage-layer replica. Pin them to nodes with local NVMe via TopoLVM and get near-raw disk performance with proper capacity-aware scheduling.
Existing TrueNAS or ZFS infrastructure: Use democratic-csi. If you already run TrueNAS or ZoL on Ubuntu, democratic-csi gives you full CSI semantics (resize, snapshots, clones) against your existing storage without migrating to a new system.
The right Kubernetes storage layer is the one that matches your node count, hardware profile, team expertise, and workload’s actual I/O pattern. None of these tools is universally “best.” What they share is active maintenance, production usage at scale, and real community support. The evaluation decision comes down to knowing your workload’s IOPS and latency requirements, your cluster size, and how much operational complexity your team can absorb.