Rook Ceph on MicroK8s

21 March, 2023

Brett Milford

Storage is an important part of any application lifecycle. The Kubernetes Container Storage Interface (CSI) supports numerous storage drivers 1. However, only a few are independent of a cloud provider, and even fewer retain the properties required for a resilient and reliable storage solution.

Furthermore, applications in a single cluster might rely on different types of storage, potentially leading to an issue of storage pool management fragmentation.

For instance, different applications might require:

  • File Storage: NFS
  • Block Storage: Longhorn
  • Object Storage: Minio

A powerful and flexible alternative exists. Ceph can provide:

  • File (CephFS)
  • Block (Rados Block Device)
  • Object (Rados Gateway)

Furthermore, it can be deployed onto a Kubernetes cluster with Rook, and its File and Block storage can be provisioned and managed via the CephCSI driver.

Getting hands-on and practical experience with Rook and Ceph can be tricky without dedicated hardware, but in this article, we will utilise MicroK8s to provide a complete and scalable solution.

In this example, we aim to deploy rook-ceph onto a single-node MicroK8s cluster and look at some of the tweaks and trade-offs we can make for a minimal footprint deployment.

# MicroK8s

MicroK8s is a lightweight, easy-to-install Kubernetes distribution that can be used to deploy a Kubernetes cluster on a single machine. It aims to track the upstream Kubernetes project closely and to be minimal out of the box, and provides an “addons” interface for enabling extra features. It utilises Kubernetes components built into a single binary with Go routines (Kubelite) and, in a deviation from the norm, Dqlite for cluster state (rather than Etcd - although it can be enabled).

We can obtain a basic MicroK8s setup with the following 2:

$ snap install microk8s --classic
$ microk8s status --wait-ready

At an absolute minimum (and frankly, for most applications), you’ll require a cluster DNS provider (i.e., CoreDNS) for service discovery, so you’ll need to enable this before continuing:

$ microk8s enable dns

# Rook

Rook is an open-source, cloud-native storage orchestrator for Kubernetes. It provides a platform for distributed storage systems that can run on a Kubernetes cluster.

The Rook operator project 3 provides several installation methods for Rook-Ceph 4.

For instance, you can deploy the Rook-Ceph components directly from the examples provided in the source tree:

$ git clone --single-branch --branch v1.11.2 https://github.com/rook/rook.git
$ cd rook/deploy/examples
$ kubectl create -f crds.yaml -f common.yaml -f operator.yaml
$ kubectl create -f cluster.yaml

But my preferred method is via the Helm charts. To do this we need to make a few key changes to our chart values.

## Rook Ceph Helm Charts

Start by obtaining the chart values.yaml from the Rook project 5. There are two charts, rook-ceph and rook-ceph-cluster.

I customise them as follows:

@@ -7,7 +7,7 @@
   repository: rook/ceph
   # -- Image tag
   # @default -- `master`
-  tag: VERSION
+  tag: v1.11.1
   # -- Image pull policy
   pullPolicy: IfNotPresent
 
@@ -452,7 +452,7 @@
 
   # -- Kubelet root directory path (if the Kubelet uses a different path for the `--root-dir` flag)
   # @default -- `/var/lib/kubelet`
-  kubeletDirPath:
+  kubeletDirPath: "/var/snap/microk8s/common/var/lib/kubelet"
 
   cephcsi:
     # -- Ceph CSI image
@@ -601,4 +601,4 @@
 monitoring:
   # -- Enable monitoring. Requires Prometheus to be pre-installed.
   # Enabling will also create RBAC rules to allow Operator to create ServiceMonitors
-  enabled: false
+  enabled: true
  1. We need to set the image tag to pull. Consult Docker Hub for a list of available tags.
  2. The Rook operator needs to know the path of the kubelet root directory. MicroK8s is a snap, and as such makes use of paths within its snap confinement for the components it runs. The kubelet root directory can be found at /var/snap/microk8s/common/var/lib/kubelet.
  3. Finally (and optionally), we enable monitoring, which will create the ServiceMonitor objects for use with the Prometheus Operator.
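
For reference, the same changes can be consolidated into a minimal values overlay for the rook-ceph chart (a sketch containing only the settings changed above; every other value keeps its chart default):

# values-rook-ceph.yaml
image:
  tag: v1.11.1
csi:
  # MicroK8s keeps the kubelet root directory inside its snap confinement
  kubeletDirPath: "/var/snap/microk8s/common/var/lib/kubelet"
monitoring:
  # Requires Prometheus to be pre-installed
  enabled: true
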
@@ -13,12 +13,11 @@
 kubeVersion:
 
 # -- Cluster ceph.conf override
-configOverride:
-# configOverride: |
-#   [global]
-#   mon_allow_pool_delete = true
-#   osd_pool_default_size = 3
-#   osd_pool_default_min_size = 2
+configOverride: |
+   [global]
+   osd_pool_default_size = 1
+   mon_warn_on_pool_no_redundancy = false
+   osd_memory_target = 2048Mi

Here, we’re using the configOverride for ceph.conf to make a couple of important changes. Note that these changes will be in the ceph.conf of every daemon that Rook deploys.

  1. osd_pool_default_size sets the default value for the number of replicas for objects in the pool. osd_pool_default_min_size sets the minimum number of written replicas for objects in the pool for an I/O to be acknowledged to the client. For a discussion of why we’re using single replica pools see Single Node Ceph Performance.

  2. We’re turning off Ceph’s warning about pool redundancy. This is necessary to achieve HEALTH_OK when intentionally using single replica pools.

  3. We’re setting the OSD memory target to 2G. In our pursuit of a compact cluster, we have a few mechanisms to tune the resource usage of the cluster. We can set pod limits and requests, which dictate placement and the point at which the OSD pod (and daemon) will be killed. This setting (osd_memory_target), however, applies directly to the OSD daemon and prompts BlueStore to try to keep OSD heap memory usage under this target. In clusters with larger OSDs this value will likely result in poorer performance, so the default 4G should be preferred. With especially large OSDs it may even be beneficial to increase this value beyond the default 6 7.

 # Installs a debugging toolbox deployment
 toolbox:
@@ -44,9 +43,9 @@
 monitoring:
   # -- Enable Prometheus integration, will also create necessary RBAC rules to allow Operator to create ServiceMonitors.
   # Monitoring requires Prometheus to be pre-installed
-  enabled: false
+  enabled: true
   # -- Whether to create the Prometheus rules for Ceph alerts
-  createPrometheusRules: false
+  createPrometheusRules: true
   # -- The namespace in which to create the prometheus rules, if different from the rook cluster namespace.
   # If you have multiple rook-ceph clusters in the same k8s cluster, choose the same namespace (ideally, namespace with prometheus
   # deployed) to set rulesNamespace for all the clusters. Otherwise, you will get duplicate alerts with multiple alert definitions.
@@ -258,39 +257,40 @@
   #   # If no mgr annotations are set, prometheus scrape annotations will be set by default.
   #   mgr:
 
-  # labels:
-  #   all:
-  #   mon:
-  #   osd:
-  #   cleanup:
-  #   mgr:
-  #   prepareosd:
-  #   # monitoring is a list of key-value pairs. It is injected into all the monitoring resources created by operator.
-  #   # These labels can be passed as LabelSelector to Prometheus
-  #   monitoring:
+  labels:
+    # all:
+    # mon:
+    # osd:
+    # cleanup:
+    # mgr:
+    # prepareosd:
+    # monitoring is a list of key-value pairs. It is injected into all the monitoring resources created by operator.
+    # These labels can be passed as LabelSelector to Prometheus
+    monitoring:
+      release: kube-prometheus-stack

Here we’re setting up monitoring and, in particular, adding the label required for the monitoring resources to be picked up by the Prometheus stack.
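
If you’re unsure which label your Prometheus instance selects ServiceMonitors by, you can read the selector straight from the Prometheus resource (this assumes the Prometheus Operator is already deployed; the release value above matches a kube-prometheus-stack install released as kube-prometheus-stack):

$ kubectl get prometheus -A -o jsonpath='{.items[0].spec.serviceMonitorSelector}{"\n"}'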

@@ -116,17 +115,17 @@
   mon:
     # Set the number of mons to be started. Generally recommended to be 3.
     # For highest availability, an odd number of mons should be specified.
-    count: 3
+    count: 1
     # The mons should be on unique nodes. For production, at least 3 nodes are recommended for this reason.
     # Mons should only be allowed on the same node for test environments where data loss is acceptable.
-    allowMultiplePerNode: false
+    allowMultiplePerNode: true
 
   mgr:
     # When higher availability of the mgr is needed, increase the count to 2.
     # In that case, one mgr will be active and one in standby. When Ceph updates which
     # mgr is active, Rook will update the mgr services to match the active mgr.
-    count: 2
-    allowMultiplePerNode: false
+    count: 1
+    allowMultiplePerNode: true
     modules:
       # Several modules should not need to be included in this list. The "dashboard" and "monitoring" modules
       # are already enabled by other settings in the cluster CR.
  1. We reduce the daemon counts for mon and mgr to 1.
  2. Additionally, we set allowMultiplePerNode to true. Without this field, on a single node deployment, Rook would try to deploy multiple mon pods onto our single node, and some of them would fail due to constraints. Whilst this is obviously reasonable behaviour, I contacted the Rook developers via their Slack channel about enabling this single node use case. Not only were they extremely receptive, they had this PR up and merged within a couple of weeks to support this non-critical use case.
@@ -135,7 +134,7 @@
 
   # enable the ceph dashboard for viewing cluster status
   dashboard:
-    enabled: true
+    enabled: false
     # serve the dashboard under a subpath (useful when you are accessing the dashboard via a reverse proxy)
     # urlPrefix: /ceph-dashboard
     # serve the dashboard at the given port.

Here we’re disabling the Ceph dashboard to minimise the daemons being deployed that aren’t essential to the service. I mainly used the Ceph dashboard as a monitoring tool, so I chose to send metrics to Prometheus and monitor the cluster with Grafana instead.

   resources:
     mgr:
       limits:
-        cpu: "1000m"
+        #cpu: "1000m"
         memory: "1Gi"
       requests:
         cpu: "500m"
         memory: "512Mi"
     mon:
       limits:
-        cpu: "2000m"
+        #cpu: "2000m"
         memory: "2Gi"
       requests:
-        cpu: "1000m"
+        #cpu: "1000m"
         memory: "1Gi"
     osd:
       limits:
-        cpu: "2000m"
+        #cpu: "2000m"
         memory: "4Gi"
       requests:
         cpu: "1000m"
-        memory: "4Gi"
+        memory: "2Gi"
     prepareosd:
       # limits: It is not recommended to set limits on the OSD prepare job
       #         since it's a one-time burst for memory that must be allowed to
@@ -300,6 +300,8 @@
       #         limit should be added.  1200Mi may suffice for up to 15Ti
       #         OSDs ; for larger devices 2Gi may be required.
       #         cf. https://github.com/rook/rook/pull/11103
+      limits:
+        memory: "2Gi"
       requests:
         cpu: "500m"
         memory: "50Mi"
         

Here we’re tuning the resource limits and requests for the Ceph daemons. I set default memory limits in all namespaces with the following LimitRange:

apiVersion: v1
kind: LimitRange
metadata:
  name: mem-limit-range
spec:
  limits:
  - default:
      memory: 512Mi
    defaultRequest:
      memory: 256Mi
    type: Container

As such, we need to explicitly set the prepareosd memory limit to override this default and avoid having the prepareosd pod killed before it can run to completion.
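
Assuming the LimitRange above is saved as mem-limit-range.yaml, applying and inspecting it for the rook-ceph namespace looks like this:

$ kubectl apply -n rook-ceph -f mem-limit-range.yaml
$ kubectl describe limitrange mem-limit-range -n rook-ceph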

@@ -343,7 +345,8 @@
 
   storage: # cluster level storage configuration and selection
     useAllNodes: true
-    useAllDevices: true
+    useAllDevices: false
+    devicePathFilter: ^/dev/disk/by-id/ata-SPCC_Solid_State_Disk_.*
     # deviceFilter:
     # config:
     #   crushRoot: "custom-root" # specify a non-default root label for the CRUSH map
     

Here we’re configuring which devices will be used as OSDs with devicePathFilter, which supports regex.
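
To work out a suitable devicePathFilter for your own hardware, list the stable by-id paths and cross-reference them against the block devices on the node:

$ ls -l /dev/disk/by-id/ | grep -i ata-
$ lsblk -o NAME,SIZE,TYPE,MOUNTPOINT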

@@ -431,9 +434,9 @@
   - name: ceph-blockpool
     # see https://github.com/rook/rook/blob/master/Documentation/CRDs/Block-Storage/ceph-block-pool-crd.md#spec for available configuration
     spec:
-      failureDomain: host
+      failureDomain: osd
       replicated:
-        size: 3
+        size: 1
       # Enables collecting RBD per-image IO statistics by enabling dynamic OSD performance counters. Defaults to false.
       # For reference: https://docs.ceph.com/docs/master/mgr/prometheus/#rbd-io-statistics
       # enableRBDStats: true
@@ -474,8 +477,10 @@
         # Available for imageFormat: "2". Older releases of CSI RBD
         # support only the `layering` feature. The Linux kernel (KRBD) supports the
         # full feature complement as of 5.4
-        imageFeatures: layering
+        # imageFeatures: layering
+        imageFeatures: layering,fast-diff,object-map,deep-flatten,exclusive-lock
 
         # These secrets contain Ceph admin credentials.
         csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
         csi.storage.k8s.io/provisioner-secret-namespace: "{{ .Release.Namespace }}"

Here we’re configuring the RBD pool, including adding image features. By default, Rook utilises a limited image feature set to maximise compatibility; we change that here.
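
Once the cluster is up and a volume has been provisioned, the image features actually applied can be checked from the Ceph toolbox (ceph-blockpool is the pool name from the chart above; substitute an image name reported by rbd ls):

$ rbd ls --pool ceph-blockpool
$ rbd info ceph-blockpool/<image-name>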

@@ -496,11 +501,11 @@
     spec:
       metadataPool:
         replicated:
-          size: 3
+          size: 1
       dataPools:
-        - failureDomain: host
+        - failureDomain: osd
           replicated:
-            size: 3
+            size: 1
           # Optional and highly recommended, 'data0' by default, see https://github.com/rook/rook/blob/master/Documentation/CRDs/Shared-Filesystem/ceph-filesystem-crd.md#pools
           name: data0
       metadataServer:

Here we’re modifying the Ceph FS pool configuration.

@@ -569,13 +574,13 @@
     # see https://github.com/rook/rook/blob/master/Documentation/CRDs/Object-Storage/ceph-object-store-crd.md#object-store-settings for available configuration
     spec:
       metadataPool:
-        failureDomain: host
+        failureDomain: osd
         replicated:
-          size: 3
+          size: 1
       dataPool:
-        failureDomain: host
+        failureDomain: osd
         erasureCoded:
-          dataChunks: 2
+          dataChunks: 3
           codingChunks: 1
       preservePoolsOnDelete: true
       gateway:

Here we’re modifying the Rados Gateway pool, which is configured with Erasure Coding (EC). By default, EC pools only work with operations that perform full object writes and appends, such as RGW 8. In this circumstance I use an EC pool mostly out of curiosity, whilst I use single replica pools for CephFS and RBD. Finally, as with the other pools, I set failureDomain to osd, which spreads objects (or, in the case of EC, the data and coding chunks) across OSDs on the same node (which is critical for a single node deployment).
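
Once the cluster is deployed, the resulting erasure-code profile and pool settings can be inspected from the Ceph toolbox (the profile name will vary, so list the profiles first):

$ ceph osd erasure-code-profile ls
$ ceph osd erasure-code-profile get <profile-name>
$ ceph osd pool ls detail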

Finally we deploy the charts into the rook-ceph namespace:

helm repo add rook-release https://charts.rook.io/release
helm install --create-namespace --namespace rook-ceph rook-ceph rook-release/rook-ceph -f values-rook-ceph.yaml

helm install --create-namespace --namespace rook-ceph rook-ceph-cluster \
   --set operatorNamespace=rook-ceph rook-release/rook-ceph-cluster -f values-rook-ceph-cluster.yaml
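
Once both charts are installed, you can watch the operator bring the cluster up and wait for the CephCluster resource to report HEALTH_OK:

$ kubectl -n rook-ceph get pods -w
$ kubectl -n rook-ceph get cephcluster rook-ceph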

# Using Rook Ceph storage

To use the CephFS and RBD storage classes, you simply require a Persistent Volume Claim 9 targeting the relevant Storage Class, and Ceph CSI will provision and manage the storage for you. Another useful mode is to set hostNetwork: true, which will make Ceph available outside the Kubernetes cluster. You can then use another management interface (e.g. OpenStack Cinder or Manila) to allocate and provision RBD and CephFS.
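
A minimal claim against the RBD-backed class might look like the following (ceph-block is the chart’s default storage class name for the block pool; use ceph-filesystem for CephFS-backed volumes):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-rbd
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  storageClassName: ceph-block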

With Rados Gateway, we can utilise the S3 or Swift APIs to allocate and provision object storage buckets. We may also utilise the objectbucket API to provision storage.

For example:

apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: www-ceph-bucket
  namespace: www
spec:
  bucketName: www
  storageClassName: ceph-bucket

Here, bucketName produces a static bucket name, whilst generateBucketName guarantees a unique name. Rook will also create a corresponding ConfigMap and Secret with the endpoint details and bucket credentials.
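
A pod can pick those details up by referencing the ConfigMap and Secret, both of which share the claim’s name (a sketch; the client image is arbitrary and the key names in the comments follow the usual lib-bucket-provisioner conventions):

apiVersion: v1
kind: Pod
metadata:
  name: www-s3-client
  namespace: www
spec:
  containers:
  - name: client
    image: amazon/aws-cli        # any S3-capable client image will do
    command: ["sleep", "infinity"]
    envFrom:
    - configMapRef:
        name: www-ceph-bucket    # BUCKET_HOST, BUCKET_NAME, BUCKET_PORT
    - secretRef:
        name: www-ceph-bucket    # AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY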

# Troubleshooting and Maintenance

## Deploying the Ceph toolbox pod for troubleshooting 10

To deploy:

$ kubectl apply -f https://raw.githubusercontent.com/rook/rook/master/deploy/examples/toolbox.yaml

To exec:

$ kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- bash

To delete:

$ kubectl -n rook-ceph delete deployment rook-ceph-tools

## Default storage class conflicts

MicroK8s ships with an optional addon for Host Path storage. After deploying Rook Ceph, I found both Ceph RBD and MicroK8s Host Path storage were configured as the default. This caused problems where applications deployed without specifying a particular storage class would get the wrong storage. I found this behaviour unintuitive, but apparently it’s completely consistent with Kubernetes.

Either way, it can be easily rectified by setting the is-default-class annotation on the microk8s-hostpath storage class to false 11.

$ kubectl patch storageclass microk8s-hostpath \
    -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'

## Rook won’t provision with “failed to start device discovery daemonset”

By default, MicroK8s does not pass --allow-privileged=true to the API server. However, some addons will modify MicroK8s to do so. Therefore, you may or may not encounter this error depending on the addons you’ve enabled:

failed to start device discovery daemonset: Error starting discover daemonset: failed to create rook-discover daemon set. DaemonSet.apps "rook-discover" is invalid: spec.template.spec.containers[0].securityContext.privileged: Forbidden: disallowed by cluster policy

If you don’t require any addons that enable this flag, you can manually add --allow-privileged=true to /var/snap/microk8s/current/args/kube-apiserver.
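
In that case, append the flag and restart the MicroK8s services (a sketch; running microk8s stop followed by microk8s start achieves the same):

$ echo '--allow-privileged=true' | sudo tee -a /var/snap/microk8s/current/args/kube-apiserver
$ sudo snap restart microk8s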

## Ceph OSDs CrashLoop after suspend

My single node cluster is primarily used for testing and experimental purposes, and I occasionally suspend, power off or restart the machine. Sometimes this leaves Rook’s OSD pods in a CrashLoop state. I tracked this back to an issue with the Operator misunderstanding the state of the OSD pods.

This can be checked by following the Operator’s logs with:

$ kubectl -n rook-ceph logs -l "app=rook-ceph-operator" -f

The Operator can be restarted with:

$ kubectl -n rook-ceph get deployments
$ kubectl -n rook-ceph rollout restart deployment rook-ceph-operator

## Manually setting the Ceph OSD memory target

As mentioned in the Helm chart section, I use a configOverride to set osd_memory_target. You can also run Ceph commands from the Ceph toolbox to experiment with this tunable directly.

$ ceph config set osd.3 osd_memory_target 2048Mi
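
The value currently applied to a daemon can be read back in the same way:

$ ceph config get osd.3 osd_memory_target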

## Modifying the Ceph crush map for a single node cluster

We modified our Rook-Ceph Helm charts to include failureDomain: osd, so that replicas would be spread across OSDs on the same host rather than across multiple hosts. If we had not done this, we would end up with PGs in the undersized state 12 13.

From the Ceph toolbox:

  1. Check ceph -s. We should see pg state undersized.
  2. Check current osd tree configuration:
[root@rook-ceph-tools-7d9467775-m2knb /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME         STATUS  REWEIGHT  PRI-AFF
-1         7.27759  root default
-3         7.27759      host orpheus
 0    hdd  1.81940          osd.0         up   1.00000  1.00000
 1    hdd  1.81940          osd.1         up   1.00000  1.00000
 2    hdd  1.81940          osd.2         up   1.00000  1.00000
 3    hdd  1.81940          osd.3         up   1.00000  1.00000
  3. Get osd crush rule:
[root@rook-ceph-tools-7d9467775-m2knb /]# ceph osd crush rule ls
replicated_rule
[root@rook-ceph-tools-7d9467775-m2knb /]# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]
  4. Get crush map and convert it to text:
# ceph osd getcrushmap -o crush.bin
# crushtool -d crush.bin -o crush.orig.txt
  5. In the replicated_rule, change:

        step chooseleaf firstn 0 type host

     to:

        step chooseleaf firstn 0 type osd

  6. Convert it back to binary:
# crushtool -c crush.orig.txt -o crush.bin.new
  7. Set new crush map:
# ceph osd setcrushmap -i crush.bin.new
  8. Check ceph -s again.

## Removing an OSD

  1. Set the OSD as out and wait for backfill to finish 14 15 16:
$ ceph osd out osd.3
$ watch ceph -s
  2. Find the backing devices of the OSD:
$ ceph osd metadata 3
  3. Update the cluster Custom Resource so that it doesn’t re-provision the OSD 17 18.

  4. Mark the OSD down, then purge:
$ ceph osd down osd.3
$ ceph osd purge osd.3 --yes-i-really-mean-it
  5. Delete the Rook OSD deployment. This can be done automatically if removeOSDsIfOutAndSafeToRemove: true is set in the cluster Custom Resource.
$ kubectl delete deployment -n rook-ceph rook-ceph-osd-3
  6. a) Remove the backing data 15:
$ sgdisk --zap-all /dev/sdc
  6. b) If you’re not rebooting the node, find the associated LVM PV/VG and unmap it:
$ sudo pvdisplay /dev/sdc
$ ls /dev/mapper/ceph--block--$VG_UUID
$ sudo dmsetup remove $PATH_TO_VG

## Handy notes

Retrieve information about the Ceph cluster 19:

$ kubectl -n rook-ceph describe cephcluster rook-ceph

# Single Node Ceph Performance

In the initial iterations of this project, I noticed that driving multiple copies of I/O on the hardware I am using made the cluster unusable for testing purposes due to saturation of storage controller bandwidth.

Server Specs:

  • Intel(R) Xeon(R) CPU X5690 @ 3.47GHz (12x)
  • 96GiB System Memory (6x 16GiB DIMM DDR3 1333 MHz)
  • 3x 1024GB SPCC Solid State Disk (SCSI)
  • Kernel 5.15.0-67-generic #74~20.04.1-Ubuntu

In particular, I observed that:

  • Although I don’t think individual I/O to any particular drive was very high, there were spikes in w_await on particular drives, which were amplified through the Logical Volume.
  • There were a lot of interrupts on the SATA controller.
  • Changing replicas to 1x largely solved this.

I concluded that the generally slow performance was due to Ceph writing three copies of data across one SATA bus. Single copy pools did not exhibit these issues; however, I still need to investigate whether Ceph remains durable in this configuration, and whether actions like scrubbing still do anything relevant or can be turned off.

An alternative is to use EC pools, where only parity (coding) data is stored in addition to the data itself, thus reducing I/O through the storage controller. However, they have their own drawbacks. For example:

  • Pools requiring OMAP data can’t use EC (FS and RGW metadata pools)20, and the durability of the data within the EC pool is only as good as the durability of its metadata. If I ultimately have a single replica metadata pool, especially in a small cluster, the chances of failure or data loss are not significantly mitigated.
  • EC pools need to be created upfront as they can’t later be converted, so this is a decision that should be made before ingesting a lot of data.

For the purpose of this demonstration, EC pools wouldn’t significantly improve durability and we do not require Ceph to be redundant or durable so I opted to simply not replicate the data 21.

We configured OSD pool sizes accordingly in the Helm chart section; however, we can also do so with the following commands from the Ceph toolbox:

# min_size <= size
$ ceph osd pool ls # get pool names
$ ceph osd pool set {poolname} min_size {num-replicas}
$ ceph osd pool set {poolname} size {num-replicas}

Note that I/O will stop if the number of available replicas drops below min_size. Specifically, “This ensures that no object in the data pool will receive I/O with fewer than min_size replicas.” 21

# Footnotes and References