The cloud is my computer
I’m sure most homelabbers are familiar with the axiom, “The cloud is someone else’s computer.” We’re well aware that this is the case — it’s why we do what we do!
For the longest time, my “homelab” has been kind of a mess; it (as most probably are) was a hodgepodge of services running on a cornucopia of devices. It started with a few personal services running on my desktop (“It’d be nice to have a personal git server; why don’t I spin up Gitea?”), which I then kept on 24/7 (“It’d be nice to keep qBittorrent running overnight”). I eventually had so much crap running on my desktop that I was honestly kind of scared to touch it, lest something break; I used my laptop for my day-to-day.
Wanting the desktop back, I dug up an old Raspberry Pi and moved most of the lighter software there. I was very proud of containerizing everything — some software didn’t have Docker containers provided; some that did had poorly written `Dockerfile`s. Of course, the Pi was horrifically under-powered, and its resources were quickly saturated. Easy solution — get another Pi! And a real mini PC to host software with higher memory requirements!
Obviously, management was a nightmare. I didn’t update anything; as long as it worked, I didn’t touch it. I had no idea which host was hosting what software, so I’d `ssh` into each host sequentially and `cat docker-compose.yml` to check if I was in the right machine. Big yikes.
NixOS
So, how to fix? Having gotten my desktop back, I’d decided it was time for a fresh start and installed NixOS, a distro I’d heard about from Shopify’s blog. After having been in the Nix ecosystem for a while, I stumbled across NixOps, but I (like Xe) didn’t like how it managed secrets, and it seemed like abandonware, anyway; I wasn’t a fan of other Nix-based alternatives, either.
In general, I find Nix much more difficult than other Linux distributions. It’s such a huge paradigm shift, and `nixpkgs` is such a huge project, that documentation is often lacking. While this is fine for a desktop machine, it’s not what I look for in server software; in fact, the `x86` mini PC ran Arch, because its documentation is that good. When something breaks, I want to be able to immediately fix it — or at least have instructions on how to do so easily accessible.
Kubernetes
Kubernetes was actually my first choice for orchestration, not NixOS. Like Nix, Kubernetes has the advantage of “configuration as code”. Plus, I was familiar with its architecture at a surface level, having used GKE extensively at `$DAYJOB`, and I felt comfortable enough that I’d know what to do in a minor emergency. (I know Kubernetes can fail in a seemingly infinite number of spectacularly unique ways, but I wasn’t planning on messing with Kubernetes itself too much.)
I began setting up a virtualized Kubernetes cluster inside VirtualBox on my desktop machine (begrudgingly so, as I didn’t have any other machines to work with). I knew the following: I wanted to use Arch, and I was likely going to have to repeat the install process a few times. Having become a #hachydermian, I’d seen Kris Nóva’s bash scripting expertise on her Twitch stream, and I was inspired! I wrote scripts to bootstrap Arch and (learning as I went) vanilla Kubernetes, Flannel, MetalLB, and Traefik. Finally, I had Kubernetes fully functional… inside VirtualBox.
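For anyone unfamiliar with the stack, MetalLB is what gives a bare-metal cluster working `LoadBalancer` Services. A minimal sketch of its CRD-based configuration (MetalLB 0.13 and later) looks roughly like the following; the names and the address range are made-up placeholders, not my actual network:

```yaml
# Hand MetalLB a pool of LAN IPs to assign to LoadBalancer Services,
# and advertise them over plain layer-2 ARP. The range below is an
# illustrative placeholder.
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.1.240-192.168.1.250
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - default-pool
```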
Ceph
Finally, I pulled the trigger and picked up a PowerEdge R430 from a local refurbisher. I got it set up, which took a few days in and of itself. I didn’t have a VGA cable lying around (how is this even possible?), and the server didn’t come with an iDRAC Enterprise license, so I had no way to get video output to actually run the bootstrap script. I eventually discovered that Dell had offered a 240-day free trial of iDRAC Enterprise (for which the license was still floating around online), and got to work.
Everything went swimmingly — until I realized I needed storage. Being familiar with GKE, I liked the concept of dynamically-provisioned `PersistentVolumeClaim`s, and I did not want to use `hostPath` (Kubernetes’ documentation even says `hostPath` is for “development and testing”). Perusing the list of CSI provisioners, Ceph caught my eye.
I watched “Intro to Ceph”, and it seemed like a no-brainer. If I ever got another server, it would be drop-dead simple to add its drives to the cluster, and I could manage its configuration from within the Kubernetes cluster itself using Rook, requiring no out-of-band setup or config.
So, I extended the bootstrap scripts to enable the required kernel modules and added Rook’s configuration, telling Ceph to use all available drives and to use an `osd` failure domain rather than its default of `host`, since this would be a single-node cluster.
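For reference, the knobs in question live in Rook’s `CephCluster` and `CephBlockPool` resources. This is a trimmed-down sketch rather than my exact manifests; the image tag, pool name, and replica count are illustrative:

```yaml
# Tell Rook/Ceph to consume every drive it finds on every node.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v17   # illustrative tag
  dataDirHostPath: /var/lib/rook
  mon:
    count: 1                       # single node, so a single mon
  storage:
    useAllNodes: true
    useAllDevices: true
---
# Replicate across OSDs instead of hosts, since there's only one host.
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  failureDomain: osd
  replicated:
    size: 2                        # illustrative; pick to taste
```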
To my surprise, the installation failed spectacularly. `ceph status` would sometimes report communication errors and sometimes hang indefinitely, and, when it did complete, reported `HEALTH_ERR`; `mon`s would `CrashLoopBackOff`; drives weren’t assigned an `osd`… I had no idea what was going on. After much Googling, I found a GitHub issue detailing similar symptoms, with newer Linux kernels seemingly the cause for some commenters. I was on `6.1.4`, the bleeding-edge default, so this seemed like it could have also been the issue for me; unfortunately, even after trying with the `linux-lts` kernel (`5.15.82` at the time of writing), the `CrashLoopBackOff`s and communication errors persisted.
Apparently, Ceph has only been officially validated on outdated versions of CentOS and Ubuntu, and I didn’t want to repeat the “no upgrades” failure of my then-current setup. I braced myself for the coming trial-and-error.
Talos + Sidero Metal
After distro-hopping a few times, both OS and Kubernetes — `k3s` on Arch, MicroK8s on Ubuntu (`apt` sucks, by the way), and TrueNAS Scale — I found CoreOS. CoreOS was a tiny OS whose sole purpose was to run containerized applications; it was discontinued after the company was acquired by Red Hat. I didn’t like the “automatic updates” feature of its successors, Fedora CoreOS and CentOS Stream, but I knew what I was looking for.
I asked around, and @malte recommended Garden Linux. Unfortunately, I couldn’t get its builds working on my NixOS desktop, so I pressed onwards and discovered Talos, a CNCF project that, seemingly like Garden Linux, provides an immutable boot image whose sole purpose is to run Kubernetes.
Talos provides supplementary software for server provisioning (exactly what I was trying to do!) called Sidero Metal, whose setup was surprisingly simple. In general:
- Spin up a temporary local “management plane” and let it discover the `Server`s you have available.
- Confirm you want Sidero Metal to manage those `Server`s, adding them to the pool (it’ll wipe the drives once you do); see the sketch below.
- Configure a `Cluster` (along with a few other resources) that uses those `Server`s. Sidero Metal automatically PXE boots those servers and installs Kubernetes.
- Move the “management plane” into the new `Cluster` — at this point, the cluster is self-contained, effectively managing itself!
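As promised, here’s roughly what the second step looks like. This is a sketch from memory of Sidero Metal’s `metal.sidero.dev/v1alpha1` CRDs rather than my actual resources; the server name below is a made-up SMBIOS UUID, and the `Cluster` plus its Cluster API siblings are usually generated with `clusterctl`, so I won’t reproduce them here.

```yaml
# Accepting a Server that the management plane discovered over PXE.
# Flipping accepted to true is the point of no return: Sidero Metal
# now considers the machine managed and will wipe its disks.
apiVersion: metal.sidero.dev/v1alpha1
kind: Server
metadata:
  name: 2f1d0258-1a2b-3c4d-5e6f-000000000000   # hypothetical SMBIOS UUID
spec:
  accepted: true
```

Accepted servers can then be grouped by a `ServerClass` (if I recall correctly, Sidero ships a catch-all `any` class) for the `Cluster`’s machine templates to draw from.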
Luckily, the Rook/Ceph installation went smoothly, and I was able to provision a `PersistentVolumeClaim` without issue!
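To put “without issue” in concrete terms: with Rook’s block `StorageClass` in place (it’s named `rook-ceph-block` in Rook’s examples; yours may differ), dynamic provisioning is just an ordinary claim, and Ceph carves out a volume behind the scenes. The workload name here is hypothetical:

```yaml
# An ordinary PersistentVolumeClaim; the rook-ceph-block StorageClass
# (from Rook's examples) dynamically provisions a Ceph-backed volume for it.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gitea-data              # hypothetical workload
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: rook-ceph-block
  resources:
    requests:
      storage: 10Gi
```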
I’m actually glad the Arch install didn’t go as planned, because I can imagine a future where something tiny changes and I have to spend far too long debugging the bootstrap script. I learned a lot writing it (both about Bash and about Kubernetes), and I’m glad I did so… but I’m glad to not have to take on another maintenance task.
I slowly started moving things over, starting with the easy stuff and working up to the services that I didn’t previously have containerized. Today, a little less than a month later, I powered off the last of the smorgasbord of Pis and the like. Everything is defined in code, pushed to GitLab, and running on the cluster — even the cluster itself, with the help of Sidero Metal! I essentially have a cloud-native environment sitting in my living room, and I already can’t wait to expand it. 🎉