Kubernetes Community Days Munich 2023

18 July 2023 Munich, Germany

Honey, I Shrunk the Datacenter: Operating Bare-Metal Kubernetes at Home for Fun and Data Sovereignty

slides.pdf

Abstract

Operating a Kubernetes cluster inside your home is not only inherently fun but also provides an excellent environment for learning and experimentation. Last but not least, keeping all data within one’s own four walls is an essential prerequisite for consistent data sovereignty when self-hosting for oneself or acquaintances.

This talk presents a number of requirements that can be encountered when running Kubernetes in the home environment. Based on the experiences of half a decade, a vanilla Kubernetes in a heterogeneous environment (amd64/arm64) turns out to be a flexible solution allowing continuous replacement and upgrade of both hardware and software. The speaker’s individual setup with regard to hardware selection and system architecture will be briefly showcased in order to present potential solutions for energy efficiency, failover and resilience, encryption at rest, storage, computing, network, load balancing, identity management and backup. In addition to discussing challenges in running bare-metal Kubernetes (in the home environment), this talk is intended to inspire and motivate running your own cluster.

Reclaim your data sovereignty!

Transcript

Intro

Hey everyone and welcome to my talk! I hope you had a great and inspiring conference here.

echo $(whoami)

This presentation can also be found online by scanning this QR code.

My name is Arik Grahl and I work at SysEleven. SysEleven is a Berlin-based hosting company with datacenters in Berlin and Frankfurt and a dedicated network infrastructure. We have an OpenStack-based cloud offering and a managed Kubernetes offering on top of it.

My team develops the highest abstraction layer, which is a software supply chain management system with application lifecycle management on top of this stack. My official role is an SRE, but my daily business mostly deals with Kubernetes controllers and operators, which we develop in Go. This is also where my roots are: I have a background as a full-stack web developer but have always been used to operating the stack down to bare metal myself. I am a passionate free and open source software advocate, a notorious self-hoster, and would consider myself a data hoarder.

If you would like to reach out, you can contact me via any of these channels.

Why Data Sovereignty? (1/3)

Before we answer the motivational question of why one would strive for data sovereignty, let’s take a closer look at what data sovereignty actually means.

For me data sovereignty is an implementation of strict privacy.

Why Data Sovereignty? (2/3)

With this notion of privacy in mind, we can now focus on the question of why one would want data sovereignty.

The main factor is described by surveillance capitalism, a term coined by the US economist Shoshana Zuboff. By this she means a capitalist system that collects personal data by technical means, gathers information about behavior, analyzes it and processes it for market-driven decision-making. But the centralization of personal data can be used beyond such market-driven decision-making: probably everyone has heard of Cambridge Analytica, the self-described “global election management agency”. The company became known after it collected and analyzed the personal data of Facebook users in order to generate customized political advertising, which was used to influence the US elections.

Why Data Sovereignty? (3/3)

So what can be done to at least mitigate some of these problems? Well, besides political influence, one implication for me is to shift towards self-hosting. At its very core, the services you use must be strictly self-owned, which means, in other words, that data must be physically present at sites where ideally only you have access.

Self-hosting Requirements (1/6)

Before we go through the hardware and software of such a setup, let’s briefly discuss which requirements it should have. These requirements reflect my personal needs and those of my family and friends using it. Obviously, if you plan to self-host in a more professional way, your requirements may look different.

That being said, the first requirement is cost-effectiveness. The costs of buying and maintaining the setup are covered privately by me, and since I do not operate the services commercially, there is no revenue to offset them.

Self-hosting Requirements (2/6)

The second requirement is energy efficiency. Since energy costs money, it is somewhat related to the previous requirement, but the climate crisis is also real, so I do not want to exploit the planet by wasting energy. Another aspect is that a lot of energy translates into a lot of heat, which is released into my apartment. While this side effect is quite welcome in the winter, in the summer you do not want to emit unnecessary heat into the apartment.

Self-hosting Requirements (3/6)

The third requirement is that the hardware should be generally available. I am strictly limiting myself to commodity hardware. No specialized hardware is allowed, and the broader the availability of spare parts, the better.

Self-hosting Requirements (4/6)

The fourth requirement is what I would call being opinionated. For me this especially means not sacrificing most of the cloud-native patterns, and with that I also assume some kind of modern infrastructure.

Self-hosting Requirements (5/6)

My last requirement is to provide both sufficient performance and availability. While I will not perform any Big Data or High-Performance Computing, the setup should be sufficiently fast to handle medium-intensive workloads. Furthermore, it should provide the best possible availability and tolerate partitions. Speaking in terms of the CAP theorem: my data is too valuable to sacrifice consistency, and although I do my best to minimize network partitions, I cannot prevent them completely, so I am most likely to sacrifice availability.

Self-hosting Requirements (6/6)

To wrap this up, here are my non-goals for the setup. I will not and cannot provide scalability, especially not to a very large extent; instead, the setup will be quite static. Providing a setup with a clean multi-tenancy concept is also out of scope and for me personally not required.

Power Supply (1/2)

With the requirements in place, we can now focus on the actual setup. We start with the lowest level, which is the power supply.

Data centers usually have redundant main power distribution with an automatic transfer switch. They are also equipped with diesel generators and uninterruptible power supplies.

For obvious practical reasons, the home setup is limited to a single UPS. This particular model provides surge protection and power of up to 240 W. At that peak power the integrated battery lasts for at least 5 minutes, but my tests show significantly longer runtimes. In contrast to most modern appliances, the battery is even replaceable and, based on practical experience, lasts on the order of years.

Power Supply (2/2)

Any large plant incorporates some kind of emergency power off, and since we have a UPS in place, we can no longer rely on circuit breakers or pulling the plug.

That is why I added a switchable power outlet strip, so that I can easily toggle the switch. The reasons are twofold: mainly safety, to mitigate any kind of malfunction, but also security, in case third parties gain physical access to the hardware.

Server Rack

Data centers obviously have compartments and racks which host the servers. From the perspective of a private person these are quite pricey, and most of the hardware I am using is not rack-mountable anyway.

That’s why I encourage you to build a wooden server rack yourself. It is inexpensive and you can optimize its design for your individual dimensions. Since my setup looks like a shelf, it is also very maintenance-friendly and every device is easily accessible.

Networking (1/4)

Unfortunately, home internet involves proprietary protocols. In Germany this is mostly VDSL2, sometimes with vectoring, over copper telephone cable, and DOCSIS over coaxial cable. This means that the provider’s router is de facto mandatory.

For sanity, I consider everything beyond the WAN adapter of the router as a black box and concentrate on my local network.

Networking (2/4)

Data centers come with sophisticated internal network infrastructure, including VLANs. Such infrastructure takes some effort to set up and maintain. An unmanaged switch, on the other hand, is a low-maintenance solution, while obviously lacking those features. But the main reason to stick with such a “boring” solution is that most of the deployed mid-budget devices do not support bandwidths higher than one gigabit per second.

This specific device has 16 gigabit Ethernet ports, which allow me to connect all my devices and still leave some spare ports to interconnect different rooms of my apartment.

Networking (3/4)

So, overall the network architecture looks quite simple: At the very heart I have deployed a switch which interconnects the router on the one hand and all nodes of the cluster on the other hand each via ethernet.

Networking (4/4)

One more, mostly software-based, addition to the network architecture is WireGuard as my VPN solution of choice, which provides both connectivity for clients from outside networks and interconnection with friends’ private IP networks.

Control Plane (1/3)

After we have considered the power supply, the server rack and the network infrastructure, we finally come to the nodes of the cluster itself. There are three types of nodes I distinguish here: the control plane nodes, which take care of control tasks; the storage nodes, which are mainly responsible for persisting any kind of data; and the worker nodes, the compute tier, which deal with the execution of arbitrary workloads.

We start with the nodes of the control plane: for those nodes, I am using the latest Raspberry Pi 4, which is equipped with a quad-core 64-bit ARM CPU and four gigabytes of RAM. It also has two USB 3.0 ports and gigabit Ethernet. I have connected a 256 gigabyte SSD via one of the USB ports. Since I am operating three of those nodes for high availability reasons, I have mounted them in a cluster case, which you can see here in the picture. Don’t get confused: the picture shows older Raspberry Pis, and four of them. I have extended this cluster case with a custom 3D-printed mount for the connected SSDs.

Control Plane (2/3)

Including the control plane nodes, our overall architecture would now look like the following. We have three control plane nodes, each equipped with an SSD via USB.

Each of these nodes has a gigabit Ethernet connection to the switch.

Control Plane (3/3)

Each control plane node is basically a standard Kubernetes control plane node, consisting of the usual components: etcd as the datastore, the kube-apiserver, the controller manager and the scheduler. CoreDNS completes this setup.

It is worth noting that I am not using any particular downsized Kubernetes distribution here, but vanilla Kubernetes managed with kubeadm.
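To give a rough idea of what such a kubeadm-managed setup can look like, here is a minimal sketch of a ClusterConfiguration; the version, endpoint and pod CIDR are placeholder assumptions, not the actual values of my cluster.

    # Minimal kubeadm ClusterConfiguration sketch -- all values are placeholders.
    apiVersion: kubeadm.k8s.io/v1beta3
    kind: ClusterConfiguration
    kubernetesVersion: v1.27.3                        # assumed version
    controlPlaneEndpoint: "kube-api.home.arpa:6443"   # stable endpoint in front of the three nodes
    networking:
      podSubnet: 10.244.0.0/16                        # assumed pod CIDR

The first control plane node would then be initialized with kubeadm init --config together with --upload-certs, and the other two would join with kubeadm join --control-plane.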

Storage (1/3)

The next group of nodes are the storage nodes. Here, I am deploying an ODROID-HC4, which also has a quad-core 64-bit ARM CPU and 4 gigabytes of RAM. The special feature of this system on a chip is the presence of two SATA ports attached via PCI Express. It also features a gigabit Ethernet port. I have equipped one SATA port with a 512 gigabyte SSD and the other with a 12 terabyte HDD.

Again, for high availability reasons I have deployed three of those nodes.

Storage (2/3)

Taking the storage nodes into account leads to this system architecture. Every storage node is equipped with an SSD and an HDD, and each of them is likewise connected to the switch via gigabit Ethernet.

Storage (3/3)

The storage nodes essentially represent a Ceph cluster, where the SSDs and HDDs each form a storage pool. The Ceph cluster provides all the usual storage types, such as block storage in the form of RBD, a distributed file system via CephFS and object storage via RGW.

I use Rook as a Ceph distribution, which offers me a fully managed Ceph inside a Kubernetes cluster.
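As an illustration of how the two pools can be expressed with Rook, here is a sketch of two CephBlockPool resources; the names and replica counts are assumptions on my part.

    # Sketch of SSD- and HDD-backed pools in Rook -- names and sizes are assumptions.
    apiVersion: ceph.rook.io/v1
    kind: CephBlockPool
    metadata:
      name: rbd-ssd
      namespace: rook-ceph
    spec:
      deviceClass: ssd      # only OSDs backed by the SSDs
      failureDomain: host   # spread replicas across the three storage nodes
      replicated:
        size: 3
    ---
    apiVersion: ceph.rook.io/v1
    kind: CephBlockPool
    metadata:
      name: rbd-hdd
      namespace: rook-ceph
    spec:
      deviceClass: hdd      # only OSDs backed by the HDDs
      failureDomain: host
      replicated:
        size: 3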

Compute (1/2)

Last but not least come the compute nodes, the workers for the main workloads inside the cluster.

For these, I am using Intel NUCs with a quad-core 64-bit CPU. I have populated 16 of the maximum 32 gigabytes of supported RAM. Furthermore, the NUCs are equipped with an M.2 slot and a gigabit Ethernet port. Via the M.2 slot I have provided 512 gigabytes of NVMe storage for ephemeral purposes.

Compute (2/2)

The overall architecture is now completed with the worker nodes.

Each of them is equipped with fast NVMe storage and connected via gigabit Ethernet, like the other nodes.

Maintenance Console

Now that we have these nodes in place, they need to be set up and maintained. Data centers have various maintenance console solutions in place for low-level interactions such as changing BIOS/UEFI settings or installing a new operating system.

My cost-effective yet flexible solution consists of a cheap HDMI capture card together with a spare Raspberry Pi 4. The latter has USB On-The-Go enabled, which is capable of emulating arbitrary USB devices, in particular human interface devices as well as mass storage.

To sum this up, we can receive a video signal via the HDMI capture card and emulate a mouse, a keyboard and even a USB stick with the single OTG-enabled USB port of the Raspberry Pi 4. PiKVM is a Linux distribution optimized for this use case. After booting, it offers a convenient web interface which allows low-latency interaction with the target system. One can even select from a list of ISO images in the web interface, which can then be provided to the target system for booting.

Hardware Reset

In every sufficiently complex system there will be failures which can only be resolved by a hardware reset. In data centers there is usually an interface for this, and for virtual machines it is a software feature anyway.

My solution features ZigBee smart plugs, which can be switched off and on wirelessly. Most SoCs boot whenever power becomes available again, and the NUCs can be configured accordingly, so this solution behaves like a hardware reset. It is very inexpensive and even enables power monitoring as a side effect.

Encryption at Rest

Data centers have 24/7 security guards and razor wire in place, and used HDDs undergo a destruction protocol to protect the data. For that reason most consumers do not encrypt their data at rest. In a home setup you cannot guarantee those protective measures, and theft cannot be ruled out either.

For that reason I strictly encrypt all of my disks with LUKS. I use full disk encryption, which leaves only the boot partition unencrypted. LUKS offers relatively high security while still maintaining good performance thanks to hardware acceleration. Every node boots into an initramfs that contains a Dropbear SSH server. This conveniently enables remote unlocking of the disks, which is further automated with Ansible.
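As a sketch of what such an automated unlock could look like on a Debian-style Dropbear initramfs, assuming the cryptroot-unlock helper and a vaulted luks_passphrase variable (both assumptions, not necessarily my exact setup):

    # Hypothetical Ansible play: unlock LUKS through the Dropbear shell in the initramfs.
    - name: Remotely unlock encrypted disks
      hosts: cluster
      gather_facts: false        # the initramfs has no Python, so no fact gathering
      vars:
        ansible_user: root
      tasks:
        - name: Feed the passphrase to cryptroot-unlock
          ansible.builtin.raw: echo -n '{{ luks_passphrase }}' | cryptroot-unlock
          no_log: true           # keep the vaulted passphrase out of the logs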

Backups (1/3)

After we have talked about data security, let’s have a look at data safety. Usually, you should perform backups following the 3-2-1 rule: there should be three copies of the data on two different media, with one copy off-site. The last aspect of this rule implies encryption at rest in order to still maintain data sovereignty.

For me, both BorgBackup and restic are proven solutions; the latter can also deal with object stores as remote targets and seems more modern overall. I can also recommend the Hetzner Storage Box as a remote target, which is very cost-effective. It is very flexible, providing support for both borg and restic, and can easily be scaled up and down.

Backups (2/3)

All of my backups follow a certain pattern, which I want to briefly demonstrate to you now. A backup is a Kubernetes CronJob with a defined schedule, for example every night at 4 o’clock. It has the concurrencyPolicy set to Forbid, so that new backup attempts are not started until the previous one has terminated.

In this example I am making a backup of the PVC of one of my workloads. I am using a container image which includes restic and provide both the backup’s secret key and its repository endpoint as environment variables. The Kubernetes Secret also includes the SSH key, which I provide via a filesystem mount. The PVC which is the subject of this backup is mounted as well and serves as our working directory.
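A sketch of this pattern might look like the following; names, image tag and paths are placeholders rather than my exact manifests, and the restic invocation itself follows on the next slide.

    # Backup CronJob sketch -- names, image and paths are placeholders.
    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: backup-nextcloud-data
    spec:
      schedule: "0 4 * * *"          # every night at 4 o'clock
      concurrencyPolicy: Forbid      # never start a new run while one is still active
      jobTemplate:
        spec:
          template:
            spec:
              restartPolicy: OnFailure
              containers:
                - name: restic
                  image: restic/restic:0.16.0      # assumed image and tag
                  workingDir: /data                # the mounted PVC is the working directory
                  envFrom:
                    - secretRef:
                        name: backup-credentials   # RESTIC_PASSWORD, RESTIC_REPOSITORY, ...
                  volumeMounts:
                    - name: data
                      mountPath: /data
                    - name: ssh
                      mountPath: /root/.ssh        # SSH key for the SFTP repository
                      readOnly: true
                  # command and args with the restic calls are sketched below
              volumes:
                - name: data
                  persistentVolumeClaim:
                    claimName: nextcloud-data      # the PVC being backed up
                - name: ssh
                  secret:
                    secretName: backup-credentials
                    defaultMode: 0400              # private key must not be world-readable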

Backups (3/3)

At the very heart of this CronJob are the invocations of restic. The first part of the command performs a backup, with high compression, of the current working directory, denoted by the dot here. Every backup run is followed by a consolidation, which is the second part of the command. The forget subcommand together with --prune applies a retention policy specified by the arguments with the --keep prefix. In this example we only keep this week’s daily snapshots, this month’s weekly snapshots and this year’s monthly snapshots. The -1 indicates that we keep yearly snapshots forever.
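Put into the container spec of the CronJob sketched earlier, the two invocations could look roughly like this; the retention numbers are illustrative, only the -1 for yearly snapshots is taken directly from the slide.

    # Container command/args sketch -- retention values are illustrative.
    command: ["/bin/sh", "-c"]
    args:
      - >-
        restic backup --compression max . &&
        restic forget --prune
        --keep-daily 7 --keep-weekly 5
        --keep-monthly 12 --keep-yearly -1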

Based on my experience, this approach is pretty solid and very flexible as well. One could also back up host directories by mounting them into the container, or back up database dumps by streaming them to the standard input of restic backup.

Operation

Setting up an infrastructure is all fun and games until you have to deal with challenges of day-two operations. It is not a secret that automation saves the day here.

I use Ansible to manage the bare-metal infrastructure. With infrastructure as code I can easily re-provision specific nodes and perform maintenance such as updates or reboots. Every playbook has a variant which performs its tasks in a rolling fashion. Since I have three nodes of every type and high availability mostly in place, I usually do not have to intervene to maintain acceptable availability of my services.
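A rolling maintenance playbook can be sketched like this; the group name, the Debian-based package handling and the drain/uncordon steps are assumptions about how one might wire it up, not my literal playbook.

    # Rolling OS updates, one worker at a time -- a sketch, not the actual playbook.
    - name: Rolling worker maintenance
      hosts: workers
      serial: 1                      # touch only one node at a time
      tasks:
        - name: Drain the node
          ansible.builtin.command: >
            kubectl drain {{ inventory_hostname }}
            --ignore-daemonsets --delete-emptydir-data
          delegate_to: localhost

        - name: Apply pending updates          # assuming a Debian-based OS
          ansible.builtin.apt:
            upgrade: dist
            update_cache: true
          become: true

        - name: Reboot the node
          ansible.builtin.reboot:
          become: true

        - name: Uncordon the node
          ansible.builtin.command: kubectl uncordon {{ inventory_hostname }}
          delegate_to: localhost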

With Helm and helmfile I manage most of my Kubernetes application stacks. With kubectl I deal with all unpackaged applications and other ad hoc workloads.
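For reference, a minimal helmfile.yaml tying a couple of releases together could look like this; the chart selection and values paths are examples, not my full stack definition.

    # Minimal helmfile sketch -- releases and values paths are examples.
    repositories:
      - name: ingress-nginx
        url: https://kubernetes.github.io/ingress-nginx
      - name: prometheus-community
        url: https://prometheus-community.github.io/helm-charts

    releases:
      - name: ingress-nginx
        namespace: ingress-nginx
        chart: ingress-nginx/ingress-nginx
        values:
          - values/ingress-nginx.yaml
      - name: kube-prometheus-stack
        namespace: monitoring
        chart: prometheus-community/kube-prometheus-stack
        values:
          - values/kube-prometheus-stack.yaml

A single helmfile apply then reconciles the whole set of releases.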

Loadbalancer (1/2)

Data centers usually provide hardware solutions for loadbalancing and providing stable IP endpoints to the infrastructure.

In a home setup we want a pure software solution, for which keepalived is a viable candidate. It can perform IP Virtual Server (IPVS) Layer 4 load balancing and implements the Virtual Router Redundancy Protocol (VRRP) for high availability.

Loadbalancer (2/2)

I would like to demonstrate this protocol to you with a practical example.

Every worker node has a private local IP address, and keepalived is installed on each of these nodes. The workers then exchange VRRP messages and eventually one of them becomes the leader. The leader gets an additional local virtual IP address assigned, which from the client’s perspective is a stable IP endpoint. Once the leading worker fails for whatever reason or goes down for maintenance, the virtual IP address is passed on to another worker node through the VRRP exchange of the remaining nodes. After the ARP tables are in sync again, the client can keep communicating with the cluster via the same IP address, now through a different worker node.

Ingress

The HTTP ingress is pretty straightforward. Just like for usual cloud setups one can rely on the Ingress Nginx Controller here.

I have deployed the controller as a DaemonSet on every worker node, and since I do not have a traditional load balancer per se, I have configured the externalIPs of the controller’s Service to point to the virtual IP address I have just illustrated.
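The relevant part of the controller Service then boils down to something like this; the virtual IP below is an example value, not my actual address.

    # ingress-nginx controller Service sketch -- the VIP is an example value.
    apiVersion: v1
    kind: Service
    metadata:
      name: ingress-nginx-controller
      namespace: ingress-nginx
    spec:
      selector:
        app.kubernetes.io/name: ingress-nginx
        app.kubernetes.io/component: controller
      externalIPs:
        - 192.168.0.100        # virtual IP managed by keepalived
      ports:
        - name: http
          port: 80
          targetPort: 80
        - name: https
          port: 443
          targetPort: 443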

Database

For simplicity I use a central PostgreSQL cluster which is shared among all my services.

I found CloudNativePG, a PostgreSQL operator, to be a very good solution. It provides high availability and full lifecycle management of the PostgreSQL cluster. It also abstracts PgBouncer as a connection pooler, with the common modes session, transaction and statement. These translate to Service endpoints, allowing the flexibility to choose the desired one on a per-application basis.
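A sketch of a CloudNativePG Cluster together with a transaction-mode Pooler could look like this; names, sizes and the storage class are assumptions.

    # CloudNativePG Cluster plus PgBouncer Pooler sketch -- names and sizes are assumptions.
    apiVersion: postgresql.cnpg.io/v1
    kind: Cluster
    metadata:
      name: shared-postgres
    spec:
      instances: 3                  # spread across the nodes for high availability
      storage:
        size: 20Gi
        storageClass: ceph-rbd      # assumed Rook/Ceph block storage class
    ---
    apiVersion: postgresql.cnpg.io/v1
    kind: Pooler
    metadata:
      name: shared-postgres-pooler-tx
    spec:
      cluster:
        name: shared-postgres
      instances: 2
      type: rw
      pgbouncer:
        poolMode: transaction       # session and statement poolers work analogously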

Monitoring

Just like the ingress, the monitoring stack is also pretty straightforward.

One can rely on the Kube Prometheus Stack here, just like in most cloud-native environments. It bundles Prometheus, Node Exporter and Alertmanager together with a Grafana frontend. Everything is pre-configured to fit together, and Grafana comes pre-loaded with a lot of beautiful dashboards.

Authentication and Authorization (1/2)

The more services are available, the greater the need to authenticate against them centrally. For that reason I am operating Keycloak for Identity and Access Management. I mainly rely on its OpenID Connect integration, which gives me rich integration with most services.

As an addition I am using the OAuth2 Proxy for HTTP endpoint authentication. It is a generic solution and a good match for any service which does not have any authentication concept in place. One can even implement simple authorization through defined groups.

Authentication and Authorization (2/2)

Since I consider the OAuth2 Proxy a pretty underrated solution, I want to briefly demonstrate how generic and easy the integration is. A given Ingress object essentially receives a series of annotations. The auth-url can carry an optional comma-separated list of allowed_groups for simple authorization. With auth-response-headers one can specify which user fields should be forwarded to the downstream service that is being authenticated. These headers can be consumed by the application, for instance to identify the user. The rest of the annotations are boilerplate.
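In concrete terms, the annotated application Ingress might look like the following; host, service and group names are placeholders.

    # Annotated application Ingress sketch -- host, service and group names are placeholders.
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: example-app
      annotations:
        nginx.ingress.kubernetes.io/auth-url: "https://app.example.org/oauth2/auth?allowed_groups=app-users"
        nginx.ingress.kubernetes.io/auth-signin: "https://app.example.org/oauth2/start?rd=$escaped_request_uri"
        nginx.ingress.kubernetes.io/auth-response-headers: "X-Auth-Request-User,X-Auth-Request-Email"
    spec:
      ingressClassName: nginx
      rules:
        - host: app.example.org
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: example-app
                    port:
                      number: 80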

Last but not least, a second, mostly generic Ingress object is introduced. Its purpose is to route the previously specified /oauth2 path under the same domain as the original Ingress to the OAuth2 Proxy, which can be operated centrally for multiple services. This enables generic authentication and even authorization, forwarding unauthenticated users to Keycloak via the OAuth2 Proxy.
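The companion Ingress for the /oauth2 path might then be sketched as follows; the Service name and port of the centrally operated OAuth2 Proxy are placeholders.

    # Generic /oauth2 Ingress sketch -- proxy Service name and port are placeholders.
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: example-app-oauth2
    spec:
      ingressClassName: nginx
      rules:
        - host: app.example.org
          http:
            paths:
              - path: /oauth2
                pathType: Prefix
                backend:
                  service:
                    name: oauth2-proxy     # central OAuth2 Proxy Service
                    port:
                      number: 4180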

Development Platform

I am a developer at heart, so of course I need a platform for that kind of purpose.

I find GitLab to be my batteries-included solution of choice. It provides a version control system and is highly integrated with the corresponding CI infrastructure. The container registry is also a valuable asset and an essential part of my cluster infrastructure.

Communication (1/2)

A crucial aspect of data sovereignty is communication. Even today, email is first and foremost a central service. As you can certainly already tell from the logo designs, this is very traditional software.

I have a containerized Postfix for relaying outgoing email to an upstream SMTP server. I am operating Dovecot as my IMAP server, which represents the central long-term storage of my email. With Fetchmail I poll an upstream mailbox and download it to my IMAP server.

I am aware that this setup relies on another managed email service. But with a private IP address it is practically impossible to run a full email service yourself. The fact that federated email communication is almost never end-to-end encrypted discourages me from going the extra mile here.

Communication (2/2)

The second aspect of communication is chat, for which I use mainly Mattermost.

A while ago I also started considering Matrix, since it is federated and also offers communication bridges to other systems.

Groupware and Collaboration

No home setup would be complete without a Nextcloud. It has its rough edges and quirks, but the feature set and ecosystem are quite large.

At its core is file sharing via browser, desktop client, mobile app or anything which speaks WebDAV. It features calendar and contact management via browser or via CalDAV- and CardDAV-enabled clients, respectively. Notes, a news reader, GPS tracks and recipe management via their respective apps are also valuable companions for everyday life.

Media and Entertainment

Another valuable addition for everyday life is Jellyfin as a media and entertainment solution. It is capable of dealing with different media types: music, movies and series, among others.

It has an ecosystem of clients: While the browser is a good general purpose web interface, there are apps for AndroidTV and AppleTV as well as mobile apps. Not surprisingly, it has an excellent Kodi integration, which delegates media management to Jellyfin.

Home Automation and IoT (1/2)

A bare-metal Kubernetes cluster at home is also an environment to set up various home automation and IoT services in a local and privacy-friendly manner. For that reason I have deployed two distinct wireless standards: the main one is ZigBee and the second one is ISM via a software defined radio. In order to connect sensors, actuators and automation engines, I have deployed an MQTT event bus.

With Zigbee2MQTT I make all ZigBee devices available on that event bus, and with rtl_433’s MQTT output I bridge ISM-based signals to the event bus. From the user’s perspective, Home Assistant is the main component, which is also connected to MQTT. It is mainly a frontend to control, visualize and automate. The web interface in the browser works excellently and it comes with very good mobile apps. It even features a voice assistant based on privacy-friendly, local, open source neural networks: OpenAI’s Whisper as a speech-to-text engine and Piper as a text-to-speech engine.

Home Automation and IoT (2/2)

Since there is a bit of external hardware involved, I want to showcase how this can be interconnected with a Kubernetes cluster. Home-automation-related workloads mostly run on the worker nodes, like all the other workloads. Unfortunately, there is no high availability solution to share a USB device among different host systems. For that reason the devices are plugged into specific worker nodes.

For example, the webcam is connected to worker-1, the ZigBee transceiver to worker-2 and the RTL-SDR, our software defined radio, to worker-3. Via specific nodeSelector labels I schedule the pods of a StatefulSet to the respective nodes: motion-hall-0 to worker-1, zigbee2mqtt-0 to worker-2 and sdr-0 to worker-3.

The main interconnection is the MQTT event bus, which is a StatefulSet running somewhere in the cluster with a corresponding ClusterIP Service. The pod/motion-hall-0 performs motion detection based on the webcam data and sends MQTT events via TCP to that Service. The same goes for pod/zigbee2mqtt-0, which bridges ZigBee data via TCP to the event broker, and pod/sdr-0, which sends received radio signals as MQTT events via TCP as well. Home Assistant is also set up as a StatefulSet running somewhere in the cluster. It heavily relies on the MQTT event bus, again via the Service, and both receives and publishes MQTT events there. This is how one can set up a completely local home automation with rich features and full data sovereignty.
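To illustrate the node pinning, a zigbee2mqtt StatefulSet could be sketched like this; the image, labels and device path are assumptions, not my exact manifest.

    # Pinning a USB-dependent workload to its node -- image, labels and device path are assumptions.
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: zigbee2mqtt
    spec:
      serviceName: zigbee2mqtt
      replicas: 1
      selector:
        matchLabels:
          app: zigbee2mqtt
      template:
        metadata:
          labels:
            app: zigbee2mqtt
        spec:
          nodeSelector:
            kubernetes.io/hostname: worker-2     # the node with the ZigBee transceiver attached
          containers:
            - name: zigbee2mqtt
              image: koenkk/zigbee2mqtt:latest   # assumed upstream image
              securityContext:
                privileged: true                 # raw access to the serial device
              volumeMounts:
                - name: zigbee-adapter
                  mountPath: /dev/ttyUSB0
          volumes:
            - name: zigbee-adapter
              hostPath:
                path: /dev/ttyUSB0               # assumed device path on worker-2
                type: CharDevice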

Secret Management

Although passwordless authentication with WebAuthn is in sight, we won’t get rid of traditional passwords any time soon.

That’s why I am operating Vaultwarden, which is a Bitwarden-API-compatible alternative written in Rust. Conveniently, it can make use of the Bitwarden ecosystem and, besides the browser interface and the extension, also offers mobile apps and a CLI.

Future Work

Although the setup is quite remarkable, I have a few items on my to-do list and am mainly blocked by time constraints, as I am doing this in my free time. One feature I would like to have is an elastic node setup, where I can scale the number of workers up and down based on need. This would optimize power consumption; however, some worker-local properties like the dedicated USB hardware would then have to be set up differently.

An evergreen on my to-do list is the modernization of the stack. As you may have noticed, there is quite some rather traditional software around (I am looking at you, email). I am striving to continuously replace these legacy services with modern, twelve-factor apps. Strongly linked to this is the migration of workloads which require a shared filesystem to a modern object store.

The last big aspect is automation, where I want to reduce manual work and intervention. I still see the most potential in the field of application lifecycle management.

Conclusion

This brings me to the end of my talk and it is time to wrap it up.

I’ll be honest with you: it is quite some work, but overall it is a lot of fun and most of the time worth the effort. Thanks to redundancy, I would consider my setup extremely maintenance-friendly, and I am able to replace nodes at any time. My cluster represents a great environment to learn and experiment, and can even be the basis for educating others. I have learned that energy-efficient operation is possible, mostly because of energy-efficient systems on chips. Furthermore, I have shown that even in home setups there is no need to sacrifice storage provisioning, “HA” load balancing and ingress, or rich services in general.

I hope you found the talk inspiring and I am curious to hear about your cluster or answer some questions.