More devops than I bargained for
Thanks to my sponsors: Chris Walker, Chirag Jain, (18D)eezNuts, L0r3m1p5um, Matt Jackson, Alan O'Donnell, Gorazd Brumen, Berkus Decker, callym, Cole Kurkowski, Dennis Henderson, Pete Bevin, Chris Thackrey, Jack Duvall, Marcus Brito, Daniel Papp, Paige Ruten, Bob Ippolito, Johnathan Pagnutti, Mikkel Rasmussen and 262 more
Background
I recently had a bit of impromptu disaster recovery, and it gave me a hunger for more! More downtime! More kubernetes manifests! More DNS! Ahhhh!
The plan was really simple. I love dedicated Hetzner servers with all my heart, but they are not very fungible.
You have to wait entire minutes for a new dedicated server to be provisioned. Sometimes you pay a setup fee, et cetera. And at some point, to serve static websites and act as a k3s server, it’s simply too big, and approximately twice the price I should be paying.
I have gotten nervous about the world economy — Amos wrote on April 7th, as the American and Japanese stock markets just crashed — but it’s also a fun optimization problem. How much money do I actually need to spend on my infrastructure to get it to perform the way I want it to?
So I decided to move from an x86_64 dedicated server with 32 gigs of RAM and 16 cores, which cost me about 41 euros per month, to an aarch64 instance with 8 Ampere cores, 16 gigs of RAM, which costs 12 euros a month!
See, it’s not a significant saving, but it’s the first in my fleet of servers that is arm64 — and I figured, well, I recently set up continuous integration and continuous delivery for my CMS software so that it will build and ship x86_64-unknown-linux-gnu and aarch64-apple-darwin binaries as Forgejo generic packages and to a private Homebrew tap, so… what’s one more target?
Right?
Ha ha ha.
Has anyone ever built it for arm64 linux before?
For most things, the answer is yes.
On the “main” / “control” / “k3s server” node, I run services like:
- well, k3s itself, obvs
- cert-manager
- traefik v3 (and I get HTTP/3)
- a full prometheus stack, including grafana
- a couple postgres clusters
- umami for analytics
All of those are either ubiquitous or written in Go, which has excellent tooling for cross compilation, which means they’ve had ARM64 images forever.
A few of my Dockerfile(s) downloaded binaries for stuff like regclient, an ffmpeg static build, etc. — a simple “make this work for arm64 too” prompt to Claude 3.5 Sonnet was enough to add the requisite bashisms:
# Download the archive
echo -e "\033[1;34m📥 Downloading home-drawio \033[1;33m${HOME_DRAWIO_VERSION}\033[0m for \033[1;36m${ARCH_NAME}\033[0m..."

# Map platform architecture to package architecture string
if [ "${ARCH_NAME}" == "amd64" ]; then
  PKG_ARCH="x86_64-unknown-linux-gnu"
elif [ "${ARCH_NAME}" == "arm64" ]; then
  PKG_ARCH="aarch64-unknown-linux-gnu"
else
  echo -e "\033[1;31m❌ Error: Unsupported architecture: ${ARCH_NAME}\033[0m" >&2
  exit 1
fi

curl --fail --location --retry 3 --retry-delay 5 -H "Authorization: token ${FORGEJO_READWRITE_TOKEN}" \
  "https://code.bearcove.cloud/api/packages/bearcove/generic/home-drawio/${HOME_DRAWIO_VERSION}/${PKG_ARCH}.tar.xz" \
  -o "${TEMP_DIR}/home-drawio.tar.xz"
I like to request colors to make the log output more readable, and emojis, which also help with readability. I ask LLMs to generate tools that always show a plan of what they’re going to do first, ask the user for consent, report progress while doing it, and print a summary of actions taken and errors encountered at the end.
I have used them with great success for “devops”: there are a few pieces that need to be really solid, but the rest is all glue. I typically prototype in bash or TypeScript and then port it to Rust if I need it to run fast or be more correct.
Didn’t LLMs lead you astray last time?
Babe, I mean bear, they lead me astray every time. But I’m the one driving.
Fair enough — cool bear said, uttering words Amos had written.
I had forgotten how many moving parts were involved in my own software.
Most native dependencies are just an APT install away, since I use Debian 12 as a base image, and the Debian project has done the hard work of packaging just about everything.
I think the only thing I built from source is libdav1d, so that it’s recent enough.
home-drawio is one of my custom components: it’s a binary that’s able to convert draw.io diagrams to SVG. I used to shell out to node.js instead, but decided I didn’t like it, so now it’s bundled with bun as bytecode:
This is a Justfile, for the just task runner, which replaces make for me, since I already have one (or three) build systems.
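The recipe boils down to a bun “compile to a standalone binary” step. Something like this, as a sketch: the entry point path is hypothetical, and the flags reflect my understanding of bun’s standalone-executable options, not the actual recipe:
# Bundle the TypeScript entry point into a single executable,
# with the JS precompiled to bytecode (entry point path is made up).
bun build ./src/main.ts --compile --bytecode --outfile dist/home-drawio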
Its output is pleasing.
Multi-arch container images
One problem I ran into pretty early is that I had no idea how to make and push a container image that works for multiple architectures.
Up until now, I’d always been building images, and pushing them immediately with tags like:
code.bearcove.cloud/bearcove/beardist:latest
code.bearcove.cloud/bearcove/home:33.0.0
As far as I can tell, the way to go is to pick a convention for arch-specific tags, like:
code.bearcove.cloud/bearcove/beardist:latest-arm64
code.bearcove.cloud/bearcove/beardist:latest-amd64
The fact that arm64 and amd64 look so close from afar is a disgrace, btw.
And then create a multi-arch manifest, which is the thing pushed under :latest.
If you’re using docker to build images, then you can do something like this:
echo -e "\033[1;31m🗑️ Removing existing manifest: \033[0;32m{{BASE}}/$target:latest\033[0m" && \
docker manifest rm "{{BASE}}/$target:latest" || true && \
echo -e "\033[1;36m📝 Creating manifest: \033[0;32m{{BASE}}/$target:latest\033[0m" && \
docker manifest create "{{BASE}}/$target:latest" \
  $(for platform in $PLATFORMS; do \
    arch=$(echo $platform | cut -d/ -f2); \
    echo "{{BASE}}/$target:latest-$arch"; \
  done) && \
echo -e "\033[1;32m📤 Pushing manifest: \033[0;32m{{BASE}}/$target:latest\033[0m" && \
docker manifest push "{{BASE}}/$target:latest"; \
echo -e "\033[1;32m✅ Completed \033[1;33m$target\033[1;32m successfully!\033[0m"; \
If you’re using docker buildx, then it can do multi-arch builds for you! But that is not supported by OrbStack, or at least, I couldn’t get it working.
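For reference, the buildx route looks roughly like this (the builder name is arbitrary; the image ref reuses one from above):
# One-time: create a builder that can target multiple platforms
docker buildx create --name multi --use

# Build both architectures and push a single multi-arch tag in one go
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --tag code.bearcove.cloud/bearcove/home:33.0.0 \
  --push \
  .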
However, that’s irrelevant to me, because most of my Dockerfiles are just there to declare dependencies — I don’t actually build inside of them.
Base images + regctl
See, it’s annoying to need access to a docker daemon in CI. Really, I’m a grown up: I can take on the risk of making the build environment and runtime environment match — I just want to copy my binary into a base image I know and control.
So… I have this repack.sh script; the timestamp handling in it is particularly load-bearing.
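The gist of the timestamp trick is to normalize mtimes before the layer gets tarred, so unchanged content produces a byte-identical, reusable layer. A sketch of the idea, not the actual repack.sh:
# Clamp every mtime to a fixed date: an unchanged directory then tars
# to the exact same bytes, so the resulting layer digest can be reused.
find "$OCI_LAYOUT_DIR" -exec touch -h -d '2000-01-01 00:00:00 UTC' {} +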
This is, like… not nix, but it provides a lot of the value that nix gave me — assembling docker images without docker, allowing us the nice property “if a layer didn’t change, then it can just be reused”.
The rest of the value I got from nix, and from earthly after that (Cthulhu rest its eternal soul), is “don’t rebuild if you don’t need to rebuild”, which I achieved through timelord, a simple utility that saves and restores file timestamps, unless their contents have changed.
I look forward to timelord being completely deprecated by cargo’s checksum-freshness feature, just like I look forward to replacing cargo-sweep with gc, and cargo-hakari with feature unification.
So anyway, this is the important part of the script:
regctl image mod "$BASE_IMAGE" --create "$IMAGE_NAME" \
--layer-add "dir=$OCI_LAYOUT_DIR"
regctl comes from regclient and does not need a docker daemon present.
This adds a layer from a directory (which means it has to tar it and sha256 it — that’s basically all an OCI layer is).
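To make that concrete, here is roughly what “add a layer from a directory” boils down to (a sketch, not regctl’s actual code):
# A layer is just a tarball of the directory's files...
tar -C "$OCI_LAYOUT_DIR" -cf layer.tar .
# ...addressed by the sha256 digest of its bytes...
sha256sum layer.tar
# ...with the byte size recorded next to the digest in the image manifest.
wc -c < layer.tar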
Then we push it to the registry:
regctl image copy "$IMAGE_NAME"{,}
And then… then what? Then we can’t actually create a manifest because, contrary to base images, we need to build images like home (the name my CMS has this week, for those who follow along at… well, at home) from a machine with a matching architecture, because the process is:
- In a Debian 12 arm64/amd64 container
- Build with beardist (which invokes cargo build, copies around dynamic libraries, does verifications, compression, uploads)
- Add the built binary on top of the base layer of the correct arch, and push the OCI image with regctl
So the architecture outside the image and inside the image must match.
beardist itself is distributed as a (multi-arch) image, and in fact, is built using itself, which means it has to be bootstrapped somehow.
And the way it’s bootstrapped is:
- From a build environment matching the target env…
- Run cargo install --path . in beardist/
- Run BEARDIST_CACHE=/tmp/beardist beardist build
- Run ./repack.sh
And voila! Now, beardist can build itself in CI, using its own docker image, which will be overwritten on every tag release.
If needed, the bootstrap can be redone, or an earlier “working” tag can simply be used. The chain hasn’t broken yet, a couple weeks in.
Once both architectures are built in CI, as two different Forgejo Actions jobs, a third job is triggered:
That multify.sh script is a bit more manual than the previous strategy since we don’t have docker! Only regctl, which doesn’t come with “manifest-building” utilities.
Luckily, it’s “just JSON”, right?
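Roughly: fetch each arch-specific manifest’s digest and size, assemble the manifest list by hand, and push it. A sketch of the approach, not the actual multify.sh; it assumes the per-arch images were tagged <version>-amd64 / <version>-arm64 and that regctl’s manifest get/put flags behave as I remember:
IMG=code.bearcove.cloud/bearcove/beardist
TAG=3.8.9

# Digest and size of each arch-specific image manifest
AMD64_DIGEST=$(regctl image digest "$IMG:$TAG-amd64")
ARM64_DIGEST=$(regctl image digest "$IMG:$TAG-arm64")
AMD64_SIZE=$(regctl manifest get "$IMG:$TAG-amd64" --format raw-body | wc -c)
ARM64_SIZE=$(regctl manifest get "$IMG:$TAG-arm64" --format raw-body | wc -c)

# Assemble the manifest list ("it's just JSON")
jq -n --arg ad "$AMD64_DIGEST" --argjson asz "$AMD64_SIZE" \
      --arg rd "$ARM64_DIGEST" --argjson rsz "$ARM64_SIZE" '{
  schemaVersion: 2,
  mediaType: "application/vnd.docker.distribution.manifest.list.v2+json",
  manifests: [
    {mediaType: "application/vnd.docker.distribution.manifest.v2+json",
     size: $asz, digest: $ad, platform: {architecture: "amd64", os: "linux"}},
    {mediaType: "application/vnd.docker.distribution.manifest.v2+json",
     size: $rsz, digest: $rd, platform: {architecture: "arm64", os: "linux"}}
  ]
}' > manifest.json

# Push it under both the version tag and :latest
regctl manifest put "$IMG:$TAG" \
  --content-type application/vnd.docker.distribution.manifest.list.v2+json < manifest.json
regctl manifest put "$IMG:latest" \
  --content-type application/vnd.docker.distribution.manifest.list.v2+json < manifest.json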
This script, too, is pleasing:
amos in 🌐 souffle in beardist on main [!] via 🦀 v1.86.0
❯ time GITHUB_REF=refs/tags/v3.8.9 ./multify.sh
🔍 Starting multi-architecture container manifest creation...
📦 Detected tag: 3.8.9
📦 Getting digests and sizes...
⬇️ Fetching AMD64 digest...
⬇️ Fetching ARM64 digest...
✅ ARM64 digest retrieved successfully!
✅ AMD64 digest retrieved successfully!
📏 Calculating AMD64 manifest size...
✅ AMD64 size: 2243 bytes
📏 Calculating ARM64 manifest size...
✅ ARM64 size: 2242 bytes
📝 Creating manifest.json...
✅ manifest.json created successfully!
🚀 Pushing manifest.json to registry...
📤 Pushing manifest for tag: 3.8.9
✅ Successfully pushed manifest for tag: 3.8.9
📤 Pushing manifest for tag: latest
✅ Successfully pushed manifest for tag: latest
🎉 Multi-architecture manifest(s) successfully pushed to registry!
________________________________________________________
Executed in 1.45 secs fish external
usr time 110.06 millis 0.29 millis 109.77 millis
sys time 110.70 millis 2.23 millis 108.47 millis
…and will probably be rewritten in Rust eventually, or collapsed into beardist, which isn’t linked because it’s not open-source. It’s custom-made for my needs: make your own!
Here’s the generated manifest:
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
  "manifests": [
    {
      "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
      "size": 2243,
      "digest": "sha256:b2dc52ed0fc06d10b4681405289004da8dab86776223466beb4a84a86fbc8ade",
      "platform": {
        "architecture": "amd64",
        "os": "linux"
      }
    },
    {
      "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
      "size": 2242,
      "digest": "sha256:f79fbe3ae00713394d69970bb5f74af0d043dbf703d8b9ccb2a3f3c110cbd88d",
      "platform": {
        "architecture": "arm64",
        "os": "linux"
      }
    }
  ]
}
And uhh yeah, it works!
Well. It worked for beardist — and then I could have other builds operate from the beardist:latest image, no matter whether they were running on arm64 workers or amd64 workers…
…but by this point, I didn’t really have any good amd64 workers left.
I had:
- Some VM I run with UTM on macOS
- That 8-core arm64 machine (good enough for Rust CI builds)
- 5 2-core amd64 machines
And uhhhh… I tried. But after 30 minutes, the Forgejo Actions job timeout kicked in and… yeah. It couldn’t build my entire website software.
Which, to be fair:
home on main via 🦀 v1.85.1
❯ cat Cargo.lock | grep -F '[[package' | wc -l
838
…is not surprising.
Because there is persistent build storage, I could’ve just retried until it finally built, but… my site was still down at this point! I had preemptively migrated everything else, including postgres clusters, forgejo volumes etc. — but had left my CMS for last because, well, it’s my CMS! I know this!
At this point I realized my Mac Studio has to be on all the time, since it’s running a VM which does the macOS builds. And it has 32GB RAM… it can probably fit another x86_64 Linux VM, right?
Well, it can, but:
- Losing 6GB of RAM is kinda brutal when I’m editing 4K videos
- x86_64 emulation via qemu is slow, and multicore emulation even more so
- USB SATA SSDs are slow (I don’t have enough internal storage for all my VMs)
I only realized that after hours of fiddling around to get IPv6 to work inside a container inside the VM inside my Mac Studio, becauuseeeee….
More like IPv5
I don’t know, okay? At some point I’ll do a deep dive, but… it was past 1AM, I don’t know, I just needed things to work.
Here’s what I think I understood. Maybe.
In Kubernetes, workloads are performed in “containers”, which are run in “pods”, which are scheduled on “nodes”.
In my setup, “nodes” are just the Hetzner Cloud VMs:
I like the nice little map visualization. I think more cloud providers should do that.
Their API is also very fast.
And my x86_64 VM that I ran on my MacBook Pro.
To k3s, they’re the same, they’re all just… nodes:
~
❯ k get nodes
NAME STATUS ROLES AGE VERSION
domino Ready <none> 15h v1.31.6+k3s1
flam Ready <none> 18h v1.31.6+k3s1
hawk Ready <none> 18h v1.31.6+k3s1
heim Ready <none> 18h v1.31.6+k3s1
kaya Ready <none> 18h v1.31.6+k3s1
marl Ready <none> 18h v1.31.6+k3s1
styx Ready control-plane,etcd,master 18h v1.31.6+k3s1
Those nodes need not have a publicly routable IPv4 or IPv6 address: they can be behind NAT (Network Address Translation), and they’ll still be able to:
- reach out to the k3s server
- register themselves as nodes (given the proper auth token)
- and join the overlay network
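Joining such a node is just the documented k3s agent install, pointed at the server with that token. Roughly (the server hostname is a placeholder; the token lives at /var/lib/rancher/k3s/server/node-token on the server):
# On the machine that should become a worker node:
curl -sfL https://get.k3s.io | \
  K3S_URL="https://styx.example.net:6443" \
  K3S_TOKEN="<contents of the server's node-token file>" \
  sh -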
Why an overlay network? Because pods have their own IP address.
And in a simple setup like this, the pod IP addresses are not publicly routable either.
In my current setup…
infra on main [$] via 🦀 v1.85.0
❯ rg 'cidr' roles/
roles/k3s/leader/templates/config.yaml.j2
1:cluster-cidr: 10.42.0.0/16,fd00:42::/48
2:service-cidr: 10.43.0.0/16,fd00:43::/112
…neither the IPv4 nor the IPv6 addresses are “publicly routable” — if you send a packet with any of these as the destination to an internet router, it will chuckle and drop the packet.
The IPv4 address block is called “private address space” and the IPv6 address block is called “Unique Local Address” or ULA.
However, these are perfectly fine to use for a private overlay network like the one set up by k3s so that pods can talk to each other.
What’s the level of granularity of a pod? Like… how many pods to an app?
To give you an example: traefik is the “ingress”, aka the HTTP reverse proxy, so it needs one pod per edge node:
fasterthanli.me on main [$!]
❯ k get pods -n 'traefik' -o json | jq -c '.items[] | {nodeName: .spec.nodeName, podIP: .status.podIP}'
{"nodeName":"domino","podIP":"192.168.210.3"}
{"nodeName":"styx","podIP":"49.13.119.8"}
{"nodeName":"domino","podIP":"192.168.1.100"}
{"nodeName":"heim","podIP":"157.180.27.172"}
{"nodeName":"hawk","podIP":"116.202.24.111"}
{"nodeName":"kaya","podIP":"5.223.56.87"}
{"nodeName":"marl","podIP":"5.78.90.129"}
{"nodeName":"flam","podIP":"5.161.220.244"}
But those pods are a little special — they’re using host networking.
When I point a DNS record for fasterthanli.me at one of my nodes, I need it to listen on ports 80 and 443, and I need those connections to go straight to traefik — hence, the pod IP is actually the publicly routable IP of that node.
fasterthanli.me on main [$!]
❯ ssh root@49.13.119.8 -- "ip addr show eth0 | grep --color=always -E '(inet|inet6) ([0-9a-f:.]+)'"
inet 49.13.119.8/32 brd 49.13.119.8 scope global dynamic eth0
inet6 2a01:4f8:c17:34b1::1/64 scope global
inet6 fe80::9400:4ff:fe32:8ea/64 scope link
Redundant, I know!
But most pods are not special. They have an IP address that comes from the CIDR we defined earlier: that’s the case of pods in the home namespace:
fasterthanli.me on main [$!]
❯ k get pods -n 'home' -o json | jq -c '.items[] | {nodeName: .spec.nodeName, podIP: .status.podIP}'
{"nodeName":"heim","podIP":"10.42.40.130"}
{"nodeName":"hawk","podIP":"10.42.123.3"}
{"nodeName":"marl","podIP":"10.42.71.66"}
{"nodeName":"kaya","podIP":"10.42.29.194"}
{"nodeName":"hawk","podIP":"10.42.123.2"}
{"nodeName":"heim","podIP":"10.42.40.131"}
{"nodeName":"styx","podIP":"10.42.29.2"}
These are all in 10.42.0.0/16!
From one pod, we can reach another:
fasterthanli.me on main [$!]
❯ k exec -n home cub-dc9f5b494-bhnjr -it -- curl -H 'Host: fasterthanli.me' -I http://10.42.40.130:1111
HTTP/1.1 200 OK
content-type: text/html; charset=utf-8
cache-control: no-cache
x-source: eu-north-1.heim.cub-dc9f5b494-bhnjr
content-length: 105153
date: Mon, 07 Apr 2025 17:04:57 GMT
And that is what the overlay network is about.
But it’s not the same thing as having actual connectivity to the internet, or “egress”.
I’ll save you all the different troubleshooting steps I went through, but basically, here’s how things ended up working out: I ended up installing Calico to replace Flannel.
The first big difference is that instead of sending overlay packets as VXLAN over UDP, it establishes a WireGuard network — traffic between nodes is now properly encrypted.
Apparently Flannel supports that too; it’s just not enabled by default.
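In practice, the swap looks something like this (a sketch under my assumptions, not the actual playbook): run k3s without its bundled CNI, install Calico, then flip on WireGuard.
# k3s without Flannel (and without its default network policy controller):
k3s server --flannel-backend=none --disable-network-policy

# After installing Calico, enable WireGuard for node-to-node traffic:
calicoctl patch felixconfiguration default --type=merge \
  -p '{"spec":{"wireguardEnabled":true,"wireguardEnabledV6":true}}'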
And the second big difference is that it’s actually able to do something called NAT66.
Wait wait wait. What?
The NAT king calls
Okay, so let’s look at the simple case, right? We have a pod on a Hetzner cloud VM.
It makes an outbound request to a public IPv4 address — how is it routed?
I don’t know, let’s check traceroute?
Good instinct! Let’s do that. So we’ll create a pod named net-shooter, using the netshoot image:
---
apiVersion: v1
kind: Pod
metadata:
  name: net-shooter
  labels:
    app: net-shooter
spec:
  containers:
    - name: net-shooter
      image: nicolaka/netshoot
      command:
        - sleep
        - infinity
  nodeSelector:
    provider: hcloud
Oh yeah, by the way, I changed my deploy script to not use yq or rsync and… just use kubectl with a bunch of flags:
infra on main [$?] via 🦀 v1.85.0
❯ ./deploy manifests/tests/
🔍 Performing dry run of kubectl apply...
pod/net-shooter created (server dry run)
❓ Do you want to apply these changes? (y/n)
y
✅ Applying changes...
pod/net-shooter created
📤 Preparing to commit and push changes...
❓ Enter a commit message:
create test pod
[main 1ae5b54] create test pod
1 file changed, 14 insertions(+)
create mode 100644 manifests/tests/000-ip-routing-test.yaml
Enumerating objects: 7, done.
Counting objects: 100% (7/7), done.
Delta compression using up to 12 threads
Compressing objects: 100% (5/5), done.
Writing objects: 100% (5/5), 531 bytes | 531.00 KiB/s, done.
Total 5 (delta 2), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
To https://github.com/bearcove/infra.git
f60c806..1ae5b54 main -> main
✅ Changes have been committed and pushed.
Here’s roughly what the deploy script does, if you’re interested:
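A sketch of that flow (not the verbatim script), with prompts matching the output above:
#!/usr/bin/env bash
set -euo pipefail
DIR="${1:?usage: deploy <manifest-dir>}"

# Server-side dry run first, so you see exactly what would change
echo "🔍 Performing dry run of kubectl apply..."
kubectl apply --dry-run=server -f "$DIR"

read -r -p "❓ Do you want to apply these changes? (y/n) " answer
[ "$answer" = "y" ] || exit 1

echo "✅ Applying changes..."
kubectl apply -f "$DIR"

# Keep the manifests repo in sync with what's actually deployed
echo "📤 Preparing to commit and push changes..."
read -r -p "❓ Enter a commit message: " msg
git add "$DIR"
git commit -m "$msg"
git push
echo "✅ Changes have been committed and pushed."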
Is our pod running?
infra on main [$] via 🦀 v1.85.0
❯ k get pods -l app=net-shooter
NAME READY STATUS RESTARTS AGE
net-shooter 1/1 Running 0 10s
Yes, good! Let’s run some tests, shall we?
infra on main [$] via 🦀 v1.85.0
❯ k exec net-shooter -- ip addr show dev eth0 | grep -E 'inet |inet6 '
inet 10.42.29.15/32 scope global eth0
inet6 fd00:42:0:1d1b:89d4:e2d6:158f:6f0f/128 scope global
inet6 fe80::ccf8:a1ff:fe55:ac8d/64 scope link proto kernel_ll
Okay, it definitely has an IPv4 address and an IPv6 address taken from our respective CIDR ranges, and also a link-local address starting with fe80.
Let’s try to do a traceroute to one of the other pods:
infra on main [$] via 🦀 v1.85.0
❯ k exec net-shooter -- traceroute 10.42.123.3
traceroute to 10.42.123.3 (10.42.123.3), 30 hops max, 46 byte packets
1 styx (49.13.119.8) 0.007 ms 0.015 ms 0.007 ms
2 10.42.123.1 (10.42.123.1) 2.354 ms 1.786 ms 0.844 ms
3 10.42.123.3 (10.42.123.3) 0.688 ms 1.061 ms 0.592 ms
Pretty straightforward.
Pods also have IPv6 addresses, since we’re dual-stack!
fasterthanli.me on main [$]
❯ k get pods -n 'home' -o json | jq -c '.items[] | {nodeName: .spec.nodeName, name: .metadata.name, podIPs: .status.podIPs}'
{"nodeName":"flam","name":"cub-695b7f6fdd-42z5m","podIPs":[{"ip":"10.42.52.80"},{"ip":"fd00:42:0:42b4:6c65:9873:2890:3453"}]}
{"nodeName":"flam","name":"cub-695b7f6fdd-65dj5","podIPs":[{"ip":"10.42.52.78"},{"ip":"fd00:42:0:42b4:6c65:9873:2890:3451"}]}
{"nodeName":"marl","name":"cub-695b7f6fdd-89ztk","podIPs":[{"ip":"10.42.71.75"},{"ip":"fd00:42:0:4746:36a6:b9d9:c23:ef8e"}]}
{"nodeName":"hawk","name":"cub-695b7f6fdd-c5s5s","podIPs":[{"ip":"10.42.123.13"},{"ip":"fd00:42:0:f6da:458d:a644:59c2:d4e"}]}
{"nodeName":"marl","name":"cub-695b7f6fdd-knh47","podIPs":[{"ip":"10.42.71.77"},{"ip":"fd00:42:0:4746:36a6:b9d9:c23:ef90"}]}
{"nodeName":"heim","name":"cub-695b7f6fdd-rhll9","podIPs":[{"ip":"10.42.40.136"},{"ip":"fd00:42:0:28ae:bd57:7c2d:4a15:a89"}]}
{"nodeName":"styx","name":"mom-85d6745745-vkjw2","podIPs":[{"ip":"10.42.29.20"},{"ip":"fd00:42:0:1d1b:89d4:e2d6:158f:6f16"}]}
And similarly we can trace that route:
infra on main [$] via 🦀 v1.85.0
❯ k exec net-shooter -- traceroute fd00:42:0:4746:36a6:b9d9:c23:ef8e
traceroute to fd00:42:0:4746:36a6:b9d9:c23:ef8e (fd00:42:0:4746:36a6:b9d9:c23:ef8e), 30 hops max, 72 byte packets
1 fd00:42:0:1d1b:89d4:e2d6:158f:6f15 (fd00:42:0:1d1b:89d4:e2d6:158f:6f15) 0.013 ms 0.010 ms 0.011 ms
2 fd00:42:0:4746:36a6:b9d9:c23:ef89 (fd00:42:0:4746:36a6:b9d9:c23:ef89) 170.372 ms 169.522 ms 169.491 ms
3 fd00:42:0:4746:36a6:b9d9:c23:ef8e (fd00:42:0:4746:36a6:b9d9:c23:ef8e) 169.560 ms 169.474 ms 169.501 ms
But traceroute is… borderline useless.
What we want to know here is answered better by the ip route command:
infra on main [$] via 🦀 v1.85.0
❯ k exec net-shooter -it -- ip -4 route show
default via 169.254.1.1 dev eth0
169.254.1.1 dev eth0 scope link
infra on main [$] via 🦀 v1.85.0
❯ k exec net-shooter -it -- ip -6 route show
fd00:42:0:1d1b:89d4:e2d6:158f:6f0f dev eth0 proto kernel metric 256 pref medium
fe80::/64 dev eth0 proto kernel metric 256 pref medium
default via fe80::ecee:eeff:feee:eeee dev eth0 metric 1024 pref medium
To me, this is interesting, because, well… both 169.254.1.1 and fe80::/64 are “link-local” addresses: the only other place I’ve seen them is when DHCP fails and your computer decides to pick an address that, I guess, would allow you to communicate with something at the other end even without DHCP?
So, actually, the trick is coming from outside the pod… because if we ask a random server from the internet what’s our IP address, it will have a radically different answer than what we’ve seen so far:
infra on main [$] via 🦀 v1.85.0
❯ k exec net-shooter -it -- curl -4 https://icanhazip.com
49.13.119.8
infra on main [$] via 🦀 v1.85.0
❯ k exec net-shooter -it -- curl -6 https://icanhazip.com
2a01:4f8:c17:34b1::1
This is the IP address of the node, not the pod — NAT is happening, both for IPv4 (NAT44):
root@styx ~# iptables -4 -t nat -L cali-POSTROUTING -v -n
Chain cali-POSTROUTING (1 references)
pkts bytes target prot opt in out source destination
12104 740K cali-fip-snat 0 -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:Z-c7XtVd2Bq7s_hA */
12104 740K cali-nat-outgoing 0 -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:nYKhEzDlr11Jccal */
0 0 MASQUERADE 0 -- * vxlan.calico 0.0.0.0/0 0.0.0.0/0 /* cali:e9dnSgSVNmIcpVhP */ ADDRTYPE match src-type !LOCAL limit-out ADDRTYPE match src-type LOCAL random-fully
0 0 MASQUERADE 0 -- * wireguard.cali 0.0.0.0/0 0.0.0.0/0 /* cali:kgfCOPW4UKtzMAmO */ ADDRTYPE match src-type !LOCAL limit-out ADDRTYPE match src-type LOCAL random-fully
And for IPv6 (NAT66):
root@styx ~# ip6tables -t nat -L cali-POSTROUTING -v -n
Chain cali-POSTROUTING (1 references)
pkts bytes target prot opt in out source destination
2015 173K cali-fip-snat 0 -- * * ::/0 ::/0 /* cali:Z-c7XtVd2Bq7s_hA */
2015 173K cali-nat-outgoing 0 -- * * ::/0 ::/0 /* cali:nYKhEzDlr11Jccal */
0 0 MASQUERADE 0 -- * vxlan-v6.calico ::/0 ::/0 /* cali:MtS-9OgAQy-fAM-w */ ADDRTYPE match src-type !LOCAL limit-out ADDRTYPE match src-type LOCAL random-fully
And this is fine and great for virtual machines hosted on Hetzner which have a public IPv4 address and have a public IPv6 prefix.
But what happens when, say, you tell your home computer to join your Kubernetes cluster? That’s exactly what I ended up doing: let’s see what happens with it!
infra on main [$] via 🦀 v1.85.0
❯ ssh root@domino ip addr show dev enp3s0 | grep -E 'inet6?'
inet 192.168.1.100/24 brd 192.168.1.255 scope global dynamic noprefixroute enp3s0
inet6 2a01:e0a:de8:a760:bf51:36ff:d905:1432/64 scope global temporary dynamic
inet6 2a01:e0a:de8:a760:7656:3cff:fe28:5746/64 scope global dynamic mngtmpaddr noprefixroute
inet6 fe80::7656:3cff:fe28:5746/64 scope link noprefixroute
It does have a publicly routable IPv6 because NAT is not required there. NAT is only being done for IPv4.
infra on main [$] via 🦀 v1.85.0
❯ ssh root@domino -- ip -6 route show default
default via fe80::3a07:16ff:fec2:bc19 dev enp3s0 proto ra metric 100 pref medium
infra on main [$] via 🦀 v1.85.0
❯ ssh root@domino -- ip -4 route show default
default via 192.168.1.254 dev enp3s0 proto dhcp src 192.168.1.100 metric 100
The default routes for IPv4 and IPv6 go directly to the router, and icanhazip reveals our actual public addresses:
infra on main [$] via 🦀 v1.85.0
❯ ssh root@domino -- curl -s -6 https://icanhazip.com
2a01:e0a:de8:a760:bf51:36ff:d905:1432
infra on main [$] via 🦀 v1.85.0
❯ ssh root@domino -- curl -s -4 https://icanhazip.com
87.182.152.211
And you know what this means?
No, but you’ll tell us, right?
Why the fuck is everything broken
Again and again and again: ever since I had this home node join my Kubernetes cluster, I have had nothing but issues.
And every time it’s been the exact same issue and it’s taken me so long to realize what was happening.
This node, called Domino, has IPv4 egress, but doesn’t have IPv4 ingress!
domino is able to establish a connection with google.com over IPv4 and exchange packets no problem, but it cannot host an IPv4 service! If it tries, it’s going to be giving out an IP that is not publicly routable!
Back to our node IPs, let’s take kaya for example:
infra on main [$] via 🦀 v1.85.0
❯ k get nodes -o json | jq -c '.items[] | select(.metadata.name == "kaya") | .status.addresses'
[{"address":"5.223.56.87","type":"InternalIP"},{"address":"2a01:4ff:2f0:10be::1","type":"InternalIP"},{"address":"kaya","type":"Hostname"}]
It has an internal IP of 5.223.56.87 — very well. What’s the IP of the traefik pod on that node?
infra on main [$] via 🦀 v1.85.0
❯ k get pods -n traefik -o json | jq -c -r '.items[] | select(.spec.nodeName == "kaya") | .status.podIPs'
[{"ip":"5.223.56.87"},{"ip":"2a01:4ff:2f0:10be::1"}]
The very same! It uses host networking.
But on domino?
infra on main [$] via 🦀 v1.85.0
❯ k get pods -n traefik -o json | jq -c -r '.items[] | select(.spec.nodeName == "domino") | .status.podIPs'
[{"ip":"192.168.1.100"},{"ip":"2a01:e0a:de8:a760:17c3:ece0:634:8ec7"}]
It’s that LAN IP, 192.168.1.100.
And that caused me some problems when, after restarting pods, the Kubernetes server decided to schedule cert-manager challenge pods on domino.
What’s cert-manager? It’s a neat thing that provisions TLS certificates automatically: you create a Certificate, like so:
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: fasterthanli-me-cert
  namespace: home
spec:
  secretName: fasterthanli-me-cert-secret
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames: [fasterthanli.me, cdn.fasterthanli.me]
And then internally it makes CertificateRequest objects, it makes Orders, it talks with the Let’s Encrypt system, and it’s able to do challenges of different kinds. The one I was using was HTTP-01.
Which works by making a request on some well-known path. Literally a path that starts with /.well-known/acme-challenge/: cert-manager creates a temporary HTTP endpoint that the Let’s Encrypt servers can access to verify that you control the domain you’re requesting a certificate for.
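Concretely, the validation request looks something like this; the token here is hypothetical, Let’s Encrypt generates one per order:
# Plain HTTP, no TLS needed: this is how you prove control before you have a cert.
curl -i "http://fasterthanli.me/.well-known/acme-challenge/SOME_TOKEN"
# The temporary solver endpoint must answer with "SOME_TOKEN.<account key thumbprint>".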
cert-manager does this by creating an Ingress resource, which in turn is handled by traefik, to serve just that path over HTTP (and the usual site over HTTPS). As soon as the TLS certificate is created, it’s swapped in and traefik starts using it.
And that’s all well and good. But this is where we find out that there are actually at least two types of services that can be used in a Kubernetes setup like mine: ClusterIP and NodePort.
And really, in my situation, there’s no good reason for any service to be NodePort except for traefik, which must use host networking since there’s no load balancing in front of it, I’m not doing managed kubernetes, and I also can’t roll my own layer 3 load balancer.
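If you want to feel the difference, kubectl can create one of each against the net-shooter pod from earlier (service names are made up for the example):
# ClusterIP: gets an address from the service CIDR (10.43.0.0/16 here),
# only reachable from inside the cluster / overlay.
kubectl expose pod net-shooter --port 8080 --type ClusterIP --name demo-clusterip

# NodePort: same, plus a high port opened on every node's own address,
# which is exactly what goes wrong on a node whose IPv4 isn't publicly routable.
kubectl expose pod net-shooter --port 8080 --type NodePort --name demo-nodeport

kubectl get svc demo-clusterip demo-nodeport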
And yet, the cert-manager challenge services defaulted to NodePort for some reason — which used to always work on the nodes that were actually hosted on Hetzner Cloud VMs, but didn’t work on domino!
Because domino is doing double NAT for IPv4, and, for IPv6, is still doing single NAT, because even though the node would be able to route a whole /64’s worth of IPv6 addresses, Calico is picking pod IP addresses from the pools we gave it:
---
apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
  name: ipv4-pool
spec:
  cidr: 10.42.0.0/16
  ipipMode: Never
  vxlanMode: Always
  natOutgoing: true
  disabled: false
  nodeSelector: all()
---
apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
  name: ipv6-pool
spec:
  cidr: fd00:42::/48
  ipipMode: Never
  vxlanMode: Always
  natOutgoing: true
  disabled: false
  nodeSelector: all()
And that’s… just not great… to learn about, at 4 in the morning, when everything’s been down for hours.
Like… I’ve never asked to learn all this, man. I was just trying to throw a little arm64 in the mix. I miss solving problems with Dockerfile. Let me out. LET ME OUT.
Wait, wait, wait, so is there a way to disable NAT66 just for that node?
I think there is, but I’ve just been too scared to touch it so far.
But… where’s the fun in that?
Ah, damn it, you’re right.
Bye NATalie
If I understood everything correctly, we need to create another IP pool, just for our node:
# ✂️: ipv4 pool
---
apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
  name: ipv6-pool
spec:
  cidr: fd00:42::/48
  ipipMode: Never
  vxlanMode: Always
  natOutgoing: true
  disabled: false
  nodeSelector: "kubernetes.io/hostname != 'domino'"
---
apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
  name: public-ipv6-pool
spec:
  cidr: 2a01:e0a:de8:a760::/64
  ipipMode: Never
  vxlanMode: Always
  natOutgoing: false
  disabled: false
  nodeSelector: "kubernetes.io/hostname == 'domino'"
And apply it… and restart the pods, and… I don’t know, let’s test it:
---
apiVersion: v1
kind: Pod
metadata:
  name: ipv6-server
spec:
  nodeSelector:
    kubernetes.io/hostname: domino
  containers:
    - name: server
      image: python:3
      command:
        - python3
        - -m
        - http.server
        - "8080"
        - "--bind"
        - "::" # <-- Listen on all IPv6 interfaces
      ports:
        - containerPort: 8080
          protocol: TCP
infra on main [$?] via 🦀 v1.85.0
❯ kubectl apply -f manifests/tests/100-python.yaml
pod/ipv6-server created
Let’s see what IPs were assigned…
infra on main [$?] via 🦀 v1.85.0
❯ kubectl get pod ipv6-server -o jsonpath='{.status.podIPs}' | jq -c .
[{"ip":"10.42.210.16"},{"ip":"2a01:e0a:de8:a760:8ccd:f32f:73e5:da03"}]
…oooh, promising!
Now let’s see if we can access that pod?
infra on main [$] via 🦀 v1.85.0
❯ curl --connect-timeout 2 -I 'http://[2a01:e0a:de8:a760:8ccd:f32f:73e5:da03]:8080'
curl: (28) Failed to connect to 2a01:e0a:de8:a760:8ccd:f32f:73e5:da03 port 8080 after 2006 ms: Timeout was reached
Oh. We can’t.
Wait, but that’s from inside the LAN.
And? You think it’s going to work better outside the LAN?
Try it
Fine, fine if y-
amos in 🌐 styx in ~
❯ curl --connect-timeout 2 -I 'http://[2a01:e0a:de8:a760:8ccd:f32f:73e5:da03]:8080'
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/3.13.2
Date: Mon, 07 Apr 2025 19:50:04 GMT
Content-type: text/html; charset=utf-8
Content-Length: 832
You… what? The fuck?
Claude tells me this can be caused by “LAN Hairpinning” or “NDP Scope Problems”.
Well, I guess that’s why we have Happy Eyeballs, so that the IPv4 path will work on LAN and the IPv6 path will work on the public internet.
Good night, everyone — and thanks for following along!