Nginx Is Gone
As of April 2026, Nginx is completely removed from the Boottify stack. The package is purged (apt purge nginx*), /etc/nginx/ does not exist, and no nginx process runs anywhere on the host. Traefik is the sole ingress controller.
Traefik runs as a Kubernetes pod with hostPort 80/443 on the pod spec and also exposes NodePort 30080/30443. Platform domains (boottify.com, control.boottify.com) are routed via an IngressRoute that forwards to the nginx-platform service (which maps to 10.42.0.1:3000, the Next.js process running on the host). Customer app subdomains get their own IngressRoute resources generated at deploy time.
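As a sketch, a platform IngressRoute of the shape described above might look like the following. The resource names, namespace, and TLS secret name are illustrative, not copied from the cluster; traefik.io/v1alpha1 is the current Traefik CRD API group (older installs use traefik.containo.us/v1alpha1):

```yaml
# Illustrative IngressRoute for the platform domains. The nginx-platform
# Service is assumed to already exist and point at 10.42.0.1:3000.
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: boottify-platform        # illustrative name
  namespace: default
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`boottify.com`) || Host(`control.boottify.com`)
      kind: Rule
      services:
        - name: nginx-platform   # maps to 10.42.0.1:3000 (host Next.js)
          port: 3000
  tls:
    secretName: boottify-wildcard-tls   # illustrative secret name
```

Customer app subdomains would get the same shape with their own Host() match and service, generated at deploy time.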
This removes an entire layer of reverse-proxy indirection that was causing header forwarding issues and complicating the TLS chain.
CoreDNS: Scheduled on the Wrong Node
The K3s cluster has two nodes: the Hetzner host (primary, where all platform workloads run) and an AWS worker node joined for extra capacity. In late March, a pod reschedule event landed CoreDNS on the AWS worker node.
The consequence was that all app pods — which run on the Hetzner host — could no longer resolve DNS. The root cause: cross-node pod networking is broken in this cluster because Flannel VXLAN uses the AWS worker's private IP (172.31.39.213) as its VTEP endpoint, which is unreachable from the Hetzner network. Pods on Hetzner cannot talk to pods on AWS.
The fix was to pin CoreDNS to the Hetzner host permanently by adding a nodeSelector to the CoreDNS manifest at /var/lib/rancher/k3s/server/manifests/coredns.yaml:
nodeSelector:
  kubernetes.io/hostname: ubuntu-2404-noble-amd64-base
This is now the policy for all infrastructure-critical pods: pin them to the Hetzner host. The AWS worker exists as overflow capacity for workloads that don't require cross-pod communication.
Hetzner DNS Forwarder
The CoreDNS fix exposed a second issue: Hetzner's network blocks outbound UDP and TCP port 53 to external resolvers. CoreDNS was configured to forward to 8.8.8.8 and 1.1.1.1, which failed silently once it ran on the Hetzner host. The fix was to update the Corefile forwarder to use Hetzner's own nameservers:
forward . 185.12.64.2 185.12.64.1
This is set in both the live ConfigMap (hot-reloaded by the CoreDNS reload plugin) and the persistent K3s manifest.
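For orientation, a Corefile with the updated forwarder looks roughly like this. Everything except the forward line is the stock K3s CoreDNS default reproduced from memory, so treat the surrounding plugins as a sketch:

```
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
    }
    forward . 185.12.64.2 185.12.64.1
    cache 30
    loop
    reload
    loadbalance
}
```

The reload plugin is what lets the live ConfigMap change take effect without restarting the CoreDNS pod.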
Dual iptables: The Ghost Traffic Problem
Ubuntu 24.04 ships both iptables backends, iptables-legacy (backed by the x_tables kernel modules) and iptables-nft, and on this host both were active simultaneously. This caused a subtle but serious problem:
- kube-proxy (managed by K3s) writes KUBE-SEP and KUBE-SVC chains to the nft table
- The Flannel CNI bridge plugin writes hostPort DNAT rules to both tables on pod creation, but only updates the nft table on pod restarts
- When pods restart, old KUBE-SEP chains in the legacy table still pointed at dead pod IPs
- Some connections, depending on which rule evaluation path the kernel chose, were routed to the dead pod IP and dropped
The fix is a systemd timer (k3s-iptables-cleanup.timer) that runs every 5 minutes and flushes all legacy KUBE-SEP and CNI-DN chains, forcing all traffic to use the nft rules that kube-proxy keeps current. A companion service (k3s-forward-rules.service) flushes on startup and re-establishes Flannel FORWARD rules ahead of UFW's REJECT defaults.
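A timer pair of this shape can implement the cleanup. The unit contents below are illustrative, not the deployed files, and the chain-name extraction pipeline is one plausible way to enumerate the stale chains:

```ini
# /etc/systemd/system/k3s-iptables-cleanup.service (illustrative)
[Unit]
Description=Flush stale legacy KUBE-SEP / CNI-DN chains

[Service]
Type=oneshot
# List chain names in the legacy nat table and flush each one, leaving
# only the nft rules that kube-proxy keeps current.
ExecStart=/bin/sh -c 'iptables-legacy-save -t nat | grep -oE "(KUBE-SEP|CNI-DN)[^ ]*" | sort -u | xargs -r -n1 iptables-legacy -t nat -F'

# /etc/systemd/system/k3s-iptables-cleanup.timer (illustrative)
[Unit]
Description=Run legacy iptables cleanup every 5 minutes

[Timer]
OnBootSec=2min
OnUnitActiveSec=5min

[Install]
WantedBy=timers.target
```

Note that iptables -F empties a chain's rules without deleting the chain itself, which is sufficient here: an empty legacy chain can no longer steer traffic to a dead pod IP.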
This fix eliminated a class of intermittent 502 errors that had no obvious log trace.
Platform Fallback Pod
The platform now has a hot-standby fallback deployment (boottify-platform-fallback) that can activate within seconds if the primary service fails.
The fallback pod runs on port 3001 with hostNetwork: true, which means it can reach 127.0.0.1:3306 (MySQL) and the local Redis instance without any additional networking configuration — the same way the primary Next.js process does.
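A fragment of what that pod template looks like in principle; apart from hostNetwork and the port, the field values (names, image) are illustrative:

```yaml
# Illustrative fragment of the boottify-platform-fallback pod template.
spec:
  hostNetwork: true                     # share the host network namespace,
                                        # so 127.0.0.1:3306 and local Redis work
  dnsPolicy: ClusterFirstWithHostNet    # keep cluster DNS despite hostNetwork
  containers:
    - name: platform-fallback
      image: boottify/platform:latest   # illustrative image reference
      env:
        - name: PORT
          value: "3001"
```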
Activation and restoration are handled by /usr/local/bin/platform-failover:
# Activate fallback (routes traffic to port 3001)
platform-failover activate
# Restore primary (routes traffic back to port 3000)
platform-failover restore
# Check which mode is active
platform-failover status
A systemd drop-in on the primary service triggers platform-failover activate automatically on failure and calls platform-failover restore once the primary successfully starts again. The net result is that a primary service crash triggers automatic failover in under 10 seconds, with no manual intervention needed.
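That wiring can be sketched as a drop-in of the following shape, assuming the primary runs as a hypothetical boottify-platform.service and that activation is wrapped in a small oneshot unit (OnFailure= takes a unit name, not a command):

```ini
# /etc/systemd/system/boottify-platform.service.d/failover.conf (illustrative)
[Unit]
# Activate the fallback whenever the primary enters the failed state.
OnFailure=platform-failover-activate.service

[Service]
# After a successful (re)start, route traffic back to port 3000.
ExecStartPost=/usr/local/bin/platform-failover restore
```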
K3s Encryption Incident and Recovery
In early April an accidental overwrite of the K3s encryption config file replaced the AES-CBC key with an identity-only (no-op) config. All 20 Kubernetes secrets lost their at-rest encryption.
Recovery steps taken:
- Deleted and recreated all 20 secrets from known values (the platform kept running throughout; the API server decrypts secrets on read, so consumers never see ciphertext)
- Restored TLS secrets for all active namespaces from cert files on disk
- Cluster is currently running with identity-only encryption (functional but not encrypted at rest)
The wildcard TLS secret (*.boottify.com, expiring 2026-07-08) was restored to the default namespace and all active app namespaces. The proper AES-GCM re-encryption will be completed in a maintenance window: generate a new key, restart K3s, then force-replace all secrets with kubectl get secrets -A -o json | kubectl replace -f -.
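For reference, the target state is a standard Kubernetes EncryptionConfiguration with an aesgcm provider listed first (used for writes) and identity kept as a read fallback for not-yet-re-encrypted data. The key material below is a placeholder, not a real secret:

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aesgcm:
          keys:
            - name: key1
              secret: PLACEHOLDER_BASE64_32_BYTE_KEY=   # placeholder
      - identity: {}   # lets existing unencrypted secrets still be read
```

The kubectl get | kubectl replace pipeline then rewrites every secret, which forces the API server to re-encrypt each one with the aesgcm key.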
ACME DNS-01 Wildcard Race Condition Fix
Wildcard certificate issuance (e.g. *.boottify.com) requires Let's Encrypt to verify two separate ACME authorizations, both of which create a _acme-challenge TXT record under the same DNS name. Our ACME client was processing these concurrently, which allowed the two sequences to interleave:
- Authorization A creates its TXT record via the Hetzner API
- Authorization B creates a TXT record under the same name → 409 conflict
- Authorization A reads the record back → gets B's value, not its own
The fix in src/lib/ssl/acme-dns01.ts:
- A challengeCreateLocks Map serialises TXT record creation per DNS name: the second authorisation waits for the first to finish before creating its record
- A challengeStoredValues in-memory store avoids the read-after-write race by returning the locally known value rather than fetching it fresh from the API
- skipChallengeVerification: true passed to client.auto() avoids a duplicate DNS check after our own propagation wait has already confirmed the record is live
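The serialisation half of the fix can be sketched like this. The two Map names mirror the text; createTxtRecord stands in for the real Hetzner API call and is supplied by the caller, so this is illustrative rather than the actual module:

```typescript
// Per-DNS-name lock map and local value store, as described above.
const challengeCreateLocks = new Map<string, Promise<void>>();
const challengeStoredValues = new Map<string, string>();

// Run fn exclusively per key: callers for the same DNS name queue up in order.
async function withLock<T>(key: string, fn: () => Promise<T>): Promise<T> {
  const prev = challengeCreateLocks.get(key) ?? Promise.resolve();
  let release!: () => void;
  const gate = new Promise<void>((resolve) => (release = resolve));
  challengeCreateLocks.set(key, prev.then(() => gate));
  await prev; // wait for the previous holder to release
  try {
    return await fn();
  } finally {
    release();
  }
}

async function createChallengeRecord(
  name: string,
  value: string,
  createTxtRecord: (name: string, value: string) => Promise<void>,
): Promise<void> {
  await withLock(name, async () => {
    await createTxtRecord(name, value);
    // Remember our own value so a later lookup never returns the record a
    // sibling authorisation wrote under the same name.
    challengeStoredValues.set(`${name}:${value}`, value);
  });
}
```

The verification step would then read from challengeStoredValues instead of fetching the record back from the DNS provider, which is what closes the read-after-write race.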
Wildcard cert issuance now completes reliably on the first attempt.
Session Cookie Name Fix
A subtle routing issue was discovered: proxy.ts was checking for a cookie named session to determine whether a request was authenticated, but Lucia v3 actually names the cookie btfy_session (configurable via lucia.ts). Unauthenticated requests were incorrectly passing the proxy guard. The proxy now checks btfy_session.
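The corrected guard amounts to the check below. SESSION_COOKIE, parseCookies, and hasSession are illustrative names, not necessarily what proxy.ts calls them:

```typescript
// Name of the session cookie as issued by Lucia v3 (configured in lucia.ts).
const SESSION_COOKIE = "btfy_session";

// Parse a raw Cookie header into a name -> value map.
function parseCookies(header: string): Record<string, string> {
  const cookies: Record<string, string> = {};
  for (const part of header.split(";")) {
    const eq = part.indexOf("=");
    if (eq === -1) continue; // skip malformed fragments
    cookies[part.slice(0, eq).trim()] = part.slice(eq + 1).trim();
  }
  return cookies;
}

// The proxy guard: a request counts as authenticated only if the
// btfy_session cookie is present (its validity is still checked downstream).
function hasSession(cookieHeader: string | undefined): boolean {
  if (!cookieHeader) return false;
  return SESSION_COOKIE in parseCookies(cookieHeader);
}
```

Checking only for the cookie's presence is intentional at the proxy layer: actual session validation stays with Lucia in the application.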



