We left off on a high note. The Kubernetes cluster was alive. But as any engineer knows, a working system is often just the prelude to the next, more interesting problem. While the cluster was technically functional, its architecture had a hidden Achilles' heel: a single point of failure for all incoming traffic.
My mission was clear: eliminate it. The tool for the job was MetalLB, and the task seemed simple. I was wrong. What followed was a multi-day investigation down a deep and winding networking rabbit hole. This is the case file for that investigation—a detective story that starts with misleading TLS errors, leads to phantom network blocks, and ends with the unmasking of a culprit buried in the very foundation of my virtualization tools. Join me as we solve the case of the unreachable cluster.
The plan looked straightforward: I gave MetalLB an address pool on my main LAN (192.168.1.x) and removed the `nodeSelector` to allow any node to become the leader for the VIP.
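For reference, here is a minimal sketch of what that configuration can look like, assuming MetalLB's CRD-based setup (v0.13+); the pool name and address range are illustrative, not my exact values:

```bash
# Hypothetical MetalLB L2 setup: an address pool on the main LAN, plus an
# L2Advertisement with no node selector so any speaker can answer ARP for the VIP.
kubectl apply -f - <<'EOF'
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: lan-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.1.240-192.168.1.250
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: lan-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - lan-pool
EOF
```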
The existing LoadBalancer IP did not change on its own, so I flipped the Traefik service type from LoadBalancer to ClusterIP and back again. This successfully assigned a new, correct VIP to Traefik.
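A sketch of that flip, assuming Traefik lives in a `traefik` namespace under a service named `traefik` (adjust for your install):

```bash
# Force a fresh LoadBalancer allocation by flipping the service type and back.
kubectl -n traefik patch svc traefik -p '{"spec":{"type":"ClusterIP"}}'
kubectl -n traefik patch svc traefik -p '{"spec":{"type":"LoadBalancer"}}'

# Confirm the new EXTERNAL-IP handed out by MetalLB.
kubectl -n traefik get svc traefik
```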
The celebration was short-lived. Connections to the new VIP failed with `SSL_ERROR_SYSCALL`, while my Traefik logs showed a cryptic local error: `tls: bad record MAC`. A `bad record MAC` error strongly suggests a "man-in-the-middle" device corrupting TLS packets, so we suspected an overly aggressive firewall on my router or on the VM host.
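To see the failure from the client side, a quick probe along these lines can help; the VIP and hostname here are placeholders, not my real values:

```bash
# Watch the TLS handshake fail from a client's point of view.
curl -vk https://192.168.1.240/                            # curl against the raw VIP
openssl s_client -connect 192.168.1.240:443 \
  -servername apps.example.lab </dev/null                  # raw handshake with SNI
```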
Yet the `bad record MAC` error still occurred when connecting from another computer on my local network, proving the traffic corruption was happening closer to the cluster. Then, `ping`ing the VIP revealed an even deeper issue: `Destination Host Unreachable`. We couldn't even reach the IP at a basic network level.
Our next suspect was the CNI, so we added a `CiliumNetworkPolicy` to allow the traffic. The evidence, however, pointed elsewhere: `arp -a` output showed a successful ARP entry for the VIP, and the logs showed MetalLB was announcing the service. ARP wasn't the problem after all. The node was reachable at Layer 2.
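The checks behind that conclusion looked roughly like this; the VIP and the MetalLB label selector are assumptions that depend on how MetalLB was installed:

```bash
VIP=192.168.1.240   # hypothetical LoadBalancer VIP

# Does this machine have a MAC address for the VIP? (Layer 2 resolution)
arp -a | grep "${VIP}"
ip neigh show | grep "${VIP}"

# Is MetalLB's speaker actually announcing the service?
kubectl -n metallb-system logs -l app.kubernetes.io/component=speaker --tail=100 \
  | grep -i announc
```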
Time for the heavy machinery: `tcpdump` on the leader node. The packet capture revealed the definitive truth: requests for the VIP were arriving on `eth1` (my main LAN), but the replies were leaving through `eth0`. Asymmetric routing.
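A capture along these lines is enough to expose that split; the interface names and VIP are this lab's values:

```bash
VIP=192.168.1.240   # hypothetical VIP

# Watch the requests arrive on the LAN-facing interface...
sudo tcpdump -ni eth1 host "${VIP}" and tcp port 443

# ...and, in another terminal, catch the replies sneaking out the wrong interface.
sudo tcpdump -ni eth0 tcp port 443
```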
A look at the `ip route` table of my VM confirmed the diagnosis. My node's default route was incorrectly pointing at the `eth0` interface, a private management network created by Vagrant: `default via 192.168.121.1 dev eth0`.
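Two commands are usually enough to catch this kind of misrouting; the client address below is a stand-in:

```bash
# Show the full routing table -- which interface owns the default route?
ip route

# Ask the kernel which route (and source interface) it would use to reach a LAN client.
ip route get 192.168.1.50
```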
The immediate fix was to swap the default route by hand: `sudo ip route del default && sudo ip route add default via 192.168.1.254 dev eth1`. To make it stick, I baked the change into the `Vagrantfile`, using a shell provisioner set to `run: "always"` to enforce the correct default route on every single boot.
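As a sketch of what that provisioner can run (the script name, gateway, and interface are this lab's values, not universal ones), referenced from the Vagrantfile with something like `config.vm.provision "shell", run: "always", path: "fix-default-route.sh"`:

```bash
#!/usr/bin/env bash
# fix-default-route.sh -- runs as root via Vagrant's shell provisioner (run: "always"),
# replacing the default route Vagrant installs on eth0 with the real LAN gateway on eth1.
set -euo pipefail

GATEWAY="192.168.1.254"   # LAN gateway
IFACE="eth1"              # bridged interface on the main LAN

ip route del default 2>/dev/null || true
ip route add default via "${GATEWAY}" dev "${IFACE}"
```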
With the route fixed, the `ClusterIssuer` finally came back to life: cert-manager was now able to contact Let's Encrypt, issue a valid certificate, and Traefik began serving it correctly.

Case closed. Here is what the investigation taught me:

- **Address Single Points of Failure (SPOF) Systematically:** Our journey began with eliminating an SPOF. The lesson is to always pursue robust, architecturally sound solutions (like Kubernetes-native MetalLB) rather than simply moving the SPOF to another component.
- **Networking Troubleshooting: Follow an L7-to-L2 Approach (Iteratively):** Start at the top with application logs (such as cert-manager's). Confirm what the application thinks is happening. Verify Kubernetes service-specific config (like `externalTrafficPolicy`, or whether a `LoadBalancerIP` is sticky). Then work downwards: `ping` for reachability and `traceroute` for path. If `ping` fails with "Destination Host Unreachable," always check Layer 2. (A condensed version of this sweep is sketched after this list.)
- **`arp -a` is Critical for L2 Diagnostics:** When basic `ping` fails, `arp -a` confirms if the IP is even resolving to a MAC address. A successful ARP reply shifts the focus immediately to routing or higher-layer issues.
- **`ip route` Maps Your World:** If ARP works but traffic still fails, inspect the host's routing table. Misconfigured default gateways, competing routes, or incorrect interface assignments (`eth0` vs. `eth1`) are common culprits for asymmetric routing.
- **`tcpdump` is the Ultimate Truth Serum (L2-L7):** When logs mislead and pings don't tell the whole story, `tcpdump` reveals precisely what packets are arriving, leaving, and being dropped. It was the definitive tool that exposed the asymmetric routing.
- **Know Your Tools' Hidden Quirks:** Understand the default behaviors of your infrastructure tools. Vagrant's automatic management network (`eth0`) creating a competing default route was the hidden antagonist in our debugging journey.
- **Trust But Verify, Always:** Don't assume. Don't assume a `LoadBalancer` IP will change automatically. Don't assume your router isn't interfering. Don't assume a `ping` failure means ARP is broken. Test and get concrete data for every assumption.
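And the condensed L7-to-L2 sweep referenced above, as a sketch; the namespace, service name, and VIP are placeholders for whatever your cluster uses:

```bash
# L7: what does the application itself report?
kubectl -n traefik logs deploy/traefik --tail=50

# Service layer: how is the Service actually configured?
kubectl -n traefik get svc traefik \
  -o jsonpath='{.spec.type} {.spec.externalTrafficPolicy}{"\n"}'

# L3: can we reach the VIP at all, and along which path?
ping -c 3 192.168.1.240
traceroute 192.168.1.240

# L2: does the VIP resolve to a MAC address?
ip neigh show | grep 192.168.1.240
```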