
Kubernetes Networking Deep Dive, Part 3

This is the third post in my four-part series tracking packets as they flow through a Kubernetes cluster. In Part 2, I went over pod-to-pod (east-west) traffic. Now let’s talk about traffic from an external user, through a LoadBalancer into the cluster, to a pod, and back again. Yep, all that.

Every packet destined for a Kubernetes Service has to pass through iptables (or IPVS) rules that select a backend pod and rewrite packet headers. Understanding this path matters when you're debugging connectivity problems, latency, or service configuration.

Let’s dig a bit into the Service resource type in Kubernetes.

Pods are “ephemeral,” meaning they are temporary. Every time a pod gets created or restarted, it gets a new IP address. Trying to connect to a pod’s IP address directly is brittle since the IP can change at a moment’s notice. Instead, use a Service to provide a more stable endpoint that will route your traffic to the pods it exposes.
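
Pod IP churn is easy to demonstrate (the pod name below is hypothetical; the app=my-app label matches the examples later in this post):

# Note the pod's current IP
kubectl get pods -l app=my-app -o jsonpath='{.items[0].status.podIP}'
# 10.244.0.5

# Delete it; the Deployment creates a replacement with a new IP
kubectl delete pod my-app-7d4b9c-x2x1z
kubectl get pods -l app=my-app -o jsonpath='{.items[0].status.podIP}'
# 10.244.0.9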

ClusterIP (default): This will allocate a virtual IP from the Service CIDR for the service itself. You can only get to this IP from inside the cluster. It appears only in iptables or IPVS rules, not on any kind of network interface.

NodePort: This type of service opens a port (default range 30000-32767) directly on every node in the cluster. External traffic can reach the service via <node-ip>:<nodeport>.

LoadBalancer: This provisions a load balancer outside of the cluster via whatever platform you’re using (a cloud provider, or MetalLB for bare-metal clusters). The load balancer obtains an external IP address and forwards traffic to the NodePort.

# View services and their types
kubectl get svc -o wide
# Output:
# NAME         TYPE           CLUSTER-IP    EXTERNAL-IP     PORT(S)        AGE
# kubernetes   ClusterIP      10.96.0.1     <none>          443/TCP        30d
# my-app       LoadBalancer   10.96.0.15    203.0.113.50    80:30080/TCP   5d
# internal-api ClusterIP      10.96.0.42    <none>          8080/TCP       5d

For the my-app service, 80:30080/TCP means: service port 80 (what the load balancer forwards to) maps to NodePort 30080, which kube-proxy exposes on every node.
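
You can pull both numbers straight from the Service spec:

kubectl get svc my-app -o jsonpath='{.spec.ports[0].port}:{.spec.ports[0].nodePort}'
# 80:30080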

For example, let’s say we trace traffic to a LoadBalancer service with three backend pods:

  • External client: 198.51.100.5 (public internet)

  • Load balancer external IP: 203.0.113.50 (provided by the LoadBalancer provisioner)

  • NodePort: 30080

  • Service ClusterIP: 10.96.0.15

  • Backend pods:

    • Pod 1: 10.244.0.5 on Node 1 (192.168.1.10)

    • Pod 2: 10.244.1.3 on Node 2 (192.168.1.11)

    • Pod 3: 10.244.2.2 on Node 3 (192.168.1.12)

The client’s browser connects to http://203.0.113.50 (the load balancer). The client’s TCP/IP stack creates a packet:

  • Source IP: 198.51.100.5 (the client)

  • Destination IP: 203.0.113.50 (the load balancer)

  • Source port: 54321 (ephemeral)

  • Destination port: 80 (what the load balancer listens on)

The external load balancer receives the packet on its external IP. Then, the load balancer:

  1. Accepts the TCP connection (three-way handshake)

  2. Selects a healthy backend node from its pool (nodes listening on NodePort 30080)

  3. Forwards the traffic to that node

The load balancer performs checks against the NodePort to determine if the node is ready to accept traffic.

# Example health check with netcat (what the LB does internally)
# TCP connect to each node on port 30080 to ensure it's responding.
nc -zv 192.168.1.10 30080
nc -zv 192.168.1.11 30080
nc -zv 192.168.1.12 30080

Depending on how the load balancer is configured:

  • SNAT mode: the LB rewrites the source IP to its own IP. This lets you restrict incoming traffic to the LB alone; if the client IP matters, the LB can also place it in an X-Forwarded-For header for the backend application to read (a quick check for which mode you have follows this list)

  • DSR/Transparent mode: LB preserves client source IP
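
One way to see which mode you're dealing with, assuming (hypothetically) the backend is an echo server that reflects the request source and headers back in its response body (e.g. mendhak/http-https-echo):

curl -s http://203.0.113.50/ | grep -iE 'x-forwarded-for|"ip"'
# SNAT mode: the reported source is the LB's IP; the client IP is (at best) in X-Forwarded-For
# DSR/transparent mode: the reported source is the client's real IP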

The load balancer forwards the packet to Node 1 (192.168.1.10):

  • Source IP: 198.51.100.5 (client, preserved)

  • Destination IP: 192.168.1.10 (node)

  • Destination port: 30080 (the NodePort)

The packet goes to the node’s physical interface (eth0).

The packet first passes through the PREROUTING chain in the iptables nat table. This is where Kubernetes service routing starts.

sudo iptables -t nat -L PREROUTING -n --line-numbers
# Output:
# Chain PREROUTING (policy ACCEPT)
# num  target     prot opt source               destination
# 1    KUBE-SERVICES  all  --  0.0.0.0/0        0.0.0.0/0

From the above, all traffic is sent to the KUBE-SERVICES chain.

The KUBE-SERVICES chain contains rules for all the Services in the cluster, matched by destination IP:port.

sudo iptables -t nat -L KUBE-SERVICES -n | head -20
# Output:
# Chain KUBE-SERVICES (2 references)
# target                     prot opt source       destination
# KUBE-SVC-XXXX1             tcp  --  0.0.0.0/0    10.96.0.15    /* default/my-app cluster IP */ tcp dpt:80
# KUBE-NODEPORTS             all  --  0.0.0.0/0    0.0.0.0/0     ADDRTYPE match dst-type LOCAL

For NodePort traffic, the destination is a local node IP, not the ClusterIP. The rule ADDRTYPE match dst-type LOCAL catches this and then goes to the chain KUBE-NODEPORTS. (Dizzy yet?)
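
The dst-type LOCAL address-type match covers any IP the kernel considers local to this node. You can see exactly which addresses qualify in the local routing table:

ip route show table local
# Lines like "local 192.168.1.10 dev eth0 ..." are what this rule matches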

This chain matches the actual NodePort numbers:

sudo iptables -t nat -L KUBE-NODEPORTS -n
# Output:
# Chain KUBE-NODEPORTS (1 references)
# target                     prot opt source       destination
# KUBE-EXT-XXXX1             tcp  --  0.0.0.0/0    0.0.0.0/0    /* default/my-app */ tcp dpt:30080

Traffic to port 30080 then jumps to the KUBE-EXT-XXXX1 chain, which handles external traffic for this particular service.

The KUBE-EXT chain handles external traffic policy and then jumps to the service chain:

sudo iptables -t nat -L KUBE-EXT-XXXX1 -n
# Output (externalTrafficPolicy: Cluster):
# Chain KUBE-EXT-XXXX1 (1 references)
# target                     prot opt source       destination
# KUBE-MARK-MASQ             all  --  0.0.0.0/0    0.0.0.0/0
# KUBE-SVC-XXXX1             all  --  0.0.0.0/0    0.0.0.0/0

KUBE-MARK-MASQ marks the packet for source NAT (SNAT) later. This is necessary because the packet may be forwarded to a pod on a different node.
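
The mark chain itself is tiny; it just sets a bit (0x4000) in the packet mark, which the KUBE-POSTROUTING rules shown later test for:

sudo iptables -t nat -L KUBE-MARK-MASQ -n | tail -1
# MARK       all  --  0.0.0.0/0            0.0.0.0/0    MARK or 0x4000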

The KUBE-SVC chain does load balancing across endpoints within the cluster (the different available pods):

sudo iptables -t nat -L KUBE-SVC-XXXX1 -n
# Output:
# Chain KUBE-SVC-XXXX1 (2 references)
# target                     prot opt source       destination
# KUBE-SEP-AAAA1             all  --  0.0.0.0/0    0.0.0.0/0    statistic mode random probability 0.33333333349
# KUBE-SEP-BBBB2             all  --  0.0.0.0/0    0.0.0.0/0    statistic mode random probability 0.50000000000
# KUBE-SEP-CCCC3             all  --  0.0.0.0/0    0.0.0.0/0

The probability rules implement random selection of a backend pod:

  • First rule: 33.3% chance (1/3)

  • Second rule: 50% of remaining (1/2 of 2/3 = 1/3)

  • Third rule: 100% of remaining (1/3)

Each endpoint gets equal probability.
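
A quick sanity check of that cascade arithmetic (plain awk, nothing cluster-specific): for N endpoints, rule i fires with probability 1/(N - i + 1) of the traffic that reaches it, which works out to 1/N overall.

awk 'BEGIN {
  p[1] = 1/3; p[2] = 1/2; p[3] = 1     # the per-rule probabilities above
  remaining = 1.0
  for (i = 1; i <= 3; i++) {
    printf "rule %d gets %.3f of all traffic\n", i, remaining * p[i]
    remaining *= (1 - p[i])
  }
}'
# rule 1 gets 0.333 of all traffic
# rule 2 gets 0.333 of all traffic
# rule 3 gets 0.333 of all traffic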

Assume the random selection chooses KUBE-SEP-BBBB2 (Pod 2 on Node 2):

sudo iptables -t nat -L KUBE-SEP-BBBB2 -n
# Output:
# Chain KUBE-SEP-BBBB2 (1 references)
# target                     prot opt source       destination
# KUBE-MARK-MASQ             all  --  10.244.1.3   0.0.0.0/0
# DNAT                       tcp  --  0.0.0.0/0    0.0.0.0/0    tcp to:10.244.1.3:8080

The DNAT rule rewrites the destination:

  • Before: dst 192.168.1.10:30080 (the NodePort)

  • After: dst 10.244.1.3:8080 (the Pod’s actual IP address! We’re nearly there!)

The packet now has:

  • Source IP: 198.51.100.5 (client)

  • Destination IP: 10.244.1.3 (Pod 2)

  • Destination port: 8080

After PREROUTING, the kernel makes its routing decision. The destination 10.244.1.3 is on Node 2, not local to this node, so the packet must be forwarded.

ip route get 10.244.1.3
# Output (VXLAN example):
# 10.244.1.3 via 10.244.1.0 dev flannel.1 src 10.244.0.0

The packet will head out the flannel.1 interface to get to Node 2.

The packet passes through the FORWARD chain in the filter table:

sudo iptables -L FORWARD -n | head -10
# Output:
# Chain FORWARD (policy ACCEPT)
# target     prot opt source               destination
# KUBE-FORWARD  all  --  0.0.0.0/0        0.0.0.0/0
# KUBE-SERVICES  all  --  0.0.0.0/0       0.0.0.0/0   ctstate NEW

Before the packet leaves the node, it passes through POSTROUTING in the nat table (Don’t worry if it’s not all familiar to you):

sudo iptables -t nat -L POSTROUTING -n
# Output:
# Chain POSTROUTING (policy ACCEPT)
# target                     prot opt source       destination
# KUBE-POSTROUTING           all  --  0.0.0.0/0    0.0.0.0/0
sudo iptables -t nat -L KUBE-POSTROUTING -n
# Output:
# Chain KUBE-POSTROUTING (1 references)
# target     prot opt source               destination
# RETURN     all  --  0.0.0.0/0            0.0.0.0/0    mark match ! 0x4000/0x4000
# MARK       all  --  0.0.0.0/0            0.0.0.0/0    MARK xor 0x4000
# MASQUERADE all  --  0.0.0.0/0            0.0.0.0/0

The packet was marked by KUBE-MARK-MASQ earlier. MASQUERADE performs SNAT, changing the source IP to the node’s IP:

  • Before: src 198.51.100.5

  • After: src 192.168.1.10

The packet now has:

  • Source IP: 192.168.1.10 (Node 1)

  • Destination IP: 10.244.1.3 (Pod 2)

The packet is forwarded to Node 2 using the CNI’s cross-node mechanism (VXLAN, BGP, etc.) as described in Part 2.

On Node 2, the packet is decapsulated (if overlay) and routed to Pod 2. The pod receives:

  • Source IP: 192.168.1.10 (Node 1, because of SNAT)

  • Destination IP: 10.244.1.3 (Pod 2)

  • Destination port: 8080

The application sees the request as coming from Node 1, not the original client. The client IP has been lost due to SNAT.
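
If your backend logs request sources, the effect is easy to spot (this sketch assumes an nginx-style access log in our example deployment):

kubectl logs deploy/my-app | tail -1
# 192.168.1.10 - - [...] "GET / HTTP/1.1" 200 ...   <- the node IP, not 198.51.100.5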

Once the app is done doing what it needs to do with the packet, Pod 2’s application sends a response:

  • Source IP: 10.244.1.3

  • Destination IP: 192.168.1.10 (Node 1, from SNAT)

  • Source port: 8080

  • Destination port: 54321 (client’s original port, preserved in conntrack)

The destination 192.168.1.10 is Node 1. The packet is forwarded via the CNI.

When the packet arrives at Node 1, the kernel’s connection tracking (conntrack) recognizes it as a reply to an already established connection:

sudo conntrack -L | grep 10.244.1.3
# Output:
# tcp  6 117 TIME_WAIT src=198.51.100.5 dst=192.168.1.10 sport=54321 dport=30080 
#      src=10.244.1.3 dst=192.168.1.10 sport=8080 dport=54321 [ASSURED] mark=0 use=1

The conntrack entry shows the original connection (client to NodePort) and the reply direction (pod to node). The Linux kernel automatically reverses the NAT (kinda neat, eh?):

  • Un-SNAT: the destination 192.168.1.10 is rewritten back to 198.51.100.5 (the original client)

  • Un-DNAT: the source 10.244.1.3:8080 is rewritten back to 192.168.1.10:30080 (the NodePort the client connected to)

The packet is then sent back to the client:

  • Source IP: 192.168.1.10 (Node 1)

  • Destination IP: 198.51.100.5 (client)

  • Source port: 30080

The packet returns through the load balancer to the client. The load balancer maintains its own connection state and may perform additional translations to present the external IP (203.0.113.50) of the Load Balancer itself as the source (this hides your internal infrastructure).

The client receives the response from 203.0.113.50:80.

As if that wasn’t enough (and it was a lot), here is some additional detail on how externalTrafficPolicy shapes the routing you just saw.

The default (externalTrafficPolicy: Cluster) is what we traced above, SNAT included. With externalTrafficPolicy: Cluster:

  • Traffic can land on any node

  • If the selected pod is on a different node, traffic is forwarded

  • SNAT is applied so return traffic comes back through the same node

  • The client IP is lost

  • Load is evenly distributed

With externalTrafficPolicy: Local:

  • Traffic only goes to pods on the node that received it

  • If there are no local pods, the node fails the load balancer’s health checks and the load balancer stops sending traffic to that node

  • No SNAT is needed, because the traffic stays local to the node

  • The client IP is preserved

  • Load may be unevenly distributed: the load balancer spreads traffic across nodes, so a lone pod on one node gets as much traffic as several pods sharing another node

apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local  # Preserve client IP
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080

With externalTrafficPolicy: Local, the iptables rules will also change:

sudo iptables -t nat -L KUBE-EXT-XXXX1 -n
# Output (externalTrafficPolicy: Local):
# Chain KUBE-EXT-XXXX1 (1 references)
# target                     prot opt source       destination
# KUBE-SVC-XXXX1             all  --  0.0.0.0/0    0.0.0.0/0

Notice: there’s no KUBE-MARK-MASQ, so no MASQUERADE (SNAT) will be applied.

The KUBE-SVC chain only contains endpoints local to the node:

# On Node 1, which has Pod 1 (10.244.0.5)
sudo iptables -t nat -L KUBE-SVC-XXXX1 -n
# Output:
# Chain KUBE-SVC-XXXX1 (1 references)
# target                     prot opt source       destination
# KUBE-SEP-AAAA1             all  --  0.0.0.0/0    0.0.0.0/0

In this example, only one endpoint (the local pod) is listed. Nodes without local pods have this:

# On Node 3, which has no pods for this service
sudo iptables -t nat -L KUBE-SVC-XXXX1 -n
# Output:
# Chain KUBE-SVC-XXXX1 (1 references)
# target                     prot opt source       destination
# KUBE-MARK-DROP             all  --  0.0.0.0/0    0.0.0.0/0

The KUBE-MARK-DROP rule causes the packet to be dropped, so the health check fails (the packet is essentially thrown away). The load balancer sees this and stops sending traffic to this particular node.
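
For LoadBalancer services with externalTrafficPolicy: Local, kube-proxy also serves an HTTP health check on a dedicated port (spec.healthCheckNodePort) reporting whether the node has local endpoints; this is what most cloud load balancers actually probe. The port value below is an example:

kubectl get svc my-app -o jsonpath='{.spec.healthCheckNodePort}'
# 32000
# Then, on a node with no local pods:
curl -s http://localhost:32000/healthz
# Output (abridged): {"service":{"namespace":"default","name":"my-app"},"localEndpoints":0}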

Use Cluster when:

  • Client IP is not needed (or you are putting the IP in an X-Forwarded-For header at the LB level)

  • Even load distribution is wanted

  • All nodes should receive traffic, regardless of where the backend pods are running

Use Local when:

  • Client IP must be preserved (logging, geolocation, rate limiting)

  • Application needs to see a real client IP address for whatever reason

  • It’s ok to have uneven load balancing

From what I’ve read, when kube-proxy runs in IPVS mode the flow is similar to the long path above, but the service translation is implemented differently.

Instead of iptables chains, IPVS creates virtual servers that you can check on with the ipvsadm command.

sudo ipvsadm -Ln
# Output:
# IP Virtual Server version 1.2.1 (size=4096)
# Prot LocalAddress:Port Scheduler Flags
#   -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
# TCP  10.96.0.15:80 rr
#   -> 10.244.0.5:8080              Masq    1      2          0
#   -> 10.244.1.3:8080              Masq    1      1          0
#   -> 10.244.2.2:8080              Masq    1      3          0
# TCP  192.168.1.10:30080 rr
#   -> 10.244.0.5:8080              Masq    1      2          0
#   -> 10.244.1.3:8080              Masq    1      1          0
#   -> 10.244.2.2:8080              Masq    1      3          0

IPVS handles both ClusterIP (10.96.0.15:80) and NodePort (192.168.1.10:30080) Service types as virtual servers.

IPVS mode still uses iptables under the hood for a few cases, most notably masquerading (SNAT). For some more torture, here you go:

sudo iptables -t nat -L KUBE-POSTROUTING -n
# Output (IPVS mode):
# Chain KUBE-POSTROUTING (1 references)
# target     prot opt source               destination
# MASQUERADE all  --  0.0.0.0/0            0.0.0.0/0    match-set KUBE-LOOP-BACK dst,dst,src

IPVS supports multiple load balancing algorithms (e.g. rr = round robin).

# View the scheduler currently in use
sudo ipvsadm -Ln | grep "TCP  10.96"
# The scheduler appears after the address (rr here); others include lc, dh, sh, sed, nq

# Let's see how it's configured in the kube-proxy config
kubectl get configmap kube-proxy -n kube-system -o yaml | grep scheduler
# Output: scheduler: "rr"
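
While you’re in that ConfigMap, you can also confirm which proxy mode kube-proxy is running in (an empty value means the platform default, iptables):

kubectl get configmap kube-proxy -n kube-system -o yaml | grep -w "mode"
# Output: mode: "ipvs"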

Let’s move a little further out from the level of iptables and IPVS and examine the Ingress Controller. This resource adds another hop to the flow of traffic. Traffic flows:

Client → Load Balancer → NodePort → Ingress Controller Pod → Backend Pod

  1. An Ingress Controller (nginx, envoy, traefik, etc.) runs as pods in the cluster

  2. Those pods are exposed via a LoadBalancer or NodePort Service (so external traffic can reach the controller)

  3. Ingress resources let you route based on host or path to backend services (so a specific hostname or URL routes to a different running app; see the sketch after this list)

  4. The controller receives the traffic and proxies this traffic to backends based upon Ingress rules
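
Here’s a minimal Ingress sketch consistent with the kubectl output below; the host, class, and backend service come from our running example, so treat it as a sketch rather than a canonical manifest:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
spec:
  ingressClassName: nginx
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app
            port:
              number: 80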

kubectl get ingress
# Output:
# NAME      CLASS   HOSTS           ADDRESS         PORTS   AGE
# my-app    nginx   app.example.com 203.0.113.50    80      5d

  1. A client browser resolves, e.g., app.example.com to 203.0.113.50 (the Ingress load balancer IP)

  2. The traffic arrives at LoadBalancer

  3. The load balancer forwards to the NodePort of the Ingress Controller Service

  4. iptables routes this traffic to an Ingress Controller pod (as we have already discussed)

  5. The Ingress Controller examines the Host header and path

  6. The controller then opens a new connection to the backend Service (ClusterIP)

  7. iptables routes to the backend pod as we’ve discussed before.

  8. The response returns through the controller to the client

The Ingress Controller terminates the original connection and creates a new one, providing L7 routing capabilities.
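
You can exercise the host-based routing by hand; the same LB IP answers differently depending on the Host header (the second hostname is hypothetical):

# Matches the Ingress rule, proxied to my-app
curl -s -H "Host: app.example.com" http://203.0.113.50/

# No matching rule: the controller's default backend answers (a 404 for ingress-nginx)
curl -s -H "Host: other.example.com" http://203.0.113.50/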

# View Ingress Controller pods and their node placement
kubectl get pods -n ingress-nginx -o wide
# Output:
# NAME                                        READY   STATUS    IP           NODE
# ingress-nginx-controller-5c8d66c76d-abc12   1/1     Running   10.244.0.8   node-1
# ingress-nginx-controller-5c8d66c76d-def34   1/1     Running   10.244.1.9   node-2

# Watch packet counts through service chains
sudo iptables -t nat -L KUBE-SVC-XXXX1 -n -v
# Output:
# Chain KUBE-SVC-XXXX1 (2 references)
#  pkts bytes target     prot opt in     out     source               destination
#   847  50K KUBE-SEP-AAAA1  all  --  *      *   0.0.0.0/0            0.0.0.0/0    statistic mode random probability 0.333
#   823  49K KUBE-SEP-BBBB2  all  --  *      *   0.0.0.0/0            0.0.0.0/0    statistic mode random probability 0.500
#   851  51K KUBE-SEP-CCCC3  all  --  *      *   0.0.0.0/0            0.0.0.0/0

# Watch connection tracking for a specific service
sudo conntrack -E -p tcp --dport 30080
# Output (live events):
# [NEW] tcp      6 120 SYN_SENT src=198.51.100.5 dst=192.168.1.10 sport=54321 dport=30080
# [UPDATE] tcp   6 60 SYN_RECV src=198.51.100.5 dst=192.168.1.10 sport=54321 dport=30080
# [UPDATE] tcp   6 432000 ESTABLISHED src=198.51.100.5 dst=192.168.1.10 sport=54321 dport=30080

tcpdump is your friend here:

# At the node's physical interface (incoming)
sudo tcpdump -i eth0 -nn port 30080

# At the bridge (after DNAT, before forwarding)
sudo tcpdump -i cni0 -nn port 8080

# At the VXLAN interface (cross-node traffic)
sudo tcpdump -i flannel.1 -nn port 8080

# Inside the pod
kubectl exec -it my-pod -- tcpdump -i eth0 -nn port 8080

# Enable iptables tracing (verbose, use sparingly)
sudo iptables -t raw -A PREROUTING -p tcp --dport 30080 -j TRACE
sudo iptables -t raw -A OUTPUT -p tcp --sport 8080 -j TRACE

# View trace in kernel log
sudo dmesg -w | grep TRACE

# Clean up when done
sudo iptables -t raw -D PREROUTING -p tcp --dport 30080 -j TRACE
sudo iptables -t raw -D OUTPUT -p tcp --sport 8080 -j TRACE

When traffic isn’t flowing the way you expect, work through this checklist:

  1. Verify that the LoadBalancer does, in fact, have an external IP address:

kubectl get svc my-app
# Check EXTERNAL-IP is not <pending>

  2. Verify that the NodePort is open:

# From a node
ss -tlnp | grep 30080
# Output should show kube-proxy listening

  3. Check that the endpoints exist (you probably won’t need this often):

kubectl get endpoints my-app
# Output:
# NAME     ENDPOINTS                                         AGE
# my-app   10.244.0.5:8080,10.244.1.3:8080,10.244.2.2:8080   5d

  4. Verify iptables rules:

sudo iptables -t nat -L KUBE-SERVICES -n | grep my-app

  5. Check externalTrafficPolicy:

kubectl get svc my-app -o jsonpath='{.spec.externalTrafficPolicy}'
# Output: Cluster (means SNAT is applied)

  6. Change to Local if the client IP is needed:

kubectl patch svc my-app -p '{"spec":{"externalTrafficPolicy":"Local"}}'

  7. Verify pods are running on the nodes receiving traffic:

kubectl get pods -o wide -l app=my-app

  8. Check if SNAT is happening when it actually shouldn’t:

sudo conntrack -L -d <pod-ip> | head
# Is the source IP the client's or the node's?

  9. Verify that the CNI is forwarding cross-node traffic:

# On source node
sudo tcpdump -i flannel.1 -nn host <pod-ip>

  10. Check that the pod is healthy:

kubectl describe pod <pod-name> | grep -A5 Conditions

North-south traffic through a LoadBalancer service follows this path:

  1. Client connects to external load balancer IP address

  2. Load balancer forwards to the NodePort on a healthy node

  3. iptables PREROUTING/KUBE-SERVICES chains intercept the packet

  4. KUBE-SVC chain randomly selects a backend pod (this is the load balancing decision)

  5. KUBE-SEP chain performs DNAT to the pod IP

  6. If the pod is on a different node, SNAT is applied (externalTrafficPolicy: Cluster)

  7. Packet is forwarded to the pod via CNI

  8. Return traffic uses conntrack to reverse the NAT translations

Two choices for externalTrafficPolicy:

  • externalTrafficPolicy: Cluster: Even load distribution, loses client IP

  • externalTrafficPolicy: Local: Preserves client IP, may have uneven distribution

Part 4 will cover encryption in flight: where TLS terminates, CNI-level encryption options, and how to achieve end-to-end encryption without a service mesh.

  • Service: https://kubernetes.io/docs/concepts/services-networking/service/

  • Service Types: https://kubernetes.io/docs/concepts/services-networking/service/#publishing-services-service-types

  • External Traffic Policy: https://kubernetes.io/docs/tasks/access-application-cluster/create-external-load-balancer/#preserving-the-client-source-ip

  • Ingress: https://kubernetes.io/docs/concepts/services-networking/ingress/

  • Ingress Controllers: https://kubernetes.io/docs/concepts/services-networking/ingress-controllers/

  • kube-proxy Modes: https://kubernetes.io/docs/reference/command-line-tools-reference/kube-proxy/

  • IPVS Proxy Mode: https://kubernetes.io/docs/concepts/services-networking/service/#proxy-mode-ipvs

  • Virtual IPs and Service Proxies: https://kubernetes.io/docs/reference/networking/virtual-ips/

  • iptables: https://netfilter.org/documentation/

  • iptables-extensions (statistic module): https://man7.org/linux/man-pages/man8/iptables-extensions.8.html

  • conntrack: https://conntrack-tools.netfilter.org/

  • conntrack man page: https://man7.org/linux/man-pages/man8/conntrack.8.html

  • IPVS: http://www.linuxvirtualserver.org/software/ipvs.html

  • ipvsadm: https://man7.org/linux/man-pages/man8/ipvsadm.8.html

  • AWS ELB: https://docs.aws.amazon.com/elasticloadbalancing/

  • GCP Load Balancing: https://cloud.google.com/load-balancing/docs

  • Azure Load Balancer: https://docs.microsoft.com/en-us/azure/load-balancer/

  • MetalLB: https://metallb.universe.tf/

  • NGINX Ingress Controller: https://kubernetes.github.io/ingress-nginx/

  • Traefik: https://doc.traefik.io/traefik/providers/kubernetes-ingress/

  • Envoy/Contour: https://projectcontour.io/

  • HAProxy Ingress: https://haproxy-ingress.github.io/