Systematic Troubleshooting in Kubernetes

Introduction: The Reality of Operations

You've spent the last four lessons building a solid foundation for reliable Kubernetes applications — organizing environments with namespaces, managing resources with requests and limits, implementing health checks with probes, and enabling automatic scaling with HPA. But here's the reality: even with all these reliability features in place, things will still break. A deployment might not start, pods might crash unexpectedly, or services might fail to communicate with each other.

When these issues happen in production at 2 a.m., you need a systematic way to diagnose problems quickly and confidently. In this lesson, you'll learn a repeatable troubleshooting workflow that guides you from "something's broken" to "I know exactly what's wrong" using eight diagnostic steps that work for most operational problems in Kubernetes.

The Application Under Test: A Multi-Tier Setup

To practice realistic troubleshooting, you need an application that mirrors what you'll encounter in production. Most real-world applications aren't single deployments — they're multi-tier systems with frontend components that talk to backend services, databases, and external APIs. We're going to deploy a simplified version of this architecture: a frontend service running nginx that theoretically communicates with a backend service, also running nginx for simplicity. While both tiers use the same image in this example, they represent distinct layers of your application stack, and each has its own deployment and service resource.

Let's build this setup step by step, starting with the frontend deployment. Create a file called frontend-deployment.yaml:

This creates a deployment named frontend with two replicas for high availability. Now add the selector and pod template:
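A sketch of what the complete file might contain, covering both the name and replica count described above and the selector and pod template that follow. It is written as a shell heredoc so it can be pasted into a terminal. The `nginx:stable` tag and the container name `nginx` are assumptions; the lesson itself only specifies an nginx image, two replicas, the `app: frontend` label, and port 80.

```shell
# Write the frontend Deployment manifest (sketch; the image tag is an assumption).
cat > frontend-deployment.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  replicas: 2              # two replicas for high availability
  selector:
    matchLabels:
      app: frontend        # must match the pod template labels below
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
        - name: nginx
          image: nginx:stable   # assumed tag; any valid nginx tag works
          ports:
            - containerPort: 80
EOF
```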

The selector tells the deployment to manage pods labeled with app: frontend, and the template defines those pods with an nginx container exposing port 80. Save this file — it's a standard deployment you've seen in previous lessons. Next, create frontend-service.yaml to expose this deployment:
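A minimal sketch of the service manifest, again written with a heredoc. It assumes the service is named `frontend` and forwards port 80 to container port 80; adjust both if your setup differs.

```shell
# Write the frontend Service manifest (sketch).
cat > frontend-service.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: frontend
spec:
  selector:
    app: frontend     # routes to pods carrying this label
  ports:
    - port: 80        # port the service listens on
      targetPort: 80  # container port on the pods
EOF
```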

This resource creates a stable endpoint called frontend that routes traffic to any pod with the label app: frontend on port 80. The service abstraction is critical because it gives us a predictable DNS name (frontend) even as individual pods come and go. Now, let's create the backend tier, starting with its deployment manifest:
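The backend tier can follow the same pattern. The sketch below assumes the file names `backend-deployment.yaml` and `backend-service.yaml` and the `nginx:stable` tag, none of which are fixed by the lesson. The single replica and the `backend` service name on port 80 match what the later diagnostic steps expect.

```shell
# Backend Deployment: one replica, same nginx image (tag is an assumption).
cat > backend-deployment.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
spec:
  replicas: 1
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
    spec:
      containers:
        - name: nginx
          image: nginx:stable
          ports:
            - containerPort: 80
EOF

# Backend Service: reachable inside the cluster as http://backend:80.
cat > backend-service.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: backend
spec:
  selector:
    app: backend
  ports:
    - port: 80
      targetPort: 80
EOF
```

Running kubectl apply -f against each of the manifest files then creates the two-tier setup in the current namespace.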

Starting the Investigation: Deployments and Pods

When something goes wrong, don't immediately jump to logs or start describing random pods. Start with the big picture and progressively drill down to specific problems. The first step in any troubleshooting workflow is checking whether your deployments have successfully created the pods you expected. Run this command to see the status of all deployments:
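The command is the one named in Step 1 of the workflow later in this lesson. It assumes kubectl's current context and namespace point at the cluster where the application was deployed.

```shell
# Show rollout status for every deployment in the current namespace.
kubectl get deployments
```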

You'll see a table listing every deployment in the current namespace, with NAME, READY, UP-TO-DATE, AVAILABLE, and AGE columns.

The READY column is the key indicator here — it shows current/desired replicas. For the frontend deployment, 2/2 means you wanted two replicas and you have two ready replicas. For backend, 1/1 means your single replica is ready. The UP-TO-DATE column shows how many replicas are running the latest pod template (important after configuration changes), and AVAILABLE shows how many replicas are ready to serve traffic. If you see 0/2 or 1/2 in the READY column, you know immediately that some pods aren't starting correctly, and you need to investigate why.
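To make the READY comparison concrete, here is a small sketch that flags any deployment whose ready count differs from its desired count. The `not_ready` helper is hypothetical, and the two input lines are illustrative stand-ins for real output, not captured from a cluster.

```shell
# Hypothetical helper: print deployments whose ready count differs from the desired count.
# Expects lines shaped like `kubectl get deployments --no-headers` output on stdin.
not_ready() {
  awk '{
    split($2, counts, "/")   # READY column, e.g. "1/2"
    if (counts[1] != counts[2]) { print $1 ": " counts[1] " of " counts[2] " ready" }
  }'
}

# Illustrative input only; on a live cluster, pipe the real command instead:
#   kubectl get deployments --no-headers | not_ready
printf '%s\n' 'frontend 2/2 2 2 5m' 'backend 0/1 1 0 5m' | not_ready
```

With the illustrative input, only the backend line is reported, because its ready count (0) is below its desired count (1).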

This deployment-level view answers the fundamental question: "Did Kubernetes successfully create my application?" If all your deployments show the expected replica counts, the problem likely isn't with resource creation — it might be networking, configuration, or application-level issues. If deployments aren't reaching their desired state, you've found the starting point for your investigation. Now, let's drill down to the individual pods that deployments manage. Check the frontend pods specifically:
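The command below matches Step 2 of the workflow later in this lesson, with the label set to the frontend tier.

```shell
# List only the pods labeled app=frontend.
kubectl get pods -l app=frontend
```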

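The -l app=frontend flag filters pods by label, showing only the frontend tier. The output has NAME, READY, STATUS, RESTARTS, and AGE columns; when everything is healthy, both frontend pods show 1/1 under READY and Running under STATUS.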

Deeper Diagnosis: Pod Inspection and Logs

When you identify a problematic pod from the previous step, the next move is to inspect its detailed configuration and recent history using commands you learned in the Health Probes lesson. The kubectl describe pod command provides comprehensive information about what happened to a specific pod. Let's inspect one of the frontend pods (replace <POD_NAME> with an actual pod name from your kubectl get pods output):
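One way to run it without copying a pod name by hand is to look the name up by label first. This is a sketch: picking the first pod in the list is an arbitrary choice, and you can set `POD_NAME` to any name from your kubectl get pods output instead.

```shell
# Pick the first frontend pod by label (or paste a name from `kubectl get pods`).
POD_NAME=$(kubectl get pods -l app=frontend -o jsonpath='{.items[0].metadata.name}')

# Print the pod's configuration, status, and the Events list at the bottom.
kubectl describe pod "$POD_NAME"
```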

This command outputs extensive details, but for troubleshooting, focus on the Events section at the bottom. For a healthy pod, the events typically read Scheduled, Pulling, Pulled, Created, and Started, in that order.

All events have Type: Normal, meaning everything went smoothly. When troubleshooting a broken pod, scan for events with Type: Warning or messages containing "Error" or "Failed" — these reveal infrastructure-level problems. Common patterns include image pull failures (Warning Failed ... Failed to pull image "nginx:1.99": rpc error: code = NotFound), which tell you the image name is wrong or the tag doesn't exist. Resource constraints appear as Warning FailedScheduling ... 0/3 nodes are available: insufficient cpu, meaning no nodes have enough resources to satisfy the pod's requests. Probe failures show up as Warning Unhealthy ... Liveness probe failed: HTTP probe failed with statuscode: 500, followed by Normal Killing ... Container nginx failed liveness probe, will be restarted. Each error pattern points directly to a specific type of problem.
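Events describe what the platform did to the pod; the application's own output comes from kubectl logs. The sketch below assumes the same label lookup as before for `POD_NAME`. The final command, which lists only Warning events for the namespace, is an optional extra rather than part of the lesson's workflow.

```shell
POD_NAME=$(kubectl get pods -l app=frontend -o jsonpath='{.items[0].metadata.name}')

# Output from the currently running container.
kubectl logs "$POD_NAME"

# Output from the previous container instance; only meaningful after a restart.
kubectl logs "$POD_NAME" --previous

# Optional: only Warning events in the current namespace, oldest first.
kubectl get events --field-selector type=Warning --sort-by=.lastTimestamp
```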

Network Troubleshooting: Services and Connectivity

Even when pods are running perfectly, they might not be reachable through their services. This happens when service selectors don't match pod labels or when network policies block traffic. The first step in network troubleshooting is verifying that services can actually find their target pods. The endpoints resource shows which pod IPs a service is routing to:
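A sketch of the check for the frontend service, assuming the service is named `frontend` as in the setup above.

```shell
# Show the pod IP:port pairs the frontend service currently routes to.
kubectl get endpoints frontend
```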

The output has NAME, ENDPOINTS, and AGE columns, with ENDPOINTS showing the actual IP addresses of the pods backing this service.

The ENDPOINTS column lists the IP:port combinations of pods that match the service's selector. In this case, two pod IPs are listed (matching our two frontend replicas), each listening on port 80. This confirms the service successfully found its target pods. If you see <none> in the ENDPOINTS column, it means the service selector doesn't match any running pods — this is often caused by label mismatches between the service's spec.selector and the pod's metadata.labels.
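When the ENDPOINTS column is empty, comparing the two sides of the match directly is often the quickest check. A minimal sketch, assuming the `frontend` service name used in this lesson:

```shell
# Labels the service is selecting on.
kubectl get service frontend -o jsonpath='{.spec.selector}'

# Labels actually present on the pods in the namespace.
kubectl get pods --show-labels
```

Any key or value that differs between the two outputs, including a typo such as a misspelled label value, is enough to leave the service without endpoints.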

Let's verify the backend service's endpoints too by running kubectl get endpoints backend. You should see a single IP:port entry for the one backend pod.

Now you know both services have valid endpoints, meaning service discovery is working correctly. But what if you need to test whether one service can actually reach another? In multi-tier applications, the frontend often needs to call the backend. Kubernetes internal DNS should make this work automatically — the frontend pods should be able to reach the backend at http://backend:80.

To test this, we can't simply run curl from our local terminal. Cluster IPs are internal to the Kubernetes virtual network and are not reachable from outside the cluster network by default. To accurately simulate how your application components communicate, we must test from the cluster network. We do this by spinning up a temporary pod:
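A sketch of one such temporary pod. The pod name `curl-test` and the `curlimages/curl` image are assumptions; any image that ships curl would serve the same purpose.

```shell
# Start a throwaway pod, send one HTTP request to the backend service by its
# DNS name, and delete the pod when the command exits (--rm).
kubectl run curl-test --rm -it --restart=Never \
  --image=curlimages/curl --command -- curl -s http://backend:80
```

If DNS and the service are working, the response body is the nginx welcome page HTML served by the backend pod.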

Local Debugging: Port Forwarding for Direct Access

Sometimes you need to test a service directly from your local machine without going through Kubernetes networking. The kubectl port-forward command creates a tunnel from your laptop to a Kubernetes service or pod, allowing you to access it as if it were running locally. This is particularly useful for debugging services that aren't exposed outside the cluster or for quickly testing configuration changes without setting up ingress rules:
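A sketch of the command for the frontend service. Local port 8080 is the value used in this lesson's walkthrough; any free local port works.

```shell
# Forward local port 8080 to port 80 of the frontend service.
# The command blocks until you press Ctrl+C.
kubectl port-forward service/frontend 8080:80
```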

This command creates a port forward from your local port 8080 to the frontend service's port 80. The syntax is local-port:service-port. When you run this, kubectl prints a line such as Forwarding from 127.0.0.1:8080 -> 80.

This means kubectl is now actively forwarding traffic. Keep this terminal window open, because the port forward only works while the command is running. Now open a web browser or a second terminal and access http://localhost:8080. If the frontend service is working correctly, you'll see the nginx welcome page. The connection is tunneled through the Kubernetes API server to a pod behind the service, bypassing all external networking complexity.

Port forwarding is valuable because it isolates connectivity problems. If port forwarding works but external access doesn't, you know the issue is in your ingress or load balancer configuration, not in the pods or services themselves. You can also forward directly to a specific pod instead of a service:
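A sketch of the pod-level variant, reusing the label lookup to choose a pod. Choosing the first pod is arbitrary; substitute the name of whichever pod you suspect.

```shell
# Pick one frontend pod, then forward local port 8080 straight to its port 80.
POD_NAME=$(kubectl get pods -l app=frontend -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward "pod/$POD_NAME" 8080:80
```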

This bypasses the service entirely and connects straight to one pod, which helps determine if the problem is with the service routing or with individual pods. To stop port forwarding, press Ctrl+C in the terminal running the command. The tunnel closes immediately and your local port becomes available again.

Port forwarding is primarily a debugging tool, not a production access method. It requires your command to keep running, it only forwards to one backend pod (not load balanced), and it depends on your local machine's connectivity to the Kubernetes API server. For actual production traffic, you'd use Services, Ingress, or LoadBalancer resources. But for troubleshooting, port forwarding gives you quick, direct access to test whether pods are responding correctly.

External Access Overview

So far, we've focused on troubleshooting problems within the cluster. But what happens when users outside the cluster try to access your application? That's where LoadBalancer and Ingress resources come into play. A LoadBalancer service type requests an external load balancer from your cloud provider that routes internet traffic to your service. An Ingress resource acts as an HTTP router at the cluster edge, allowing you to expose multiple services through a single external endpoint with path-based or host-based routing rules. When external users report they can't access your application, you need to verify that the external access layer is correctly configured and pointing to healthy internal services.

Troubleshooting Ingress Configuration

An Ingress resource defines routing rules that tell Kubernetes how to direct external HTTP traffic to your internal services. Here's what a simple Ingress looks like:
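A minimal sketch of such a manifest. The resource name `frontend-ingress`, the file name, and the host `frontend.example.com` are hypothetical, and the sketch assumes the cluster has a default ingress class so that ingressClassName can be omitted.

```shell
# Write a simple Ingress that sends all HTTP paths for one host to the frontend service.
cat > frontend-ingress.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: frontend-ingress            # hypothetical name
spec:
  rules:
    - host: frontend.example.com    # hypothetical host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: frontend      # must match an existing Service
                port:
                  number: 80        # must match a port on that Service
EOF
```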

The backend section is critical for troubleshooting — it must reference an actual service name and port that exists in your cluster. If the service name is wrong or the port doesn't match, external traffic can't reach your application even though pods and services are healthy. To see all Ingress resources, run:
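The listing command, run against the current namespace:

```shell
# List all Ingress resources in the current namespace.
kubectl get ingress
```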

The output lists each Ingress with NAME, CLASS, HOSTS, ADDRESS, PORTS, and AGE columns.

The HOSTS column shows which domain names this Ingress responds to. The ADDRESS column shows the external IP where traffic should be sent — if this is empty or <pending>, your ingress controller might not be running. To examine detailed routing rules, use describe:
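A sketch of the describe call. The name `frontend-ingress` is a hypothetical example; use whatever name kubectl get ingress reports in your cluster.

```shell
# Show rules, backends, and recent events for one Ingress.
kubectl describe ingress frontend-ingress
```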

The describe output shows the routing configuration as a rules table that maps each host and path to its backend.

The Backends section is crucial — it shows which service and port the Ingress routes to, plus the actual pod IPs. If you see <error: endpoints "frontend" not found>, the Ingress can't find the service. If the service exists but shows no pod IPs, the service has no endpoints (likely a selector mismatch).

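Common Ingress problems you'll encounter include a backend that names a service that doesn't exist, a backend port that doesn't match the service's port, an empty or <pending> ADDRESS because no ingress controller is running, and a service with no endpoints behind it.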

The Complete Troubleshooting Workflow

You've now learned eight distinct diagnostic steps. Let's bring them together into a systematic troubleshooting workflow that you can apply to any Kubernetes problem. The key is following these steps in order, starting broad and drilling down progressively:

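Step 1: Check Deployment Status. Run kubectl get deployments to see if your desired replica count matches the current state. This immediately tells you if pods are being created successfully. If you see 0/2 or 1/3, you know pod creation is failing and you need to investigate why.
Step 2: Check Pod Status. Run kubectl get pods -l app=<label> to see the status of individual pods. Look at the STATUS column for Running, Pending, CrashLoopBackOff, or Error states. Check the RESTARTS column for repeated failures. This tells you which specific pods need deeper investigation.
Step 3: Inspect Pod Details. For any problematic pod, run kubectl describe pod <pod-name> and focus on the Events section at the bottom. Look for Warning events or messages containing "Error" or "Failed." Events reveal infrastructure problems like image pull failures, resource constraints, or probe failures.
Step 4: View Application Logs. Run kubectl logs <pod-name> to see what the application itself is reporting. If the pod is restarting, use kubectl logs <pod-name> --previous to see logs from the crashed container. Logs reveal application-level problems like configuration errors, connection failures, or code bugs.
Step 5: Verify Service Endpoints. Run kubectl get endpoints <service-name> to confirm the service has found its pods. A value of <none> in the ENDPOINTS column points to a mismatch between the service selector and the pod labels.
Step 6: Test In-Cluster Connectivity. Start a temporary pod and curl the service by its DNS name, for example http://backend:80, to confirm that one tier can reach another from inside the cluster network.
Step 7: Port-Forward for Direct Access. Run kubectl port-forward service/<service-name> 8080:80 and open http://localhost:8080. If this works but external access fails, the problem is in the Ingress or load balancer layer, not in the pods or services.
Step 8: Check Ingress Configuration. Run kubectl get ingress and kubectl describe ingress <ingress-name> to confirm that the ADDRESS is populated and that the backend references an existing service and port.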
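Steps 1 through 4 can be strung together in a small shell function. This is a sketch, not part of the lesson's tooling: the `triage` name and the 20-line log tail are arbitrary choices, and it assumes pods carry an `app=<label>` label as in this lesson's manifests.

```shell
# Hypothetical helper: run workflow steps 1-4 for one app label.
triage() {
  app_label="$1"

  echo "== Step 1: deployment status =="
  kubectl get deployments

  echo "== Step 2: pod status (app=$app_label) =="
  kubectl get pods -l "app=$app_label"

  # Steps 3 and 4 for every pod carrying the label.
  for pod in $(kubectl get pods -l "app=$app_label" -o name); do
    echo "== Step 3: describe $pod =="
    kubectl describe "$pod"

    echo "== Step 4: logs $pod =="
    kubectl logs "$pod" --tail=20
    # Logs of the previous container; errors are ignored when there was no restart.
    kubectl logs "$pod" --previous --tail=20 2>/dev/null || true
  done
}
```

Calling triage frontend prints the four views in order, which keeps the broad-to-narrow sequence intact even under pressure.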

Summary: Your Operational Troubleshooting Toolkit

You now have a complete, systematic troubleshooting workflow that handles most operational problems in Kubernetes. This eight-step process guides you from high-level deployment status down to detailed application logs, network connectivity, and external access configuration, ensuring you don't miss critical diagnostic information. You learned how to interpret deployment status, pod phases, and restart counts to identify which components are failing.

You practiced using kubectl describe to find infrastructure events and kubectl logs to see application output. You verified service discovery with endpoints and tested connectivity with temporary curl pods. You used port forwarding to access services directly from your local machine. Finally, you learned to verify external access through Ingress resources.

The upcoming practices focus on diagnosis only — your goal is to run the diagnostic commands and identify what's wrong, not to fix the problems. The test validation checks that you've run the correct diagnostic commands and understand the broken state.
