Introduction: When Fixed Replica Counts Aren't Enough

In previous lessons, you've been setting a fixed number of replicas in your deployments — maybe 2 or 3 pods to handle your application's traffic. But what happens when traffic suddenly doubles at 3 AM because your product goes viral on social media? Your fixed replica count means users experience slow response times or timeouts while your servers struggle under the load. By the time someone wakes up, checks monitoring dashboards, and manually runs kubectl scale to add more replicas, you've already lost customers and revenue.

Conversely, during quiet periods, those extra replicas you added sit idle, wasting cluster resources and money. In this lesson, you'll learn how Horizontal Pod Autoscaling (HPA) solves this problem by automatically adjusting replica counts based on actual resource usage, ensuring your application scales up during traffic spikes and scales down during quiet periods without any manual intervention.

What is Horizontal Pod Autoscaling?

Horizontal Pod Autoscaling is a Kubernetes resource that continuously monitors your pods and automatically adjusts the number of replicas in a Deployment, ReplicaSet, or StatefulSet based on observed metrics like CPU or memory usage. Think of HPA as a control system that runs in a continuous loop: it checks the current resource utilization, compares it to your target threshold, calculates how many replicas are needed to meet that target, and then adjusts the replica count accordingly. This loop runs every 15 seconds by default (though you typically see changes reflected in kubectl output every 30–60 seconds due to metric collection delays). The beauty of HPA is that it makes scaling decisions using the same API calls you'd make manually with kubectl scale, but it does so automatically, consistently, and without requiring a human to watch dashboards.

When you create an HPA resource, you're essentially telling Kubernetes: "Watch this Deployment and keep the average CPU utilization across all its pods around 70%. If it goes higher, add more replicas to spread the load. If it drops lower, remove replicas to free up resources." The HPA controller running in the Kubernetes control plane handles all the complexity of gathering metrics from the metrics-server (a cluster component that collects resource usage data from each node), performing calculations to determine the optimal replica count, and updating the Deployment's replica field. You just define what you want, and HPA makes it happen.
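
The calculation step of that loop can be sketched in a few lines of shell. The formula, desired = ceil(currentReplicas × currentUtilization / targetUtilization), is the one described in the Kubernetes documentation; the function name desired_replicas is hypothetical, and the sketch deliberately ignores the controller's tolerance band, min/max clamping, and stabilization windows.

```shell
# Sketch of the core HPA replica calculation (no tolerance, clamping, or stabilization).
# Usage: desired_replicas <current_replicas> <current_utilization_%> <target_utilization_%>
desired_replicas() {
  local current=$1 util=$2 target=$3
  # Integer ceiling of (current * util / target)
  echo $(( (current * util + target - 1) / target ))
}

desired_replicas 2 90 70   # prints 3
desired_replicas 4 35 70   # prints 2
```

With 2 replicas running at 90% against a 70% target, the result rounds up to 3; with 4 replicas at 35%, it drops to 2.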

Development Note: In the practice exercises, you'll patch metrics-server with the --kubelet-insecure-tls flag to make it work in Kind's development environment. This flag bypasses TLS verification and should never be used in production — real clusters must configure proper TLS certificates for kubelet communication. This is purely a Kind/local development workaround.
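
For reference, one common way to add that flag is a JSON patch against the metrics-server Deployment. This is a sketch for local Kind clusters only: it assumes metrics-server was installed as a Deployment named metrics-server in the kube-system namespace, and the wrapper function name is hypothetical.

```shell
# Local development only: disables kubelet TLS verification for metrics-server.
# Assumes Deployment "metrics-server" in namespace "kube-system".
patch_metrics_server_for_kind() {
  kubectl patch deployment metrics-server -n kube-system --type=json \
    -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
}

# Usage (against a Kind cluster):
#   patch_metrics_server_for_kind
```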

The Critical Prerequisite: Resource Requests

Here's something crucial that often trips up newcomers: HPA cannot work without resource requests defined in your Deployment. Remember in Lesson 2 when you learned about resource requests and limits? You learned that requests help Kubernetes schedule pods on nodes with enough capacity. Now you're about to discover another critical reason to always set resource requests: HPA uses them as the baseline for calculating utilization percentages. When you tell HPA to maintain 70% CPU utilization, it needs to know "70% of what?" The answer is: 70% of the CPU request you defined for each container.

There's also a second prerequisite: the metrics API must be available in your cluster. HPA retrieves current resource utilization data from the metrics API, which is typically provided by the metrics-server component. Without metrics-server running and healthy, HPA has no way to know how much CPU or memory your pods are currently using. It can only see the requests you defined, not the actual consumption. Many managed Kubernetes services (such as GKE and AKS) include metrics-server by default, but on other platforms, or if you're running a local cluster with Kind or setting up your own cluster, you may need to install it separately. When metrics-server isn't available, kubectl get hpa will show <unknown> for current utilization values, and HPA won't make any scaling decisions.
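
A quick way to check this prerequisite is to ask the cluster whether the metrics API is registered and returning data. The sketch below wraps two standard kubectl commands in a hypothetical helper named check_metrics_api; it assumes your current kubectl context points at the cluster you want to test.

```shell
# Sketch: verify that the resource metrics API is registered and serving data.
check_metrics_api() {
  # The APIService object that metrics-server registers with the API server.
  kubectl get apiservice v1beta1.metrics.k8s.io
  # If this prints CPU and memory figures, HPA can read them too.
  kubectl top pods
}

# Usage:
#   check_metrics_api
```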

Let's make this concrete with an example. If your container has a CPU request of 100m (100 millicores, or 0.1 CPU cores), and HPA is configured to maintain 70% utilization, HPA will try to keep each pod using around 70m of CPU on average. If your pods start using 90m each (90% utilization), HPA sees that you're exceeding the target and adds more replicas to spread the load. If usage drops to 40m per pod (40% utilization), HPA removes replicas because you're using less than the target threshold. Without a CPU request defined, HPA has no reference point: it can't calculate a percentage without knowing what 100% means. In that case, kubectl get hpa will show <unknown> for current metrics, and HPA won't make any scaling decisions.
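
The percentage in that example is simply usage divided by request. Here is a minimal sketch of the arithmetic for a single pod; the function name cpu_utilization_pct is hypothetical, and the real controller averages across all pods rather than looking at one.

```shell
# Sketch: CPU utilization as a percentage of the CPU request (single pod).
# Usage: cpu_utilization_pct <usage_millicores> <request_millicores>
cpu_utilization_pct() {
  local usage_m=$1 request_m=$2
  if [ "$request_m" -le 0 ]; then
    # No request defined: there is no "100%" to measure against.
    echo "unknown"
    return 0
  fi
  # Integer division, so the result is truncated to a whole percent.
  echo $(( usage_m * 100 / request_m ))
}

cpu_utilization_pct 90 100   # prints 90
cpu_utilization_pct 40 100   # prints 40
cpu_utilization_pct 90 0     # prints unknown
```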

Building a Deployment Ready for Autoscaling

Before you can create an HPA, you need a Deployment that's properly configured with resource requests. Let's build a deployment for a simple nginx web application that includes everything HPA needs to function. Create a new file called deployment-scalable.yaml and start with the basic deployment structure:

This looks familiar from previous lessons: we're creating a Deployment named web-app-scalable that will manage pods with the label app: web-app. We're starting with replicas: 2 as a baseline. Once an HPA targets this Deployment, the HPA takes over the replica count, and its minReplicas and maxReplicas settings (not this initial value) determine how far it can scale in either direction. Two replicas are a sensible starting point: they provide basic availability without wasting resources, and they match the minimum you'll configure in the HPA later in this lesson.

Now let's add the pod template that defines the container configuration:

So far, this is a standard pod template: we're running the nginx:1.25 image, labeling the pod with app: web-app so the Deployment can manage it, and exposing port 80 for HTTP traffic. The critical part comes next, which is adding resource requests:
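
Putting the three steps together, a complete deployment-scalable.yaml might look like the following sketch. The Deployment name, label, replica count, image, and port come from the text above; the container name nginx and the 100m CPU request are assumptions chosen to match the earlier example, so adjust them for your own workload.

```shell
# Write the manifest described in this section to deployment-scalable.yaml.
# Assumed values: container name "nginx" and a CPU request of 100m.
cat > deployment-scalable.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app-scalable
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: nginx
          image: nginx:1.25
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: 100m   # HPA treats this value as 100% utilization
EOF
```

The resources.requests.cpu field is the part HPA depends on: every utilization percentage it reports is measured against this value.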

Configuring the HPA Resource

Now that you have a Deployment with resource requests, you can create an HPA resource to control its scaling behavior. Create a new file called hpa.yaml. Let's build this configuration step by step, starting with the basic resource definition:

The apiVersion is autoscaling/v2, which is the current stable version of the HPA API that supports multiple metric types, including CPU, memory, and custom metrics. The kind is HorizontalPodAutoscaler, which tells Kubernetes you're creating an HPA resource. We're naming it web-hpa to clearly indicate its purpose. Note that the HPA is a separate resource from the Deployment — you could delete and recreate the HPA without affecting the running pods, though autoscaling would stop working until you recreate it.

The most important section of the HPA configuration is the scaleTargetRef, which tells HPA which resource to control:

The scaleTargetRef section creates the connection between your HPA and the Deployment. The apiVersion: apps/v1 and kind: Deployment fields specify that you're targeting a Deployment resource. The name field must exactly match the name you gave the Deployment (web-app-scalable). This is how HPA knows which Deployment's replica count to adjust. Each HPA can target only one resource, but you can create multiple HPA resources if you need to autoscale multiple deployments.
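
Assembled into one file, the HPA described in this lesson might look like the sketch below. The name, target reference, and 70% CPU target come from the text; the minReplicas and maxReplicas values of 2 and 5 are the ones reported in the status output later in the lesson.

```shell
# Write the HPA manifest described in this section to hpa.yaml.
cat > hpa.yaml <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app-scalable   # must match the Deployment's metadata.name
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the CPU request
EOF
```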

Applying the Deployment and HPA

Let's deploy both the Deployment and the HPA to your cluster. Start by applying the deployment configuration:

You should see confirmation that the deployment was created:

Verify the pods started successfully:

You should see two pods in Running status:

Now create the HPA resource:

You'll see confirmation:

The HPA controller has now registered your HPA and is beginning its observation loop.
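
The steps in this section can be collected into a short script. This is a sketch that assumes both manifest files are in the current directory and that kubectl points at your practice cluster; the function name deploy_with_autoscaling is hypothetical, and the rollout status step is an addition that waits for the pods to become ready before the HPA is created.

```shell
# Sketch: apply the Deployment, wait for it to be ready, list its pods, then apply the HPA.
deploy_with_autoscaling() {
  kubectl apply -f deployment-scalable.yaml &&
  kubectl rollout status deployment/web-app-scalable &&
  kubectl get pods -l app=web-app &&
  kubectl apply -f hpa.yaml
}

# Usage:
#   deploy_with_autoscaling
```

Chaining with && stops the sequence at the first failure, so the HPA is never created for a Deployment that did not roll out.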

Understanding HPA Status Output

Check the current status of your HPA:

The output shows the current state:

Let's break down each column:

  • REFERENCE: Shows which resource this HPA is controlling (Deployment/web-app-scalable)
  • TARGETS: Most important column — shows current CPU utilization vs. target (0%/70%)
    • 0% is the current utilization (nginx has no traffic yet)
    • 70% is your configured target threshold
  • MINPODS: Minimum replicas configured (2)
  • MAXPODS: Maximum replicas configured (5)
  • REPLICAS: Current replica count (2, matching minReplicas since there's no load)
  • AGE: How long the HPA has existed

If you see <unknown>/70% in the TARGETS column initially, the metrics-server hasn't collected enough data yet. Wait 30–60 seconds and run the command again — the metrics-server collects data every 15 seconds but needs a few cycles to report reliable averages.
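
Rather than rerunning the command by hand, you could poll until the metric appears. The helper below is a sketch under stated assumptions: wait_for_hpa_metrics and the POLL_SECONDS variable are hypothetical names, and the 8 attempts at 15-second intervals (about two minutes) are arbitrary choices.

```shell
# Sketch: poll the HPA until TARGETS no longer shows <unknown>; give up after 8 attempts.
# Usage: wait_for_hpa_metrics [hpa_name]   (defaults to web-hpa)
wait_for_hpa_metrics() {
  local hpa_name=${1:-web-hpa} attempt line
  for attempt in 1 2 3 4 5 6 7 8; do
    # Stop immediately if the HPA cannot be read at all.
    line=$(kubectl get hpa "$hpa_name" --no-headers) || return 1
    case $line in
      *'<unknown>'*) sleep "${POLL_SECONDS:-15}" ;;   # metrics not ready yet
      *) echo "$line"; return 0 ;;                    # metrics available
    esac
  done
  echo "metrics still unknown; check that metrics-server is running" >&2
  return 1
}

# Usage:
#   wait_for_hpa_metrics web-hpa
```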

Detailed HPA Inspection and Monitoring

For detailed information about the HPA's current state and decision-making process:

The output provides extensive configuration and status information:

The Metrics section shows: resource cpu on pods (as a percentage of request): 0% (0) / 70%. The 0% (0) means current utilization is 0% with an absolute value of 0 millicores, compared to the target of 70%.

The Conditions section shows the HPA's operational state using three condition types:

  • AbleToScale: True — HPA is ready to make scaling decisions
  • ScalingActive: True — Metrics are available and HPA is functioning normally
  • ScalingLimited: False — Current replica count is within min/max range and not being constrained

These conditions help you troubleshoot when autoscaling isn't working as expected.
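
When you only need the conditions, for example in a troubleshooting script, a JSONPath query is more compact than reading the full kubectl describe hpa web-hpa output. The sketch below assumes the HPA is named web-hpa; the function name hpa_conditions is hypothetical.

```shell
# Sketch: print each HPA condition as TYPE=STATUS (REASON), one per line.
# Usage: hpa_conditions [hpa_name]   (defaults to web-hpa)
hpa_conditions() {
  local hpa_name=${1:-web-hpa}
  kubectl get hpa "$hpa_name" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status} ({.reason}){"\n"}{end}'
}

# Usage:
#   hpa_conditions web-hpa
```

A ScalingActive=False line is usually the first thing to look for, since it means the HPA cannot read metrics for its target.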

Summary: Self-Adjusting Applications

You've now learned how to implement Horizontal Pod Autoscaling, the final piece in your Kubernetes reliability toolkit. HPA continuously monitors resource utilization and automatically adjusts replica counts to match demand, eliminating the need for manual intervention during traffic spikes or quiet periods. You built a deployment with proper resource requests (the critical prerequisite for HPA), configured an HPA resource with scaling boundaries and CPU utilization targets, and learned how to monitor autoscaling behavior using kubectl commands.

Combined with namespace isolation, resource management, and health probes from previous lessons, you now understand the complete reliability stack that enables production-grade, self-healing, and self-adjusting applications. In the upcoming practice exercises, you'll generate load on your deployment and watch HPA make real-time scaling decisions, cementing your understanding of how automatic scaling responds to actual resource usage patterns.
