In the previous lesson, you learned about the fundamental concepts of Google Cloud Run and its serverless model. You've also practiced deploying Cloud Run services using the gcloud run deploy command with default settings. While those services work perfectly, production workloads often require specific resource allocations and scaling configurations to optimize performance and cost.
In this lesson, you'll take your service configuration skills to the next level by learning about resource allocation and scaling settings. These concepts determine how your containers utilize compute resources and respond to traffic, giving you control over cost optimization and performance. By the end of this lesson, you'll understand how to configure CPU allocation, memory limits, concurrency settings, and scaling parameters that match your application's needs.
Your main goal is to create Cloud Run services with explicit resource and scaling configurations and inspect their settings using gcloud commands. This hands-on approach will help you understand how Cloud Run manages resources and prepare you to make informed decisions about container deployment strategies in real-world scenarios.
When you deploy containers on Cloud Run, you can specify resource allocation settings that determine how much CPU and memory your container instances receive; if you omit them, Cloud Run applies defaults, as in the previous lesson. Think of resource allocation as defining the compute capacity available to each container instance. The resources you allocate affect cost, performance, and response times.
CPU allocation in Cloud Run comes in two modes: always allocated, and allocated only during request processing. When CPU is always allocated, your container has continuous access to CPU resources even when not handling requests, which is ideal for background processing or applications that keep working between requests. When CPU is only allocated during request processing (the default), you only pay for CPU while actively handling requests, making it more cost-effective for request-driven workloads.
Memory limits define the maximum amount of RAM available to each container instance. Cloud Run supports memory allocations from 128 MiB to 32 GiB, depending on your application's needs. Key considerations include:
- Application requirements: Choose memory based on your application's actual needs, including any in-memory caching or data processing.
- Cost efficiency: You pay for the memory you allocate, so right-sizing is important.
- Performance: Insufficient memory leads to container restarts and errors, while excessive allocation wastes resources.
The trade-offs are clear: Higher resource allocations provide better performance and can handle more complex workloads, but increase costs. Lower allocations reduce costs but may limit your application's capabilities. Cloud Run excels at allowing you to fine-tune these settings to match your specific workload characteristics.
Concurrency determines how many simultaneous requests a single container instance can handle. This setting is crucial for optimizing resource utilization and cost. Cloud Run supports concurrency values from 1 to 1,000, with 80 as the default.
Setting concurrency to 1 means each container instance handles only one request at a time, which is ideal for applications that aren't thread-safe or need exclusive access to resources. Higher concurrency values allow multiple requests to be processed simultaneously by the same container instance, reducing the number of instances needed and lowering costs for applications that can safely handle concurrent requests.
It's important to consider the trade-off between throughput and latency. While a higher concurrency setting increases the number of requests a single instance can handle (improving throughput and reducing costs), it can also increase the processing time for each individual request. If an instance is overloaded, requests may queue up and eventually exceed the configured request timeout, leading to errors. Therefore, you must balance concurrency with your application's ability to process requests efficiently to avoid increased latency and timeouts.
Scaling settings control how Cloud Run automatically adjusts the number of container instances based on incoming traffic. These settings include:
- Minimum instances: The number of container instances kept running even with no traffic. Setting a minimum greater than 0 reduces cold start latency but increases costs since you pay for idle instances.
- Maximum instances: The upper limit on container instances, preventing runaway scaling and controlling costs. This also helps you stay within quota limits.
Cloud Run automatically scales between your minimum and maximum based on incoming request volume and the concurrency setting. As a rough rule, if each instance can handle 80 concurrent requests and you receive N requests simultaneously, Cloud Run will scale up to about N / 80 instances, rounded up (assuming no minimum/maximum constraints).
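To make the relationship between concurrency and instance count concrete, here is a small back-of-the-envelope calculation in shell. The figure of 1,000 simultaneous requests is a hypothetical number chosen for illustration, and the result is only an estimate: the real autoscaler also reacts to CPU utilization and to your minimum and maximum settings.

```shell
# Hypothetical traffic figure, chosen only for illustration.
requests=1000      # simultaneous incoming requests (assumed)
concurrency=80     # requests per instance (the Cloud Run default)

# Ceiling division with integer arithmetic: ceil(requests / concurrency)
instances=$(( (requests + concurrency - 1) / concurrency ))

echo "Estimated instances needed: $instances"
```

With these assumed numbers, 12 instances would cover only 960 requests, so the estimate rounds up to 13.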
Now let's create a Cloud Run service with explicit resource and scaling configuration. This approach gives you precise control over how your service utilizes compute resources and responds to traffic patterns.
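The following is a sketch of a deploy command that matches the settings described in this section. It assumes your project and region are already set in your gcloud configuration; if they are not, gcloud may prompt you for them, and it may also ask whether to allow unauthenticated access.

```shell
# Deploy my-service with explicit resource and scaling settings.
# Assumes project and region are already configured in gcloud.
gcloud run deploy my-service \
  --image gcr.io/cloudrun/hello \
  --cpu 1 \
  --memory 512Mi \
  --concurrency 80 \
  --min-instances 0 \
  --max-instances 10
```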
This command creates a service named my-service using the gcr.io/cloudrun/hello container image. The --cpu 1 parameter allocates 1 vCPU to each container instance, while --memory 512Mi sets the memory limit to 512 MiB. The --concurrency 80 setting allows each instance to handle up to 80 simultaneous requests. The scaling parameters --min-instances 0 and --max-instances 10 configure the service to scale from zero (no idle instances) up to a maximum of 10 instances.
When you run this command, gcloud reports deployment progress and, once the revision is ready, prints the revision name and the service URL.
The service is now running with your specified resource configuration. Cloud Run will automatically scale instances between 0 and 10 based on incoming traffic, with each instance capable of handling 80 concurrent requests using 1 vCPU and 512 MiB of memory.
Cloud Run allows you to control when CPU is allocated to your container instances. By default, CPU is throttled and only allocated during request processing. However, you can configure it to be always allocated for applications that need continuous CPU access.
To create a service with always-allocated CPU, use the --no-cpu-throttling flag:
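A minimal sketch of such a command is shown here. The service name my-always-on-service is hypothetical, the resource values are illustrative, and the project and region are assumed to be set in your gcloud configuration.

```shell
# Hypothetical service name; resource values are illustrative.
# --no-cpu-throttling keeps CPU allocated between requests.
gcloud run deploy my-always-on-service \
  --image gcr.io/cloudrun/hello \
  --cpu 1 \
  --memory 512Mi \
  --no-cpu-throttling
```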
The --no-cpu-throttling flag ensures CPU is always allocated, even when the container instance is not processing requests. This is essential for applications that perform background processing, maintain WebSocket connections, or need to respond immediately without CPU throttling. Note that always-allocated CPU increases costs since you pay for CPU time even when idle.
For standard request-driven applications, you can explicitly use the default setting, which allocates CPU only during request processing:
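One way this might look, using a hypothetical service name (my-request-service) and illustrative resource values:

```shell
# Hypothetical service name; resource values are illustrative.
# --cpu-throttling is the default, stated explicitly here for clarity.
gcloud run deploy my-request-service \
  --image gcr.io/cloudrun/hello \
  --cpu 1 \
  --memory 512Mi \
  --cpu-throttling
```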
The --cpu-throttling flag (which is the default behavior) ensures CPU is only allocated during active request processing. This is more cost-effective for applications that spend most of their time idle, waiting for requests. Between requests, the container remains running but with its CPU allocation significantly reduced.
It's important not to confuse the CPU allocation mode with CPU Boost. CPU Boost is a separate feature that helps reduce cold start latency by temporarily increasing CPU allocation during instance startup and for a short period afterward. This allows your application to initialize faster.
- CPU Allocation Mode (--cpu-throttling/--no-cpu-throttling): Determines if CPU is allocated continuously or only during requests.
- CPU Boost (--cpu-boost/--no-cpu-boost): Determines if extra CPU is provided during instance startup.
You can enable CPU Boost for services with either CPU allocation mode to improve their startup performance. For example, to create a request-driven service with faster cold starts:
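A sketch of such a deployment follows. The service name my-boosted-service is hypothetical and the resource values are illustrative; the combination of flags is what matters.

```shell
# Hypothetical service name; resource values are illustrative.
# Request-driven CPU allocation (throttled) plus startup CPU boost.
gcloud run deploy my-boosted-service \
  --image gcr.io/cloudrun/hello \
  --cpu 1 \
  --memory 512Mi \
  --cpu-throttling \
  --cpu-boost
```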
After creating services with different resource configurations, you can inspect and compare them to understand how your settings translate into actual service behavior. The gcloud run services describe command provides detailed information about your service configuration.
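For example, a describe call narrowed to the resource limits might look like the sketch below. The field path in --format is an assumption based on the standard layout of a Cloud Run service definition, so adjust it if your output is structured differently.

```shell
# Show only the CPU and memory limits of the first container.
# The field path is an assumption about the service's YAML layout.
gcloud run services describe my-service \
  --format="yaml(spec.template.spec.containers[0].resources.limits)"
```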
This command returns the resource limits for your service:
Note that CPU is shown in millicores (1000m = 1 vCPU). You can also inspect the concurrency and scaling settings:
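A possible form of that query is sketched here. As before, the field paths are assumptions about the service definition's layout; scaling limits typically appear as annotations on the revision template.

```shell
# Show the per-instance concurrency and the revision template annotations.
# Field paths are assumptions about the service's YAML layout.
gcloud run services describe my-service \
  --format="yaml(spec.template.spec.containerConcurrency, spec.template.metadata.annotations)"
```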
This shows the concurrency setting and annotations that include scaling parameters:
To get a comprehensive view of all configuration settings, use:
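One straightforward option, assuming the same my-service deployed earlier, is to request the full service definition as YAML:

```shell
# Print the complete service configuration in YAML form.
gcloud run services describe my-service --format=yaml
```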
This returns a detailed YAML representation of your service's configuration, including CPU allocation mode, resource limits, concurrency, and scaling settings. When planning deployments, always verify these settings to ensure your service configuration matches your application's requirements for performance, cost, and scalability.
You can modify resource and scaling configurations on existing services using the gcloud run services update command. This is useful when your application requirements change or you want to adjust performance and cost optimization strategies.
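An update command consistent with the changes described in this section could be written as follows, assuming my-service already exists in your configured project and region:

```shell
# Raise CPU, memory, concurrency, and the instance ceiling in one update.
gcloud run services update my-service \
  --cpu 2 \
  --memory 1Gi \
  --concurrency 100 \
  --max-instances 20
```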
This command updates my-service to use 2 vCPUs and 1 GiB of memory per instance, increases concurrency to 100 requests per instance, and raises the maximum instance count to 20. Cloud Run applies the change by creating a new revision and, by default, shifts traffic to it once it is ready, so the update does not cause downtime.
You can also update just the scaling parameters without changing resource allocations:
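A sketch of a scaling-only update, again assuming the existing my-service:

```shell
# Change only the scaling bounds; CPU and memory stay as they are.
gcloud run services update my-service \
  --min-instances 2 \
  --max-instances 50
```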
This sets a minimum of 2 instances (reducing cold start latency) and increases the maximum to 50 instances to handle traffic spikes. Changes take effect immediately for new container instances, and Cloud Run automatically manages the transition.
To change the CPU allocation mode for an existing service to be always allocated:
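Assuming the same my-service, the update might look like this:

```shell
# Switch the existing service to always-allocated CPU.
gcloud run services update my-service --no-cpu-throttling
```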
This enables always-allocated CPU for the service. Remember that this increases costs but improves performance for applications that need continuous CPU access. To revert to the default, throttled behavior, you would use the --cpu-throttling flag.
In this lesson, you've learned how to configure Cloud Run services with specific resource allocations and scaling settings. You've gained hands-on experience creating services with custom CPU, memory, concurrency, and scaling parameters, and learned how to inspect and update these configurations using gcloud commands.
You can now create services that balance cost optimization with performance requirements, using configurations that match your application's resource needs and traffic patterns. You understand how CPU allocation modes, memory limits, concurrency settings, and scaling parameters work together to determine your service's behavior and cost.
In the upcoming practice exercises, you'll apply these concepts by deploying services with different configurations and observing how they respond to various traffic patterns. You'll learn how to monitor resource utilization, optimize configurations based on real-world performance data, and implement traffic management strategies for zero-downtime deployments.
