Automatic Scaling

Runway automatically manages the scaling of your services to ensure optimal performance and resource efficiency. This document explains how automatic scaling works and how you can configure it for your service.

Runway uses two complementary scaling mechanisms to optimize your service:

Horizontal Pod Autoscaler (HPA) automatically adjusts the number of pod replicas based on CPU utilization. When your service experiences high load, HPA creates additional pods to handle the traffic. When load decreases, it removes unnecessary pods.

What you control:

  • Minimum instances: The lowest number of pods (default: 3)
  • Maximum instances: The upper limit for scaling (default: 20)
  • CPU utilization target: The desired CPU usage percentage (default: 70%)

Vertical Pod Autoscaler (VPA) automatically adjusts memory requests and limits for your pods based on actual usage patterns. It observes your service’s memory consumption over time and right-sizes the allocation to prevent waste while ensuring adequate resources.

What you control:

  • Memory limit: The maximum memory your service can use (default: 1Gi)
  • This serves as a safety ceiling for VPA operations
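
For intuition, suppose a pod starts with a 512Mi memory request and a 1Gi limit (illustrative values), and steady-state usage settles around 300Mi. On the next natural pod restart, VPA lowers the memory request toward the observed usage plus a safety margin, and it never raises the allocation above the configured 1Gi ceiling.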

Configure scaling in your .runway/${RUNWAY_SERVICE_ID}/default-values.yaml file:

spec:
  services:
    frontend:
      # Horizontal scaling configuration
      scalability:
        min_instances: 3       # Minimum number of pods
        max_instances: 20      # Maximum number of pods
        cpu_utilization: 70    # Target CPU percentage for scaling
      # Resource configuration (affects vertical scaling)
      resources:
        requests:
          cpu: "500m"          # Initial CPU request
          memory: "512Mi"      # Initial memory request (VPA will adjust)
        limits:
          cpu: "1000m"         # CPU limit
          memory: "1Gi"        # Memory limit (VPA ceiling)

CPU utilization directly correlates with request load. When your service processes more requests, CPU usage increases, triggering HPA to add more pods. This provides immediate capacity for handling traffic spikes.
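
Under the hood, this follows the standard Kubernetes HPA algorithm, which computes the desired pod count from the ratio of observed to target utilization:

  desired_replicas = ceil(current_replicas × current_cpu / target_cpu)

For example, with 4 pods averaging 91% CPU against a 70% target, HPA scales to ceil(4 × 91 / 70) = ceil(5.2) = 6 pods.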

Memory usage is typically stable and workload-specific. VPA optimizes memory allocation to match your service’s actual needs, preventing both under-provisioning (which causes crashes) and over-provisioning (which wastes resources).

The two autoscalers operate on different dimensions:

  • HPA manages the number of pods (scaling out/in)
  • VPA manages the size of each pod’s memory (scaling up/down)

This separation prevents conflicts and ensures both performance and efficiency.

The limit acts as a safety ceiling for VPA operations. VPA will never exceed this limit, so setting it too low may prevent your service from getting needed memory during peak usage.

resources:
  limits:
    memory: "2Gi"  # Generous limit provides headroom for VPA

CPU requests are not adjusted by VPA, and the scheduler reserves whatever CPU you request whether or not it is used, so an oversized request wastes capacity that other services cannot reclaim. Start with a lower value and let HPA handle load spikes by adding pods.

resources:
  requests:
    cpu: "100m"  # Conservative starting point

Set appropriate boundaries based on your service characteristics:

scalability:
  min_instances: 2   # Lower for dev environments
  max_instances: 50  # Higher for critical services

Once configured, the two autoscalers work together as follows:

  1. Initial Deployment: Your service starts with the configured resource requests and minimum instance count.

  2. Memory Adjustment: VPA observes actual memory usage and adjusts the memory request for new pods (during deployments or scaling events). The memory limit you configured acts as a ceiling.

  3. Load-based Scaling: When CPU usage exceeds the target (e.g., 70%), HPA adds more pods. When CPU usage drops, HPA gradually removes excess pods.

  4. Continuous Optimization: The system continuously monitors and adjusts, ensuring your service has the resources it needs without waste.

VPA (memory right-sizing):

  • VPA only adjusts memory when pods restart naturally (deployments, scaling events)
  • Your configured memory limit is never exceeded
  • Adjustments are based on observed usage patterns over time

HPA (pod count):

  • New pods are added when average CPU exceeds the target
  • Pods are removed after sustained low CPU usage
  • Minimum instance count is always maintained

Safety boundaries:

  • Memory limits prevent runaway memory growth
  • Minimum instances ensure availability
  • Maximum instances prevent excessive scaling

If your service uses significant memory:

  1. Set a generous memory limit (e.g., 4Gi)
  2. Let VPA find the optimal request value
  3. Monitor for Out-of-Memory (OOM) errors
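
A sketch of that configuration (the 4Gi value is illustrative):

resources:
  limits:
    memory: "4Gi"  # Generous ceiling; VPA right-sizes the request beneath it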

If your service is CPU-bound:

  1. Set an appropriate CPU utilization target (e.g., 60% for latency-sensitive services)
  2. Ensure adequate maximum instances
  3. Consider the trade-off between pod count and pod size
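
For example, a latency-sensitive service might scale out earlier and leave more headroom (values illustrative):

scalability:
  cpu_utilization: 60   # Scale out before pods saturate
  max_instances: 40     # Enough room for peak traffic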

For services with varying load:

  1. Set minimum instances to handle baseline traffic
  2. Set maximum instances to handle peak load
  3. HPA will automatically scale between these boundaries
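
For example (values illustrative):

scalability:
  min_instances: 4    # Enough to absorb baseline traffic
  max_instances: 30   # Covers observed peaks with headroom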

If you suspect scaling issues:

  1. Check pod count: Verify HPA is scaling within configured boundaries
  2. Check memory usage: Ensure pods aren’t hitting memory limits
  3. Contact Runway team: Escalate to #g_runway Slack channel for assistance

The Runway team can:

  • Review VPA recommendations
  • Adjust scaling parameters
  • Disable VPA in case of an active incident

This scaling strategy follows Kubernetes best practices and is similar to Google’s MultidimPodAutoscaler used in GKE Autopilot. The pattern of using HPA for CPU and VPA for memory is well-established in production environments.

Manual resource allocation typically results in significant over-provisioning. The 2025 Kubernetes Cost Benchmark Report found that across 4000 Kubernetes clusters, the average CPU utilization is only 10% and the average memory utilization is only 23%. This means organizations are paying for resources that sit idle 77-90% of the time. Automatic scaling ensures resources are used efficiently across the cluster.

Rather than requiring teams to continuously monitor and adjust resource allocations, Runway handles this automatically. This reduces operational overhead while improving resource utilization.

The configuration uses conservative settings to prioritize stability while achieving efficiency gains.

  • VPA only adjusts memory during natural pod restarts – never evicting running pods.
  • The memory limit provides a safety ceiling that prevents runaway growth, e.g. due to a memory leak.
  • HPA responds gradually to load changes rather than aggressively scaling, preventing unnecessary churn in your pod count.

This approach ensures your service remains stable while the platform optimizes resource usage behind the scenes.

Runway’s automatic scaling provides:

  • Performance: Automatic response to load changes via HPA
  • Efficiency: Right-sized memory allocations via VPA
  • Simplicity: No manual tuning required
  • Safety: Conservative adjustments with configured boundaries

Focus on setting appropriate memory limits (be generous) and CPU requests (be conservative), and let Runway handle the rest. For any scaling-related issues, contact the Runway team in #g_runway.