Automatic Scaling
Automatic Scaling in Runway
Runway automatically manages the scaling of your services to ensure optimal performance and resource efficiency. This document explains how automatic scaling works and how you can configure it for your service.
How Runway Handles Scaling
Runway uses two complementary scaling mechanisms that work together to optimize your service:
Horizontal Scaling (Pod Count)
Horizontal Pod Autoscaler (HPA) automatically adjusts the number of pod replicas based on CPU utilization. When your service experiences high load, HPA creates additional pods to handle the traffic. When load decreases, it removes unnecessary pods.
What you control:
- Minimum instances: The lowest number of pods (default: 3)
- Maximum instances: The upper limit for scaling (default: 20)
- CPU utilization target: The desired CPU usage percentage (default: 70%)
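To get a feel for the numbers, the standard Kubernetes HPA calculation is desiredReplicas = ceil(currentReplicas * currentCPU / targetCPU). The sketch below walks through that arithmetic against the default 70% target; the utilization figures are purely illustrative, and in practice Kubernetes also applies a scale-down stabilization window, so scale-in is more gradual than the raw formula suggests.

# Illustrative HPA arithmetic, not Runway configuration
# Scale out: 3 pods averaging 90% CPU against a 70% target
#   desired = ceil(3 * 90 / 70) = ceil(3.86) = 4 pods
# Scale in: 4 pods averaging 35% CPU against a 70% target
#   desired = ceil(4 * 35 / 70) = 2 pods, clamped up to min_instances
scalability:
  min_instances: 3
  max_instances: 20
  cpu_utilization: 70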
Vertical Scaling (Resource Allocation)
Vertical Pod Autoscaler (VPA) automatically adjusts memory requests and limits for your pods based on actual usage patterns. It observes your service’s memory consumption over time and right-sizes the allocation to prevent waste while ensuring adequate resources.
What you control:
- Memory limit: The maximum memory your service can use (default: 1Gi)
- This serves as a safety ceiling for VPA operations
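For instance (the numbers here are assumptions, not a guarantee of what VPA will choose), a service configured with a 512Mi request and a 1Gi limit that settles around 300Mi of steady usage could end up right-sized roughly like this:

# Illustrative effect of VPA right-sizing; values are assumptions
resources:
  requests:
    memory: "512Mi"  # what you configured initially
  limits:
    memory: "1Gi"    # ceiling that VPA never exceeds
# After observing steady usage near ~300Mi, VPA may lower the effective
# request on newly created pods to roughly that level, while the 1Gi
# limit stays in place as the safety ceiling.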
Configuration
Configure scaling in your .runway/${RUNWAY_SERVICE_ID}/default-values.yaml file:
spec:
  services:
    frontend:
      # Horizontal scaling configuration
      scalability:
        min_instances: 3     # Minimum number of pods
        max_instances: 20    # Maximum number of pods
        cpu_utilization: 70  # Target CPU percentage for scaling

      # Resource configuration (affects vertical scaling)
      resources:
        requests:
          cpu: "500m"      # Initial CPU request
          memory: "512Mi"  # Initial memory request (VPA will adjust)
        limits:
          cpu: "1000m"     # CPU limit
          memory: "1Gi"    # Memory limit (VPA ceiling)
Key Concepts
CPU-based Horizontal Scaling
CPU utilization directly correlates with request load. When your service processes more requests, CPU usage increases, triggering HPA to add more pods. This provides immediate capacity for handling traffic spikes.
Memory-based Vertical Scaling
Memory usage is typically stable and workload-specific. VPA optimizes memory allocation to match your service’s actual needs, preventing both under-provisioning (which causes crashes) and over-provisioning (which wastes resources).
Working Together
The two autoscalers operate on different dimensions:
- HPA manages the number of pods (scaling out/in)
- VPA manages the size of each pod’s memory (scaling up/down)
This separation prevents conflicts and ensures both performance and efficiency.
Configuration Guidelines
Memory Limits
The limit acts as a safety ceiling for VPA operations. VPA will never exceed this limit, so setting it too low may prevent your service from getting needed memory during peak usage.
resources:
  limits:
    memory: "2Gi"  # Generous limit provides headroom for VPA
CPU Requests
Unlike memory, CPU requests are not adjusted automatically, so unused CPU reservations cannot be reclaimed by other services. Start with a lower value and let HPA handle load spikes by adding pods.
resources:
  requests:
    cpu: "100m"  # Conservative starting point
Scaling Boundaries
Set appropriate boundaries based on your service characteristics:
scalability:
  min_instances: 2   # Lower for dev environments
  max_instances: 50  # Higher for critical services
How It Works in Practice
- Initial Deployment: Your service starts with the configured resource requests and minimum instance count.
- Memory Adjustment: VPA observes actual memory usage and adjusts the memory request for new pods (during deployments or scaling events). The memory limit you configured acts as a ceiling.
- Load-based Scaling: When CPU usage exceeds the target (e.g., 70%), HPA adds more pods. When CPU usage drops, HPA gradually removes excess pods.
- Continuous Optimization: The system continuously monitors and adjusts, ensuring your service has the resources it needs without waste.
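Putting these pieces together, the annotated sketch below notes which mechanism reads each field; the values are illustrative, not recommendations.

spec:
  services:
    frontend:
      scalability:
        min_instances: 3     # starting and minimum pod count (HPA floor)
        max_instances: 20    # HPA ceiling during load spikes
        cpu_utilization: 70  # HPA scale-out trigger
      resources:
        requests:
          memory: "512Mi"    # initial request; VPA adjusts this on new pods
        limits:
          memory: "1Gi"      # ceiling VPA will not exceed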
Important Behaviors
Memory Adjustments
- VPA only adjusts memory when pods restart naturally (deployments, scaling events)
- Your configured memory limit is never exceeded
- Adjustments are based on observed usage patterns over time
CPU Scaling
- New pods are added when average CPU exceeds the target
- Pods are removed after sustained low CPU usage
- Minimum instance count is always maintained
Protection Mechanisms
- Memory limits prevent runaway memory growth
- Minimum instances ensure availability
- Maximum instances prevent excessive scaling
Common Scenarios
Memory-Intensive Services
If your service uses significant memory:
- Set a generous memory limit (e.g., 4Gi)
- Let VPA find the optimal request value
- Monitor for Out-of-Memory (OOM) errors
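A minimal sketch of that guidance, assuming a service named frontend and an illustrative 4Gi ceiling:

spec:
  services:
    frontend:
      resources:
        limits:
          memory: "4Gi"  # generous ceiling; VPA right-sizes the request beneath it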
CPU-Intensive Services
If your service is CPU-bound:
- Set an appropriate CPU utilization target (e.g., 60% for latency-sensitive services)
- Ensure adequate maximum instances
- Consider the trade-off between pod count and pod size
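For example (illustrative values, not recommendations), a latency-sensitive service might scale out earlier and allow more pods:

scalability:
  cpu_utilization: 60  # scale out earlier for latency-sensitive work
  max_instances: 40    # headroom for peaks; size to your own traffic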
Variable Load Patterns
For services with varying load:
- Set minimum instances to handle baseline traffic
- Set maximum instances to handle peak load
- HPA will automatically scale between these boundaries
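A hedged sketch with illustrative boundaries:

scalability:
  min_instances: 3   # enough pods for baseline traffic
  max_instances: 30  # enough headroom for expected peaks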
Troubleshooting
If you suspect scaling issues:
- Check pod count: Verify HPA is scaling within configured boundaries
- Check memory usage: Ensure pods aren’t hitting memory limits
- Contact Runway team: Escalate to the #g_runway Slack channel for assistance
The Runway team can:
- Review VPA recommendations
- Adjust scaling parameters
- Disable VPA in case of an active incident
Background: Why This Approach?
Industry Best Practices
This scaling strategy follows Kubernetes best practices and is similar to Google’s MultidimPodAutoscaler used in GKE Autopilot. The pattern of using HPA for CPU and VPA for memory is well-established in production environments.
Resource Efficiency
Manual resource allocation typically results in significant over-provisioning. The 2025 Kubernetes Cost Benchmark Report found that across 4,000 Kubernetes clusters, the average CPU utilization is only 10% and the average memory utilization is only 23%. This means organizations are paying for resources that sit idle 77-90% of the time. Automatic scaling ensures resources are used efficiently across the cluster.
Operational Simplicity
Rather than requiring teams to continuously monitor and adjust resource allocations, Runway handles this automatically. This reduces operational overhead while improving resource utilization.
Safety First
The configuration uses conservative settings to prioritize stability while achieving efficiency gains.
- VPA only adjusts memory during natural pod restarts – never evicting running pods.
- The memory limit provides a safety ceiling that prevents runaway growth, e.g. due to a memory leak.
- HPA responds gradually to load changes rather than aggressively scaling, preventing unnecessary churn in your pod count.
This approach ensures your service remains stable while the platform optimizes resource usage behind the scenes.
Summary
Runway’s automatic scaling provides:
- Performance: Automatic response to load changes via HPA
- Efficiency: Right-sized memory allocations via VPA
- Simplicity: No manual tuning required
- Safety: Conservative adjustments with configured boundaries
Focus on setting appropriate memory limits (be generous) and CPU requests (be conservative), and let Runway handle the rest.
For any scaling-related issues, contact the Runway team in #g_runway.