Automatic Scaling

Runway automatically manages the scaling of your services to ensure optimal performance and resource efficiency. This document explains how automatic scaling works and how you can configure it for your service.

Runway uses two complementary scaling mechanisms to optimize your service:

Horizontal Pod Autoscaler (HPA) automatically adjusts the number of pod replicas based on CPU utilization. When your service experiences high load, HPA creates additional pods to handle the traffic. When load decreases, it removes unnecessary pods.

What you control:

  • Minimum instances: The lowest number of pods (default: 3)
  • Maximum instances: The upper limit for scaling (default: 20)
  • CPU utilization target: The desired CPU usage percentage (default: 70%)

Vertical Pod Autoscaler (VPA) automatically adjusts memory requests and limits for your pods based on actual usage patterns. It observes your service’s memory consumption over time and right-sizes the allocation to prevent waste while ensuring adequate resources.

What you control:

  • Memory limit: The maximum memory your service can use (default: 1Gi)
  • This serves as a safety ceiling for VPA operations
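
For intuition, suppose a pod starts with a 512Mi memory request and a 1Gi limit (illustrative values), and steady-state usage settles around 300Mi. On the next natural pod restart, VPA lowers the memory request toward the observed usage plus a safety margin, and it never raises the allocation above the configured 1Gi ceiling.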

Configure scaling in your .runway/${RUNWAY_SERVICE_ID}/default-values.yaml file:

spec:
  services:
    frontend:
      # Horizontal scaling configuration
      scalability:
        min_instances: 3       # Minimum number of pods
        max_instances: 20      # Maximum number of pods
        cpu_utilization: 70    # Target CPU percentage for scaling
      # Resource configuration (affects vertical scaling)
      resources:
        requests:
          cpu: "500m"          # Initial CPU request
          memory: "512Mi"      # Initial memory request (VPA will adjust)
        limits:
          cpu: "1000m"         # CPU limit
          memory: "1Gi"        # Memory limit (VPA ceiling)

CPU utilization directly correlates with request load. When your service processes more requests, CPU usage increases, triggering HPA to add more pods. This provides immediate capacity for handling traffic spikes.
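
Under the hood, this follows the standard Kubernetes HPA algorithm, which computes the desired pod count from the ratio of observed to target utilization:

  desired_replicas = ceil(current_replicas × current_cpu / target_cpu)

For example, with 4 pods averaging 91% CPU against a 70% target, HPA scales to ceil(4 × 91 / 70) = ceil(5.2) = 6 pods.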

Memory usage is typically stable and workload-specific. VPA optimizes memory allocation to match your service’s actual needs, preventing both under-provisioning (which causes crashes) and over-provisioning (which wastes resources).

The two autoscalers operate on different dimensions:

  • HPA manages the number of pods (scaling out/in)
  • VPA manages the size of each pod’s memory (scaling up/down)

This separation prevents conflicts and ensures both performance and efficiency.

The limit acts as a safety ceiling for VPA operations. VPA will never exceed this limit, so setting it too low may prevent your service from getting needed memory during peak usage.

resources:
  limits:
    memory: "2Gi"  # Generous limit provides headroom for VPA

CPU requests are not adjusted by VPA, and the scheduler reserves whatever CPU you request whether or not it is used, so an oversized request wastes capacity that other services cannot reclaim. Start with a lower value and let HPA handle load spikes by adding pods.

resources:
  requests:
    cpu: "100m"  # Conservative starting point

Set appropriate boundaries based on your service characteristics:

scalability:
  min_instances: 2   # Lower for dev environments
  max_instances: 50  # Higher for critical services

Once configured, the two autoscalers work together as follows:

  1. Initial Deployment: Your service starts with the configured resource requests and minimum instance count.

  2. Memory Adjustment: VPA observes actual memory usage and adjusts the memory request for new pods (during deployments or scaling events). The memory limit you configured acts as a ceiling.

  3. Load-based Scaling: When CPU usage exceeds the target (e.g., 70%), HPA adds more pods. When CPU usage drops, HPA gradually removes excess pods.

  4. Continuous Optimization: The system continuously monitors and adjusts, ensuring your service has the resources it needs without waste.

VPA (memory right-sizing):

  • VPA only adjusts memory when pods restart naturally (deployments, scaling events)
  • Your configured memory limit is never exceeded
  • Adjustments are based on observed usage patterns over time

HPA (pod count):

  • New pods are added when average CPU exceeds the target
  • Pods are removed after sustained low CPU usage
  • Minimum instance count is always maintained

Safety boundaries:

  • Memory limits prevent runaway memory growth
  • Minimum instances ensure availability
  • Maximum instances prevent excessive scaling

If your service uses significant memory:

  1. Set a generous memory limit (e.g., 4Gi)
  2. Let VPA find the optimal request value
  3. Monitor for Out-of-Memory (OOM) errors
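
A sketch of that configuration (the 4Gi value is illustrative):

resources:
  limits:
    memory: "4Gi"  # Generous ceiling; VPA right-sizes the request beneath it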

If your service is CPU-bound:

  1. Set an appropriate CPU utilization target (e.g., 60% for latency-sensitive services)
  2. Ensure adequate maximum instances
  3. Consider the trade-off between pod count and pod size
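
For example, a latency-sensitive service might scale out earlier and leave more headroom (values illustrative):

scalability:
  cpu_utilization: 60   # Scale out before pods saturate
  max_instances: 40     # Enough room for peak traffic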

For services with varying load:

  1. Set minimum instances to handle baseline traffic
  2. Set maximum instances to handle peak load
  3. HPA will automatically scale between these boundaries
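
For example (values illustrative):

scalability:
  min_instances: 4    # Enough to absorb baseline traffic
  max_instances: 30   # Covers observed peaks with headroom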

If you suspect scaling issues:

  1. Check pod count: Verify HPA is scaling within configured boundaries
  2. Check memory usage: Ensure pods aren’t hitting memory limits
  3. Contact Runway team: Escalate to #g_runway Slack channel for assistance

The Runway team can:

  • Review VPA recommendations
  • Adjust scaling parameters
  • Disable VPA in case of an active incident

This scaling strategy follows Kubernetes best practices and is similar to Google’s MultidimPodAutoscaler used in GKE Autopilot. The pattern of using HPA for CPU and VPA for memory is well-established in production environments.

Manual resource allocation typically results in significant over-provisioning. The 2025 Kubernetes Cost Benchmark Report found that across 4000 Kubernetes clusters, the average CPU utilization is only 10% and the average memory utilization is only 23%. This means organizations are paying for resources that sit idle 77-90% of the time. Automatic scaling ensures resources are used efficiently across the cluster.

Rather than requiring teams to continuously monitor and adjust resource allocations, Runway handles this automatically. This reduces operational overhead while improving resource utilization.

The configuration uses conservative settings to prioritize stability while achieving efficiency gains.

  • VPA only adjusts memory during natural pod restarts – never evicting running pods.
  • The memory limit provides a safety ceiling that prevents runaway growth, e.g. due to a memory leak.
  • HPA responds gradually to load changes rather than aggressively scaling, preventing unnecessary churn in your pod count.

This approach ensures your service remains stable while the platform optimizes resource usage behind the scenes.

Runway’s automatic scaling provides:

  • Performance: Automatic response to load changes via HPA
  • Efficiency: Right-sized memory allocations via VPA
  • Simplicity: No manual tuning required
  • Safety: Conservative adjustments with configured boundaries

Focus on setting appropriate memory limits (be generous) and CPU requests (be conservative), and let Runway handle the rest. For any scaling-related issues, contact the Runway team in #g_runway.