Runway Observability
Runway supports observability for a service by integrating with the monitoring stack. This includes both service-level metrics and load balancer observability across AWS, GKE, and Cloud Run environments.
Load Balancer Observability
Runway provides unified load balancer observability across all cloud environments using normalized runway_lb_* metrics.
Architecture
The observability pipeline uses provider-native exporters with OpenTelemetry normalization:
GCP (GKE):
- Stackdriver Exporter deployed to designated clusters via ArgoCD
- Collects metrics from Cloud Load Balancing API
- OTel Gateway normalizes to the runway_lb_* schema
AWS:
- CloudWatch Exporter deployed to EKS clusters via ArgoCD
- Collects metrics from CloudWatch API
- OTel Gateway normalizes to the runway_lb_* schema
All metrics are exported to Mimir with X-Scope-OrgID: runway.
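Because the metrics are stored under the runway tenant, a client that reads them back from Mimir would presumably need to send the same X-Scope-OrgID header. The following is a minimal sketch of building such a request in Go. The base URL is hypothetical, and the /prometheus/api/v1/query path is Mimir's default Prometheus-compatible query endpoint, assumed here rather than confirmed for Runway's deployment.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// tenantID is the Mimir tenant named in the text above.
const tenantID = "runway"

// newInstantQuery builds (but does not send) an instant-query request.
// baseURL is a hypothetical Mimir address. The query path is Mimir's
// default Prometheus-compatible endpoint and is an assumption.
func newInstantQuery(baseURL, promQL string) (*http.Request, error) {
	params := url.Values{}
	params.Set("query", promQL)
	endpoint := strings.TrimRight(baseURL, "/") + "/prometheus/api/v1/query?" + params.Encode()

	req, err := http.NewRequest(http.MethodGet, endpoint, nil)
	if err != nil {
		return nil, err
	}
	// Select the tenant the metrics were written under.
	req.Header.Set("X-Scope-OrgID", tenantID)
	return req, nil
}

func main() {
	req, err := newInstantQuery("https://mimir.example.internal", `sum(rate(runway_lb_request_count[5m]))`)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path, req.Header.Get("X-Scope-OrgID"))
}
```

In Grafana this header is normally configured on the datasource, so a sketch like this is only relevant for ad hoc tooling.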
Available Metrics
Core Metrics (All Runtimes)
| Metric | Description | Labels |
|---|---|---|
| runway_lb_request_count | Total requests to load balancer | runtime, env, load_balancer (AWS) / forwarding_rule_name (GCP) |
| runway_lb_backend_latency_milliseconds | Backend response time | runtime, env, load_balancer (AWS) / forwarding_rule_name (GCP), statistic (AWS only: average/minimum/maximum) |
GKE-Specific Metrics
| Metric | Description | Labels |
|---|---|---|
| runway_lb_backend_request_count | Requests reaching backends | runtime, env, forwarding_rule_name |
| runway_lb_total_latency_milliseconds | End-to-end latency (proxy to client) | runtime, env, forwarding_rule_name |
AWS-Specific Metrics
| Metric | Description | Labels |
|---|---|---|
| runway_lb_response_code_count | Requests by HTTP status class | runtime, env, load_balancer, response_code_class (2xx/4xx/5xx) |
| runway_lb_backend_latency_milliseconds | Backend latency with statistics | runtime, env, load_balancer, statistic (average/minimum/maximum) |
Label Reference
| Label | Values | Description |
|---|---|---|
| runtime | eks, gke | Cloud environment |
| env | production, staging (GKE/EKS) / gprd, gstg (Cloud Run) | Runway environment |
| load_balancer | string | AWS ALB/NLB name (AWS only) |
| forwarding_rule_name | string | GCP forwarding rule name (GKE/CloudRun) |
| statistic | average, minimum, maximum | Latency statistic (AWS only) |
| response_code_class | 2xx, 4xx, 5xx | HTTP status code class (AWS only) |
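Since the label that identifies a load balancer differs by runtime, tooling that builds selectors has to pick the right one. The sketch below shows one way to do that in Go for the two runtimes listed in the table. The function names and the example-lb name are hypothetical.

```go
package main

import "fmt"

// lbLabelFor returns the label that identifies a load balancer for a
// runtime, following the label reference table: load_balancer on EKS,
// forwarding_rule_name on GKE.
func lbLabelFor(runtime string) (string, error) {
	switch runtime {
	case "eks":
		return "load_balancer", nil
	case "gke":
		return "forwarding_rule_name", nil
	default:
		return "", fmt.Errorf("unknown runtime %q", runtime)
	}
}

// requestCountSelector builds a PromQL selector for runway_lb_request_count.
// lbName is a hypothetical load balancer or forwarding rule name. %q yields
// a double-quoted string, which is valid PromQL for plain ASCII names.
func requestCountSelector(runtime, env, lbName string) (string, error) {
	label, err := lbLabelFor(runtime)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("runway_lb_request_count{runtime=%q, env=%q, %s=%q}",
		runtime, env, label, lbName), nil
}

func main() {
	for _, rt := range []string{"eks", "gke"} {
		sel, err := requestCountSelector(rt, "production", "example-lb")
		if err != nil {
			panic(err)
		}
		fmt.Println(sel)
	}
}
```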
Query Examples
Cross-Cloud Queries
Request rate by runtime:
    sum by (runtime) (rate(runway_lb_request_count{env="production"}[5m]))
Backend latency p99 across all clouds:
    histogram_quantile(
      0.99,
      sum by (le, runtime) (
        rate(runway_lb_backend_latency_milliseconds_bucket{env="production"}[5m])
      )
    )
Request drop rate by runtime (GKE only - EKS does not have backend_request_count):
    sum by (runtime) (
      rate(runway_lb_request_count{env="production", runtime="gke"}[5m])
    )
    -
    sum by (runtime) (
      rate(runway_lb_backend_request_count{env="production", runtime="gke"}[5m])
    )
AWS-Specific Queries
Request rate by HTTP status class:
    sum by (response_code_class) (
      rate(runway_lb_response_code_count{runtime="eks", env="production"}[5m])
    )
Backend latency statistics:
    runway_lb_backend_latency_milliseconds{runtime="eks", env="production", statistic=~"average|maximum"}
GKE-Specific Queries
Total latency p99 (GKE):
    histogram_quantile(
      0.99,
      sum by (le, forwarding_rule_name) (
        rate(runway_lb_total_latency_milliseconds_bucket{runtime="gke", env="production"}[5m])
      )
    )
Dashboards
The following informational dashboards are available in Grafana for collectively viewing load balancer stats across all Runway services and runtimes:
- Runway Load Balancer Metrics - Main - Unified cross-cloud dashboard with runtime comparison
- Runway Load Balancer Metrics - EKS - AWS ALB/NLB specific metrics
- Runway Load Balancer Metrics - GKE - GKE load balancer metrics
- Runway Load Balancer Metrics - CloudRun - Cloud Run load balancer metrics
Service Observability (Kubernetes)
Service-level metrics for Kubernetes (GKE and EKS) services are collected via OpenTelemetry collectors deployed to each cluster.
Dashboards & Alerts
Runway Kubernetes services can be integrated with the runbooks observability stack to get the following out of the box:
- Service overview Grafana dashboard — apdex, error rate, RPS, saturation panels
- SLO violation alerts — apdex, error rate, and traffic cessation
- Kubernetes saturation alerts — CPU, memory, HPA utilization
For full background on the metrics catalog, refer to the metrics-catalog README.
Step 1 — Add a service catalog entry
Add your service to services/service-catalog.yml in the runbooks repository. Ensure primary_grafana_dashboard points to <type>-main/<type>-overview.
Step 2 — Create a metrics catalog entry
Create metrics-catalog/services/my-service-gke.jsonnet using runway-k8s-archetype:
    local k8sArchetype = import 'service-archetypes/runway-k8s-archetype.libsonnet';
    local metricsCatalog = import 'servicemetrics/metrics.libsonnet';

    metricsCatalog.serviceDefinition(
      k8sArchetype(
        type='my-service-gke',  // must match your runway_service_id
        team='my_team',
        featureCategory='my_feature_category',
        runtime='gke',
      )
    )
Step 3 — Register in all.jsonnet
Add an import to metrics-catalog/services/all.jsonnet:
    import 'my-service-gke.jsonnet',
Step 4 — Generate and commit
    make generate
Commit all generated files. After the MR is merged, your dashboard will be available at https://dashboards.gitlab.net/d/my-service-gke-main.
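The naming convention in the steps above can be summarized in a few lines of Go. This is only an illustration of how the service type maps to the primary_grafana_dashboard value and the dashboard URL. The helper names are hypothetical and are not part of the runbooks tooling.

```go
package main

import "fmt"

// dashboardBase is the Grafana address that appears in step 4.
const dashboardBase = "https://dashboards.gitlab.net/d/"

// primaryDashboard returns the <type>-main/<type>-overview value expected
// for primary_grafana_dashboard in the service catalog entry.
func primaryDashboard(serviceType string) string {
	return fmt.Sprintf("%s-main/%s-overview", serviceType, serviceType)
}

// dashboardURL returns the URL of the generated overview dashboard.
func dashboardURL(serviceType string) string {
	return dashboardBase + serviceType + "-main"
}

func main() {
	fmt.Println(primaryDashboard("my-service-gke"))
	fmt.Println(dashboardURL("my-service-gke"))
}
```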
Configuration Reference
| Parameter | Description | Default |
|---|---|---|
| type | Service type; must match your runway_service_id | required |
| team | Owning team for alert routing (see valid teams) | required |
| featureCategory | GitLab feature category for alert routing | not_owned |
| runtime | gke or both (EKS support is TBD) | both |
| apdexScore | Apdex SLO threshold (ratio of requests meeting latency target) | 0.999 |
| errorRatio | Error SLO threshold (ratio of requests completing without error) | 0.999 |
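To make the two thresholds concrete, the sketch below compares observed ratios against the 0.999 defaults. It is a single-window simplification with hypothetical request counts. The alerting rules generated by the metrics catalog evaluate these ratios over time and are not reproduced here. Treating an empty window as healthy is also an assumption.

```go
package main

import "fmt"

// Default thresholds from the configuration reference.
const (
	defaultApdexScore = 0.999
	defaultErrorRatio = 0.999
)

// ratio returns good/total. An empty window is treated as fully healthy,
// which is an assumption made for this sketch.
func ratio(good, total uint64) float64 {
	if total == 0 {
		return 1
	}
	return float64(good) / float64(total)
}

// meetsSLO reports whether the observed ratio is at or above the threshold.
func meetsSLO(good, total uint64, threshold float64) bool {
	return ratio(good, total) >= threshold
}

func main() {
	// Hypothetical window of 100000 requests: 99950 met the latency target
	// and 99850 completed without error.
	fmt.Println("apdex ok:", meetsSLO(99950, 100000, defaultApdexScore))
	fmt.Println("errors ok:", meetsSLO(99850, 100000, defaultErrorRatio))
}
```

With the default of 0.999, at most one request in a thousand may miss the latency target or fail before the corresponding SLO is considered violated.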
Frontend LB Selector (lbSelector)
The archetype automatically generates a frontend LB SLI using a url_map_name regex derived from your service type:
    gkegw1-[^-]+-<type>-.*
This works for most services. If it does not match, find your exact url_map_name values by querying the Mimir - Runway datasource in Grafana Explore:
    count by (url_map_name) (runway_lb_request_count{runtime="gke"})
Then override lbSelector explicitly:
    k8sArchetype(
      type='my-service-gke',
      team='my_team',
      runtime='gke',
      lbSelector={
        url_map_name: {
          oneOf: [
            'gkegw1-l52v-my-service-gke-...',  // staging
            'gkegw1-ltbu-my-service-gke-...',  // production
          ],
        },
      },
    )
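To check whether the default url_map_name pattern would match a given name before reaching for an override, the pattern can be reproduced locally. The sketch below builds gkegw1-[^-]+-<type>-.* for a service type using Go's regexp package. The pattern is anchored on the assumption that it is applied as a PromQL regex matcher, which is always fully anchored. The url_map_name values with the abc123 suffix are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// defaultURLMapPattern builds the default pattern gkegw1-[^-]+-<type>-.*
// for a service type. QuoteMeta guards against regex metacharacters in the
// type. The ^ and $ anchors mirror PromQL matcher semantics (an assumption
// about how the archetype applies the pattern).
func defaultURLMapPattern(serviceType string) (*regexp.Regexp, error) {
	return regexp.Compile("^gkegw1-[^-]+-" + regexp.QuoteMeta(serviceType) + "-.*$")
}

func main() {
	re, err := defaultURLMapPattern("my-service-gke")
	if err != nil {
		panic(err)
	}
	// Hypothetical url_map_name values.
	for _, name := range []string{
		"gkegw1-l52v-my-service-gke-abc123",
		"gkegw1-l52v-other-service-abc123",
	} {
		fmt.Println(name, re.MatchString(name))
	}
}
```

One consequence of the trailing .* is that a shorter type such as my-service also matches the url maps of my-service-gke, which is one situation where an explicit lbSelector may be the safer choice.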
Custom Metrics
Runway supports exposing custom application metrics via Prometheus, collected by OTel and pushed to Mimir. These can then be referenced as custom SLIs in the metrics catalog alongside the default LB-level SLIs.
Configuration
Configure Prometheus scrape targets in your default-values.yaml:
    spec:
      observability:
        scrape_targets:
          - "localhost:8082"
This creates a Kubernetes Service named metrics with ports matching your targets. The cluster-level OpenTelemetry Gateway collector automatically discovers and scrapes services with ports named metrics-*.
Best Practice: Use Labkit
For Go applications, we recommend using Labkit to expose Prometheus metrics. Labkit provides:
- labkit/metrics - HTTP handler instrumentation with request count and latency histograms
- labkit/monitoring - Monitoring server that serves the /metrics endpoint
See the example-service frontend for a complete implementation:
    import (
        "gitlab.com/gitlab-org/labkit/metrics"
        "gitlab.com/gitlab-org/labkit/monitoring"
    )

    func serveMetrics() {
        opts := []monitoring.Option{
            monitoring.WithListenerAddress(":8082"),
        }
        monitoring.Start(opts...)
    }
Accessing Metrics
Custom metrics are available in the Mimir - Runway datasource in Grafana.
Example query:
    http_requests_total{env="production", kubernetes_namespace="my-service-gke"}
Using Custom Metrics as SLIs
Once your metrics are available in Mimir, you can add them as SLIs in the metrics catalog:
    local k8sArchetype = import 'service-archetypes/runway-k8s-archetype.libsonnet';
    local metricsCatalog = import 'servicemetrics/metrics.libsonnet';
    local rateMetric = metricsCatalog.rateMetric;

    local type = 'my-service-gke';
    local featureCategory = 'my_feature_category';

    metricsCatalog.serviceDefinition(
      k8sArchetype(
        type=type,
        team='my_team',
        featureCategory=featureCategory,
        runtime='gke',
      )
      // Custom application-level SLIs on top of archetype defaults
      {
        serviceLevelIndicators+: {
          my_server: {
            userImpacting: true,
            featureCategory: featureCategory,
            requestRate: rateMetric(
              counter='http_requests_total',
              selector={ type: type },
            ),
            errorRate: rateMetric(
              counter='http_requests_total',
              selector={ type: type, status: '5xx' },
            ),
            significantLabels: ['handler', 'status'],
          },
        },
      }
    )
See the metrics-catalog README for more details on defining SLIs.
Available Labels
| Label | Description | Example |
|---|---|---|
| env | Environment | staging, production |
| cloud_provider | Cloud provider | gcp, aws |
| cloud_runtime | Kubernetes runtime | gke, eks |
| cloud_region | Deployment region | us-east1, us-east-1 |
| k8s_cluster_name | Cluster name | runway-gke-gprd-us-east1 |
| kubernetes_namespace | Kubernetes namespace | my-service-gke |
Service Observability (Cloud Run)
Service-level observability via the metrics catalog is supported for Cloud Run services.
Dashboards & Alerts
To get a service overview dashboard and SLO alerts:
- Create a new entry in the service catalog in the expected format.
- Create a new entry in the metrics catalog:
      local runwayArchetype = import 'service-archetypes/runway-archetype.libsonnet';
      local metricsCatalog = import 'servicemetrics/metrics.libsonnet';

      metricsCatalog.serviceDefinition(
        runwayArchetype(
          type='my_service',
          team='my_team',
        )
      )
- Run make generate and commit all generated content.
After approval and merging, you can view the newly generated service overview dashboard.
By default, a dashboard is generated with:
- Default SLIs (e.g. runway_ingress)
- Default Saturation Details (e.g. runway_container_memory_utilization)
The dashboard is checked into version control and can be extended with custom SLIs. Optionally, you can use the general Runway Service Metrics dashboard.
Default Metrics
Default metrics are reported under the stackdriver_cloud_run_* namespace in Mimir, even without a service catalog entry:
    stackdriver_cloud_run_revision_run_googleapis_com_request_count{job="runway-exporter",env="gprd",service_name="my_service"}
To learn more, refer to Cloud Run metrics documentation.
Custom Metrics
Custom metrics can be reported using Prometheus text-based exposition format. When scrape targets are present, Runway deploys a sidecar OpenTelemetry Collector preconfigured to scrape your ingress container at the specified port(s):
    spec:
      observability:
        scrape_targets:
          - "localhost:8082"
        metrics_path: "/foo" # defaults to /metrics
These custom metrics will be available under the Mimir - Runway data source in Grafana.
To learn more, refer to the Prometheus exporters documentation.
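For readers unfamiliar with the text-based exposition format, the sketch below hand-renders a single counter using only the Go standard library and exercises the handler in-process. The metric name my_service_requests_total is hypothetical, and a real service would more likely use a Prometheus client library or Labkit than format the output by hand.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// requestsTotal is a hypothetical counter for illustration.
var requestsTotal atomic.Uint64

// renderMetrics returns the counter in Prometheus text exposition format:
// a HELP line, a TYPE line, then one sample per line.
func renderMetrics(count uint64) string {
	return "# HELP my_service_requests_total Total requests handled.\n" +
		"# TYPE my_service_requests_total counter\n" +
		fmt.Sprintf("my_service_requests_total %d\n", count)
}

// metricsHandler serves the current counter value.
func metricsHandler(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
	fmt.Fprint(w, renderMetrics(requestsTotal.Load()))
}

func main() {
	requestsTotal.Add(3)

	mux := http.NewServeMux()
	mux.HandleFunc("/metrics", metricsHandler)

	// Exercise the handler in-process. A real service would instead call
	// http.ListenAndServe(":8082", mux) so that the port matches the
	// "localhost:8082" scrape target.
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	fmt.Print(rec.Body.String())
}
```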
Alerts
Alerts are generated automatically for any service with a metrics catalog entry. By default the following SLO violation alerts are created:
- Apdex SLO violation
- Error SLO violation
- Traffic absent / traffic cessation
Routing
Alerts are routed via Alertmanager to Slack and incident.io.
- Alerts default to the #feed_alerts-general Slack channel. This channel is very noisy, so it is strongly recommended to route alerts to a channel you actively monitor.
- Only S1 or S2 severity alerts page the on-call SRE. S4 (default) alerts are reported to Slack only.
- To route alerts to a team Slack channel, specify a valid team in your metrics catalog entry.
- For full routing configuration, refer to the alert routing documentation.
Configuration
To override alert thresholds, set the following fields in your metrics catalog entry:
| Option | Description | Default |
|---|---|---|
| apdexScore | Apdex SLO threshold | 0.999 |
| errorRatio | Error SLO threshold | 0.999 |
| severity | Alert severity (s1–s4); S1/S2 pages the on-call SRE | s4 |
Before setting S1 or S2 severity, your service must complete a production readiness review.
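The paging rule can be stated compactly in code. The sketch below encodes only what is described above: the default severity is s4, and only s1 and s2 page the on-call SRE. The function names are hypothetical, and the actual routing is performed by Alertmanager configuration, not application code.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultSeverity is the default from the configuration table.
const defaultSeverity = "s4"

// effectiveSeverity returns the configured severity in lower case, or the
// default when none is set.
func effectiveSeverity(configured string) string {
	if configured == "" {
		return defaultSeverity
	}
	return strings.ToLower(configured)
}

// pagesOnCall reports whether an alert of the given severity pages the
// on-call SRE. Per the routing rules above, only s1 and s2 do.
func pagesOnCall(severity string) bool {
	switch effectiveSeverity(severity) {
	case "s1", "s2":
		return true
	default:
		return false
	}
}

func main() {
	for _, sev := range []string{"", "s1", "s2", "s3", "s4"} {
		fmt.Printf("severity=%q effective=%s pages=%v\n",
			sev, effectiveSeverity(sev), pagesOnCall(sev))
	}
}
```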
Runway application logs are available in Grafana via ClickHouse. To query them, filter on ServiceName (your runway_service_id). For details, refer to the Logging documentation.