
Runway Observability

Runway supports observability for a service by integrating with the monitoring stack. This includes both service-level metrics and load balancer observability across AWS, GKE, and Cloud Run environments.

Runway provides unified load balancer observability across all cloud environments using normalized runway_lb_* metrics.

The observability pipeline uses provider-native exporters with OpenTelemetry normalization:

GCP (GKE):

  • Stackdriver Exporter deployed to designated clusters via ArgoCD
  • Collects metrics from Cloud Load Balancing API
  • OTel Gateway normalizes to runway_lb_* schema

AWS:

  • CloudWatch Exporter deployed to EKS clusters via ArgoCD
  • Collects metrics from CloudWatch API
  • OTel Gateway normalizes to runway_lb_* schema

All metrics are exported to Mimir with X-Scope-OrgID: runway.
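When querying these metrics outside Grafana, the tenant header has to be sent with each request. The following is a minimal sketch, assuming a Prometheus-compatible query endpoint; the base URL `https://mimir.example.com/prometheus` and the helper name `newMimirQuery` are hypothetical placeholders, not part of Runway.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// newMimirQuery builds an instant-query request against a Prometheus-compatible
// HTTP API and sets the tenant header used for Runway metrics.
// baseURL is a hypothetical placeholder; substitute your environment's endpoint.
func newMimirQuery(baseURL, promQL string) (*http.Request, error) {
	params := url.Values{}
	params.Set("query", promQL)

	req, err := http.NewRequest(http.MethodGet, baseURL+"/api/v1/query?"+params.Encode(), nil)
	if err != nil {
		return nil, err
	}
	// Tenant ID taken from the text above: X-Scope-OrgID: runway.
	req.Header.Set("X-Scope-OrgID", "runway")
	return req, nil
}

func main() {
	req, err := newMimirQuery(
		"https://mimir.example.com/prometheus", // hypothetical endpoint
		`sum by (runtime) (rate(runway_lb_request_count{env="production"}[5m]))`,
	)
	if err != nil {
		panic(err)
	}
	fmt.Println("tenant:", req.Header.Get("X-Scope-OrgID"))
	fmt.Println("path:  ", req.URL.Path)
}
```

The sketch only constructs the request; sending it with `http.DefaultClient.Do(req)` would additionally require whatever authentication your endpoint expects.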

Common metrics:

| Metric | Description | Labels |
| --- | --- | --- |
| runway_lb_request_count | Total requests to load balancer | runtime, env, load_balancer (AWS) / forwarding_rule_name (GCP) |
| runway_lb_backend_latency_milliseconds | Backend response time | runtime, env, load_balancer (AWS) / forwarding_rule_name (GCP), statistic (AWS only: average/minimum/maximum) |

GCP-only metrics:

| Metric | Description | Labels |
| --- | --- | --- |
| runway_lb_backend_request_count | Requests reaching backends | runtime, env, forwarding_rule_name |
| runway_lb_total_latency_milliseconds | End-to-end latency (proxy to client) | runtime, env, forwarding_rule_name |

AWS-only metrics:

| Metric | Description | Labels |
| --- | --- | --- |
| runway_lb_response_code_count | Requests by HTTP status class | runtime, env, load_balancer, response_code_class (2xx/4xx/5xx) |
| runway_lb_backend_latency_milliseconds | Backend latency with statistics | runtime, env, load_balancer, statistic (average/minimum/maximum) |

Labels:

| Label | Values | Description |
| --- | --- | --- |
| runtime | eks, gke | Cloud environment |
| env | production, staging (GKE/EKS) / gprd, gstg (Cloud Run) | Runway environment |
| load_balancer | string | AWS ALB/NLB name (AWS only) |
| forwarding_rule_name | string | GCP forwarding rule name (GKE/Cloud Run) |
| statistic | average, minimum, maximum | Latency statistic (AWS only) |
| response_code_class | 2xx, 4xx, 5xx | HTTP status code class (AWS only) |

Request rate by runtime:

sum by (runtime) (rate(runway_lb_request_count{env="production"}[5m]))

Backend latency p99 across all clouds:

histogram_quantile(
  0.99,
  sum by (le, runtime) (
    rate(runway_lb_backend_latency_milliseconds_bucket{env="production"}[5m])
  )
)
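To make the p99 estimate easier to interpret, the sketch below shows the general idea behind `histogram_quantile` on cumulative `le` buckets: find the bucket that contains the target rank, then interpolate linearly inside it. It is a simplified illustration, not Prometheus's implementation; the `bucket` type, the function names, and the sample bucket values are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// bucket is one cumulative histogram bucket: the number (or rate) of
// observations that were less than or equal to upperBound (the "le" label).
type bucket struct {
	upperBound float64
	count      float64
}

// quantile estimates the q-quantile (0 <= q <= 1) from cumulative buckets.
// Simplified: assumes non-negative bounds and a final +Inf bucket.
func quantile(q float64, buckets []bucket) float64 {
	if len(buckets) < 2 {
		return math.NaN()
	}
	sorted := append([]bucket(nil), buckets...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].upperBound < sorted[j].upperBound })

	last := len(sorted) - 1
	total := sorted[last].count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total

	// First finite bucket whose cumulative count reaches the rank.
	i := sort.Search(last, func(i int) bool { return sorted[i].count >= rank })
	if i == last {
		// The rank falls into the +Inf bucket: report the highest finite bound.
		return sorted[last-1].upperBound
	}

	lower, below := 0.0, 0.0
	if i > 0 {
		lower, below = sorted[i-1].upperBound, sorted[i-1].count
	}
	inBucket := sorted[i].count - below
	if inBucket == 0 {
		return lower
	}
	// Linear interpolation between the bucket's lower and upper bound.
	return lower + (sorted[i].upperBound-lower)*(rank-below)/inBucket
}

// exampleBuckets returns hypothetical latency buckets in milliseconds.
func exampleBuckets() []bucket {
	return []bucket{
		{100, 50}, {250, 90}, {500, 99}, {math.Inf(1), 100},
	}
}

func main() {
	fmt.Printf("p50 ~ %.1f ms\n", quantile(0.50, exampleBuckets()))
	fmt.Printf("p99 ~ %.1f ms\n", quantile(0.99, exampleBuckets()))
}
```

Because the result is interpolated, its accuracy depends on how finely the bucket boundaries are spaced around the quantile of interest.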

Request drop rate by runtime (GKE only - EKS does not have backend_request_count):

sum by (runtime) (
  rate(runway_lb_request_count{env="production", runtime="gke"}[5m])
) - sum by (runtime) (
  rate(runway_lb_backend_request_count{env="production", runtime="gke"}[5m])
)
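The arithmetic behind the drop-rate query can be illustrated with two counter samples per metric. This is a rough sketch with made-up numbers; unlike PromQL's `rate()`, the hypothetical `perSecondRate` helper ignores counter resets and window extrapolation.

```go
package main

import "fmt"

// sample is a counter value observed at a Unix timestamp in seconds.
type sample struct {
	ts    float64
	value float64
}

// perSecondRate returns the average per-second increase between two samples.
// Simplified: no counter-reset handling and no extrapolation.
func perSecondRate(first, last sample) float64 {
	elapsed := last.ts - first.ts
	if elapsed <= 0 {
		return 0
	}
	return (last.value - first.value) / elapsed
}

// dropRate is the frontend request rate minus the rate of requests that
// actually reached a backend, mirroring the subtraction in the query above.
func dropRate(feFirst, feLast, beFirst, beLast sample) float64 {
	return perSecondRate(feFirst, feLast) - perSecondRate(beFirst, beLast)
}

func main() {
	// Hypothetical samples taken 300 seconds (5 minutes) apart.
	feFirst, feLast := sample{0, 1000}, sample{300, 4000} // runway_lb_request_count
	beFirst, beLast := sample{0, 1000}, sample{300, 3700} // runway_lb_backend_request_count

	fmt.Printf("dropped requests per second: %.2f\n", dropRate(feFirst, feLast, beFirst, beLast))
}
```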

Request rate by HTTP status class:

sum by (response_code_class) (
  rate(runway_lb_response_code_count{runtime="eks", env="production"}[5m])
)

Backend latency statistics:

runway_lb_backend_latency_milliseconds{runtime="eks", env="production", statistic=~"average|maximum"}

Total latency p99 (GKE):

histogram_quantile(
  0.99,
  sum by (le, forwarding_rule_name) (
    rate(runway_lb_total_latency_milliseconds_bucket{runtime="gke", env="production"}[5m])
  )
)

The following informational dashboards are available in Grafana for collectively viewing load balancer stats across all Runway services and runtimes:


Service-level metrics for Kubernetes (GKE and EKS) services are collected via OpenTelemetry collectors deployed to each cluster.

Runway Kubernetes services can be integrated with the runbooks observability stack to get the following out of the box:

  • Service overview Grafana dashboard — apdex, error rate, RPS, saturation panels
  • SLO violation alerts — apdex, error rate, and traffic cessation
  • Kubernetes saturation alerts — CPU, memory, HPA utilization

For full background on the metrics catalog, refer to the metrics-catalog README.

Step 1 — Add a service catalog entry

Add your service to services/service-catalog.yml in the runbooks repository. Ensure primary_grafana_dashboard points to <type>-main/<type>-overview.

Step 2 — Create a metrics catalog entry

Create metrics-catalog/services/my-service-gke.jsonnet using runway-k8s-archetype:

metrics-catalog/services/my-service-gke.jsonnet
local k8sArchetype = import 'service-archetypes/runway-k8s-archetype.libsonnet';
local metricsCatalog = import 'servicemetrics/metrics.libsonnet';

metricsCatalog.serviceDefinition(
  k8sArchetype(
    type='my-service-gke',  // must match your runway_service_id
    team='my_team',
    featureCategory='my_feature_category',
    runtime='gke',
  )
)

Step 3 — Register in all.jsonnet

Add an import to metrics-catalog/services/all.jsonnet:

import 'my-service-gke.jsonnet',

Step 4 — Generate and commit

make generate

Commit all generated files. After the MR is merged, your dashboard will be available at https://dashboards.gitlab.net/d/my-service-gke-main.

| Parameter | Description | Default |
| --- | --- | --- |
| type | Service type; must match your runway_service_id | required |
| team | Owning team for alert routing (see valid teams) | required |
| featureCategory | GitLab feature category for alert routing | not_owned |
| runtime | gke or both (EKS support is TBD) | both |
| apdexScore | Apdex SLO threshold (ratio of requests meeting latency target) | 0.999 |
| errorRatio | Error SLO threshold (ratio of requests completing without error) | 0.999 |
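To illustrate what the two thresholds mean, the sketch below compares a measured ratio against the default of 0.999. It is not how the generated alert rules are evaluated; the function names and request counts are hypothetical and only show the meaning of "ratio of requests" in the table above.

```go
package main

import "fmt"

// ratio returns good/total. With no traffic it returns 1, treating an idle
// service as meeting its objective (an assumption made for this sketch).
func ratio(good, total float64) float64 {
	if total == 0 {
		return 1
	}
	return good / total
}

// meetsSLO reports whether the measured ratio is at or above the threshold.
func meetsSLO(good, total, threshold float64) bool {
	return ratio(good, total) >= threshold
}

func main() {
	const apdexScore = 0.999 // default from the parameter table
	const errorRatio = 0.999 // default from the parameter table

	total := 100000.0
	withinLatencyTarget := 99950.0 // hypothetical: 50 slow requests
	withoutError := 99850.0        // hypothetical: 150 failed requests

	fmt.Println("apdex met:", meetsSLO(withinLatencyTarget, total, apdexScore))
	fmt.Println("error met:", meetsSLO(withoutError, total, errorRatio))
}
```

With these numbers the apdex ratio is 0.9995 (above the threshold) while the error ratio is 0.9985 (below it), so only the error objective would be missed.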

The archetype automatically generates a frontend LB SLI using a url_map_name regex derived from your service type:

gkegw1-[^-]+-<type>-.*

This works for most services. If it does not match, find your exact url_map_name values by querying the Mimir - Runway datasource in Grafana Explore:

count by (url_map_name) (runway_lb_request_count{runtime="gke"})

Then override lbSelector explicitly:

k8sArchetype(
  type='my-service-gke',
  team='my_team',
  runtime='gke',
  lbSelector={
    url_map_name: { oneOf: [
      'gkegw1-l52v-my-service-gke-...',  // staging
      'gkegw1-ltbu-my-service-gke-...',  // production
    ] },
  },
)
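Before overriding lbSelector, it can help to check locally whether the derived pattern matches the url_map_name values returned by the query above. The sketch below assumes the pattern is built by substituting the service type into `gkegw1-[^-]+-<type>-.*`; it adds `^` and `$` because Prometheus regex matchers match the whole label value. The `lbPattern` helper and the name suffixes are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// lbPattern builds the url_map_name pattern for a service type.
// The pattern is anchored at both ends to mimic Prometheus, where a regex
// matcher must match the entire label value.
func lbPattern(serviceType string) *regexp.Regexp {
	return regexp.MustCompile("^gkegw1-[^-]+-" + regexp.QuoteMeta(serviceType) + "-.*$")
}

func main() {
	re := lbPattern("my-service-gke")

	// Hypothetical url_map_name values; real ones come from the Mimir query.
	names := []string{
		"gkegw1-l52v-my-service-gke-abc123",
		"gkegw1-l52v-other-service-abc123",
		"gkegw1-a-b-my-service-gke-abc123", // extra segment: [^-]+ cannot span a hyphen
	}
	for _, name := range names {
		fmt.Printf("%-40s match=%v\n", name, re.MatchString(name))
	}
}
```

If a real url_map_name does not match, that is the case where the explicit `oneOf` override is needed.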

Runway supports exposing custom application metrics via Prometheus, collected by OTel and pushed to Mimir. These can then be referenced as custom SLIs in the metrics catalog alongside the default LB-level SLIs.

Configure Prometheus scrape targets in your default-values.yaml:

.runway/my-service-gke/default-values.yaml
spec:
  observability:
    scrape_targets:
      - "localhost:8082"

This creates a Kubernetes Service named metrics with ports matching your targets. The cluster-level OpenTelemetry Gateway collector automatically discovers and scrapes services with ports named metrics-*.

For Go applications, we recommend using Labkit to expose Prometheus metrics. Labkit provides:

  • labkit/metrics - HTTP handler instrumentation with request count and latency histograms
  • labkit/monitoring - Monitoring server that serves the /metrics endpoint

See the example-service frontend for a complete implementation:

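import (
	"log"

	"gitlab.com/gitlab-org/labkit/monitoring"
)

// serveMetrics starts the Labkit monitoring server, which serves /metrics
// on the port configured as a scrape target.
func serveMetrics() {
	opts := []monitoring.Option{
		monitoring.WithListenerAddress(":8082"),
	}
	// Start returns an error if the monitoring server fails.
	if err := monitoring.Start(opts...); err != nil {
		log.Fatalf("monitoring server failed: %v", err)
	}
}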

Custom metrics are available in the Mimir - Runway datasource in Grafana.

Example query:

http_requests_total{env="production", kubernetes_namespace="my-service-gke"}

Once your metrics are available in Mimir, you can add them as SLIs in the metrics catalog:

metrics-catalog/services/my-service-gke.jsonnet
local k8sArchetype = import 'service-archetypes/runway-k8s-archetype.libsonnet';
local metricsCatalog = import 'servicemetrics/metrics.libsonnet';
local rateMetric = metricsCatalog.rateMetric;

local type = 'my-service-gke';
local featureCategory = 'my_feature_category';

metricsCatalog.serviceDefinition(
  k8sArchetype(
    type=type,
    team='my_team',
    featureCategory=featureCategory,
    runtime='gke',
  )
  // Custom application-level SLIs on top of archetype defaults
  {
    serviceLevelIndicators+: {
      my_server: {
        userImpacting: true,
        featureCategory: featureCategory,
        requestRate: rateMetric(
          counter='http_requests_total',
          selector={ type: type },
        ),
        errorRate: rateMetric(
          counter='http_requests_total',
          selector={ type: type, status: '5xx' },
        ),
        significantLabels: ['handler', 'status'],
      },
    },
  }
)

See the metrics-catalog README for more details on defining SLIs.

| Label | Description | Example |
| --- | --- | --- |
| env | Environment | staging, production |
| cloud_provider | Cloud provider | gcp, aws |
| cloud_runtime | Kubernetes runtime | gke, eks |
| cloud_region | Deployment region | us-east1, us-east-1 |
| k8s_cluster_name | Cluster name | runway-gke-gprd-us-east1 |
| kubernetes_namespace | Kubernetes namespace | my-service-gke |

Service-level observability via the metrics catalog is supported for Cloud Run services.

To get a service overview dashboard and SLO alerts:

  1. Create a new entry in the service catalog in the expected format.
  2. Create a new entry in the metrics catalog:
metrics-catalog/services/my-service.jsonnet
local runwayArchetype = import 'service-archetypes/runway-archetype.libsonnet';
local metricsCatalog = import 'servicemetrics/metrics.libsonnet';

metricsCatalog.serviceDefinition(
  runwayArchetype(
    type='my_service',
    team='my_team',
  )
)
  3. Run make generate and commit all generated content.

After approval and merging, you can view the newly generated service overview dashboard.

By default, a dashboard is generated with:

  • Default SLIs (e.g. runway_ingress)
  • Default Saturation Details (e.g. runway_container_memory_utilization)

The dashboard is checked into version control and can be extended with custom SLIs. Optionally, you can use the general Runway Service Metrics dashboard.

Default metrics are reported under the stackdriver_cloud_run_* namespace in Mimir, even without a service catalog entry:

stackdriver_cloud_run_revision_run_googleapis_com_request_count{job="runway-exporter",env="gprd",service_name="my_service"}

To learn more, refer to Cloud Run metrics documentation.

Custom metrics can be reported using Prometheus text-based exposition format. When scrape targets are present, Runway deploys a sidecar OpenTelemetry Collector preconfigured to scrape your ingress container at the specified port(s):

spec:
  observability:
    scrape_targets:
      - "localhost:8082"
    metrics_path: "/foo" # defaults to /metrics

These custom metrics will be available under the Mimir - Runway data source in Grafana.

To learn more, refer to the Prometheus exporters documentation.
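For reference, the text-based exposition format is simple enough to show by hand. The sketch below renders one counter and serves it from an HTTP handler; it uses only the Go standard library, and the names `renderCounter` and `metricsHandler` as well as the sample counts are hypothetical. A real service would normally use a client library (for Go, Labkit as recommended earlier) instead of formatting metrics manually.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sort"
	"strings"
)

// renderCounter renders a single counter, broken down by a "status" label,
// in the Prometheus text exposition format.
func renderCounter(name, help string, counts map[string]int) string {
	var b strings.Builder
	fmt.Fprintf(&b, "# HELP %s %s\n", name, help)
	fmt.Fprintf(&b, "# TYPE %s counter\n", name)

	// Sort label values so the output is stable between scrapes.
	statuses := make([]string, 0, len(counts))
	for status := range counts {
		statuses = append(statuses, status)
	}
	sort.Strings(statuses)

	for _, status := range statuses {
		// %q quotes the label value; adequate for simple values like these.
		fmt.Fprintf(&b, "%s{status=%q} %d\n", name, status, counts[status])
	}
	return b.String()
}

// metricsHandler serves the rendered counter, as a scrape target would.
func metricsHandler(counts map[string]int) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		fmt.Fprint(w, renderCounter("http_requests_total", "Total HTTP requests.", counts))
	}
}

func main() {
	handler := metricsHandler(map[string]int{"200": 42, "500": 1})

	// Exercise the handler in-process instead of opening a listener.
	rec := httptest.NewRecorder()
	handler(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	fmt.Print(rec.Body.String())
}
```

In a deployed service the handler would be registered on the port and path listed under scrape_targets and metrics_path, for example with `http.Handle("/metrics", handler)` and a listener on `localhost:8082`.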


Alerts are generated automatically for any service with a metrics catalog entry. By default the following SLO violation alerts are created:

  • Apdex SLO violation
  • Error SLO violation
  • Traffic absent / traffic cessation

Alerts are routed via Alertmanager to Slack and incident.io.

  • Alerts default to the #feed_alerts-general Slack channel — this channel is very noisy. It is strongly recommended to route alerts to a channel you actively monitor.
  • Only S1 or S2 severity alerts page the on-call SRE. S4 (default) alerts are reported to Slack only.
  • To route alerts to a team Slack channel, specify a valid team in your metrics catalog entry.
  • For full routing configuration, refer to the alert routing documentation.

To override alert thresholds, set the following fields in your metrics catalog entry:

| Option | Description | Default |
| --- | --- | --- |
| apdexScore | Apdex SLO threshold | 0.999 |
| errorRatio | Error SLO threshold | 0.999 |
| severity | Alert severity (s1 to s4); S1/S2 pages on-call SRE | s4 |

Before setting S1 or S2 severity, your service must complete a production readiness review.


Runway application logs are available in Grafana via ClickHouse. You can query logs by filtering on ServiceName (your runway_service_id). Please refer to the Logging documentation.