Runway Observability

Runway supports observability for a service by integrating with the monitoring stack.

Setup

Currently, the prerequisite for observability is a service catalog entry. Follow these steps:

Create a new entry in the service catalog in the expected format, e.g. my_service.

Create a new entry in the metrics catalog, e.g.:

local runwayArchetype = import 'service-archetypes/runway-archetype.libsonnet';
local metricsCatalog = import 'servicemetrics/metrics.libsonnet';

metricsCatalog.serviceDefinition(
  runwayArchetype(
    type='my_service',
    team='my_team',
  )
)

Run make generate and commit the autogenerated content.

After the change is approved and merged, you can view the newly generated service overview dashboard.

Metrics

By default, metrics are reported for a service even if a service catalog entry does not exist yet. Optionally, you can report custom metrics for a service.

Default

Default metrics are reported under the stackdriver_cloud_run_* metric namespace in Mimir, e.g.:

stackdriver_cloud_run_revision_run_googleapis_com_request_count{job="runway-exporter",env="gprd",service_name="my_service"}

To learn more, refer to documentation.

Custom Metrics

Cloud Run

Custom metrics can be reported using the Prometheus text-based exposition format. When scrape targets are present, Runway deploys an OpenTelemetry Collector sidecar container that is preconfigured to automatically scrape the ingress container at your specified port(s). To enable scraping, add the following configuration:

# omitted for brevity
spec:
  observability:
    scrape_targets:
      - "localhost:8082"

These custom metrics will be available under the Mimir - Runway data source in Grafana.
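
As an illustration of what a scrape target could serve, here is a minimal sketch using only the Python standard library. It assumes the collector scrapes the conventional /metrics path on port 8082 (the path is an assumption, it is not stated above), and the metric name my_service_requests_total is hypothetical. A real service would more likely use a Prometheus client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(counters):
    """Render {name: (help_text, value)} counters in the Prometheus text exposition format."""
    lines = []
    for name, (help_text, value) in sorted(counters.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical in-memory counter; a real service would update this as it handles requests.
COUNTERS = {"my_service_requests_total": ("Total requests handled.", 0)}


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Assumption: the collector scrapes the /metrics path.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(COUNTERS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8082):
    """Serve metrics on the port listed under scrape_targets (blocks forever)."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve() from the application's startup code (for example in a background thread) would expose the endpoint on port 8082, matching the scrape target in the configuration above.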

To learn more, refer to documentation.

Dashboards

By default, a dashboard is generated for a service with the following service overview panels:

Default SLIs (e.g. runway_ingress)
Default Saturation Details (e.g. runway_container_memory_utilization)

The dashboard is checked into version control and can be extended with custom SLIs. Optionally, you can use the general Runway Service Metrics dashboard.

To learn more, refer to documentation.

Alerts

By default, alerts are generated for a service with the following SLOs:

Apdex SLO violation
Error SLO violation
Traffic absent SLO violation

To override the default configuration, set the following fields in the metrics catalog entry:

Option	Description	Default
`apdexSatisfiedThreshold`	Alter expected request latency of the Runway service	`1024` ms
`apdexScore`	Alter apdex threshold for the Runway service	`0.999`
`errorScore`	Alter how many errors are tolerated for the Runway service	`0.999`

For routing, you must specify a valid team in the metrics catalog entry.
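
To make the defaults in the table above concrete, the following sketch shows the arithmetic behind the apdex and error thresholds. It assumes that apdex is the fraction of requests completing within apdexSatisfiedThreshold and that errorScore is the minimum fraction of successful requests. The function names are hypothetical, and the generated alerting rules evaluate these ratios over time windows in Prometheus rather than over a list of samples.

```python
def apdex(latencies_ms, satisfied_threshold_ms=1024):
    """Fraction of requests that completed within the satisfied threshold."""
    if not latencies_ms:
        return None  # no traffic, so no score can be computed
    satisfied = sum(1 for latency in latencies_ms if latency <= satisfied_threshold_ms)
    return satisfied / len(latencies_ms)


def violates_apdex_slo(latencies_ms, satisfied_threshold_ms=1024, apdex_score=0.999):
    """True when the observed apdex falls below the configured apdexScore."""
    score = apdex(latencies_ms, satisfied_threshold_ms)
    return score is not None and score < apdex_score


def violates_error_slo(total_requests, failed_requests, error_score=0.999):
    """True when the success ratio falls below the configured errorScore."""
    if total_requests == 0:
        return False
    success_ratio = (total_requests - failed_requests) / total_requests
    return success_ratio < error_score
```

With the defaults, 2 slow requests out of 1,000 give an apdex of 0.998, which is below 0.999 and would count as a violation under these assumptions; likewise 20 failed requests out of 10,000 give a success ratio of 0.998.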

To learn more, refer to documentation.

Logs

As a short-term solution, Runway application container logs can be viewed in the Cloud Logging UI by filtering resource.labels.service_name to your runway_service_id.

As a long-term solution, Runway application container logs will adopt the Observability team’s standardized solution.

To learn more, refer to documentation.

How this works in practice - Guided Example

Secret Detection Service (Runway-managed) is in both the service catalog and the metrics catalog. Two example questions:

Are the alerts listed here the ones that are monitored for SLO violations? If yes, then what are the alerts defined in https://alerts.gitlab.net?
Since the service borrows Runway’s SLI defaults, does it mean that in the event of an SLO violation, an incident issue is raised with severity S4? If not, what triggers an active incident issue and a PagerDuty alert?

Alerts will be present for all SLO violations. To see which alert rules have been added, use the alerts page in Grafana and filter for the service in the search box.
They will not show up on Alertmanager (alerts.gitlab.net) until an alert fires.
Alertmanager then decides what to do with the alert: post it to Slack, page an SRE through PagerDuty, etc.
Because the default SLIs have S4 severity, they will not page for an SLO violation, but just get reported to Slack for now. You can reroute those alerts to an alert channel of your choice: https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/uncategorized/alert-routing.md
Incident issues are usually raised by the on-call SRE when they are paged. But feel free to raise one yourself if you need assistance. Read more about reporting an incident in the handbook.

Runway has all environments in one tenant. We separate them with the environment label, for example environment="gprd" or environment="gstg". The stage label differentiates canary from the main stage, for example stage="main" or stage="cny". Runway services do not have a canary stage, however, so in practice this will only be stage="main".
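
As a small illustration of how these labels narrow a query, the sketch below builds a PromQL selector from label values. The helper name selector and the metric name my_service_requests_total are hypothetical, and label values are not escaped, so it only suits simple values such as the ones shown.

```python
def selector(metric, **labels):
    """Build a PromQL selector with exact-match label filters, sorted by label name."""
    matchers = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return f"{metric}{{{matchers}}}"


# Query production only; Runway services only have the main stage.
query = selector("my_service_requests_total", environment="gprd", stage="main")
print(query)
```

Swapping environment="gprd" for environment="gstg" would target staging instead.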

All alerts go to the #feed_alerts-general channel by default, which makes that channel very noisy. It is therefore highly recommended to route the alerts you’re interested in to a channel that you will monitor. Only alerts marked as S1 or S2 will page the on-call SRE, so setting one of those severities on your service enables paging. Before doing that, the service should go through a readiness review: https://handbook.gitlab.com/handbook/engineering/infrastructure/production/readiness/
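
The routing behaviour described above can be summarised in a toy model. This is not Alertmanager configuration; the function name, the lowercase severity values, and the assumption that a team channel replaces the default channel are all illustrative.

```python
PAGING_SEVERITIES = {"s1", "s2"}  # only these severities page the on-call SRE
DEFAULT_CHANNEL = "#feed_alerts-general"


def describe_routing(severity, team_channel=None):
    """Summarise where an alert would end up under the rules described above."""
    return {
        # Assumption: a configured team channel is used instead of the default one.
        "slack_channel": team_channel or DEFAULT_CHANNEL,
        "pages_on_call": severity.lower() in PAGING_SEVERITIES,
    }
```

For example, describe_routing("S4") reports the default channel with no paging, which matches the behaviour of the default Runway SLIs.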