Runway Observability
Runway supports observability for a service by integrating with the monitoring stack.
Setup
Currently, a service catalog entry is a prerequisite for observability. Follow these steps:
- Create a new entry in the service catalog, e.g. my_service.
- Create a new entry in the metrics catalog.
- Run make generate and commit the autogenerated content.
After the change is approved and merged, you can view the newly generated service overview dashboard.
Metrics
By default, metrics are reported for a service even if a service catalog entry does not exist yet. Optionally, you can report custom metrics for a service.
Default
Default metrics are reported under the stackdriver_cloud_run_* metric namespace in Mimir.
To learn more, refer to documentation.
Custom
Custom metrics can be reported using the Prometheus text-based exposition format. When scrape targets are present, Runway deploys a sidecar container running the OpenTelemetry Collector, preconfigured to automatically scrape the ingress container at your specified port(s). To enable this, configure scrape targets for your service.
To learn more, refer to documentation.
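As an illustration of what the collector scrapes, here is a minimal sketch that renders one counter in the Prometheus text-based exposition format using only the Python standard library. The metric name my_service_jobs_total and the status label are hypothetical. In a real service you would more likely use a Prometheus client library and serve this output over HTTP on the port listed in your scrape targets.

```python
def _escape_help(text: str) -> str:
    """HELP text escapes backslash and newline."""
    return text.replace("\\", "\\\\").replace("\n", "\\n")


def _escape_label_value(value: str) -> str:
    """Label values escape backslash, double quote, and newline."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name, help_text, samples):
    """Render a counter family as Prometheus text exposition lines.

    `samples` is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {_escape_help(help_text)}", f"# TYPE {name} counter"]
    for labels, value in samples:
        if labels:
            rendered = ",".join(
                f'{key}="{_escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{rendered}}} {value}")
        else:
            lines.append(f"{name} {value}")
    # The format requires each line, including the last, to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical metric and labels, for illustration only.
    print(render_counter(
        "my_service_jobs_total",
        "Jobs processed.",
        [({"status": "ok"}, 3), ({"status": "failed"}, 1)],
    ), end="")
```

The sketch covers only counters without timestamps; histograms and other metric types need additional lines per the exposition format.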
Dashboards
By default, a dashboard is generated for a service with the following service overview panels:
- Default SLIs (e.g. runway_ingress)
- Default Saturation Details (e.g. runway_container_memory_utilization)
The dashboard is checked into version control and can be extended with custom SLIs. Optionally, you can use the general Runway Service Metrics dashboard.
To learn more, refer to documentation.
Alerts
By default, alerts are generated for a service with the following SLOs:
- Apdex SLO violation
- Error SLO violation
- Traffic absent SLO violation
To override the default configuration, set the following fields in the metrics catalog entry:
Option | Description | Default |
---|---|---|
apdexSatisfiedThreshold | Alter expected request latency of the Runway service | 1024 ms |
apdexScore | Alter apdex threshold for the Runway service | 0.999 |
errorScore | Alter how many errors are tolerated for the Runway service | 0.999 |
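
To build intuition for how the three options interact, the following sketch checks a batch of requests against the default values from the table. It is a simplification under stated assumptions: apdex is taken as the fraction of requests at or below the satisfied threshold, errorScore is compared against the fraction of non-error requests, and the function names are hypothetical. The generated alerts evaluate these ratios over time windows rather than over a single batch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloConfig:
    # Defaults copied from the table above.
    apdex_satisfied_threshold_ms: float = 1024
    apdex_score: float = 0.999
    error_score: float = 0.999


def apdex_ratio(latencies_ms, threshold_ms):
    """Fraction of requests that finished within the satisfied threshold."""
    if not latencies_ms:
        return 1.0
    satisfied = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return satisfied / len(latencies_ms)


def success_ratio(total_requests, error_requests):
    """Fraction of requests that did not fail."""
    if total_requests == 0:
        return 1.0
    return (total_requests - error_requests) / total_requests


def slo_violations(latencies_ms, error_requests, config=SloConfig()):
    """Return the names of the SLOs this batch would violate."""
    violated = []
    if apdex_ratio(latencies_ms, config.apdex_satisfied_threshold_ms) < config.apdex_score:
        violated.append("apdex")
    if success_ratio(len(latencies_ms), error_requests) < config.error_score:
        violated.append("error")
    return violated
```

With the defaults, a batch of 1000 requests in which two take longer than 1024 ms has an apdex of 0.998 and would count as an apdex violation in this model.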
For routing, you must specify a valid team in the metrics catalog entry.
To learn more, refer to documentation.
Logs
As a short-term solution, Runway application container logs can be viewed in the Cloud Logging UI by filtering resource.labels.service_name by your runway_service_id.
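The filter can be written directly in the Cloud Logging query box. As a small sketch, this hypothetical helper builds the filter string for a given service ID; my_service is a placeholder.

```python
def cloud_logging_filter(runway_service_id: str) -> str:
    """Build a Cloud Logging filter that matches one Runway service."""
    # Escape backslashes and double quotes so the value stays a valid quoted string.
    escaped = runway_service_id.replace("\\", "\\\\").replace('"', '\\"')
    return f'resource.labels.service_name="{escaped}"'


if __name__ == "__main__":
    print(cloud_logging_filter("my_service"))
```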
As a long-term solution, Runway application container logs will adopt the Observability team’s standardized solution.
To learn more, refer to documentation.
How this works in practice - Guided Example
The Secret Detection Service (Runway-managed) is in both the service catalog and the metrics catalog. Two example questions, with answers below:
- Are the alerts listed here the ones that are monitored for SLO violations? If yes, then what are the alerts defined in https://alerts.gitlab.net?
- Since the service borrows Runway’s SLI defaults, does it mean that in the event of an SLO violation, an incident issue is raised with severity S4? If not, what triggers an active incident issue and a PagerDuty alert?
- Alerts are present for all SLO violations. To see which alert rules have been added, use the alerts page in Grafana and filter for the service in the search box.
- Alerts do not show up in Alertmanager (alerts.gitlab.net) until an alert fires.
- Alertmanager then decides what to do with the alert: post it to Slack, page an SRE through PagerDuty, etc.
- Because the default SLIs have S4 severity, they do not page for an SLO violation; for now they are only reported to Slack. You can reroute those alerts to an alert channel of your choice: https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/uncategorized/alert-routing.md
- Incident issues are usually raised by the on-call SRE when they are paged, but feel free to raise one yourself if you need assistance. Read more about reporting an incident in the handbook.
Runway has all environments in one tenant. We separate them with the environment label, for example environment="gprd" or environment="gstg". stage is the label we use to differentiate canary from the main stage, for example stage="main" or stage="cny". Runway services do not have a canary stage, though, so in practice the value is always stage="main".
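When querying metrics, these labels become matchers in a PromQL selector. The sketch below assembles a selector that matches every series in the stackdriver_cloud_run_* namespace for one environment and stage. The helper name is hypothetical, and the name regex stands in for a specific metric name, which you would normally use instead.

```python
def promql_selector(name_regex: str, **labels: str) -> str:
    """Build a PromQL selector from a metric-name regex and exact label matchers."""
    def quote(value: str) -> str:
        # Escape backslashes and double quotes inside the quoted PromQL string.
        return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'

    matchers = ["__name__=~" + quote(name_regex)]
    matchers += [f"{label}={quote(value)}" for label, value in labels.items()]
    return "{" + ",".join(matchers) + "}"


if __name__ == "__main__":
    print(promql_selector("stackdriver_cloud_run_.*", environment="gprd", stage="main"))
```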
All alerts go to the #feed_alerts-general channel by default, which makes that channel very noisy. For that reason, it is highly recommended to route the alerts you’re interested in to a channel that you monitor. Only alerts marked as S1 or S2 page the on-call SRE, so setting that severity on your service enables paging. Before doing so, the service should go through a readiness review: https://handbook.gitlab.com/handbook/engineering/infrastructure/production/readiness/
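
The routing behavior described above can be summarized as a small decision function. This is only an illustration of the rules in the text, not the actual Alertmanager configuration. The function name, the return shape, and the #my_team_alerts channel are hypothetical.

```python
DEFAULT_CHANNEL = "#feed_alerts-general"
PAGING_SEVERITIES = {"S1", "S2"}


def route_alert(severity: str, team_channel=None) -> dict:
    """Decide where an alert is posted and whether it pages the on-call SRE."""
    return {
        # Alerts land in the default channel unless the team has rerouted them.
        "channel": team_channel or DEFAULT_CHANNEL,
        # Only S1 and S2 alerts page; the default S4 alerts never do.
        "pages_on_call": severity.upper() in PAGING_SEVERITIES,
    }


if __name__ == "__main__":
    print(route_alert("S4"))
    print(route_alert("S2", team_channel="#my_team_alerts"))
```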