Runway Jobs

Summary

Runway currently only supports the deployment of stateless services. A popular feature request is the ability to run jobs (one-off or scheduled) on Runway. Stateful services may require maintenance tasks, such as compaction or pruning, that run periodically and should not be on the service’s serving path.

The Cloud Run runtime supports maintenance tasks via the Cloud Run Jobs feature. By tapping into this offering, Runway will be a more viable option for teams to deploy their workloads as they will no longer be restricted to just stateless services.

Motivation

There are prospective use-cases which Runway as a platform does not fulfil because Runway currently only supports stateless services.

For example, Topology Service needs to run jobs to apply schema changes, and the ai-framework group has raised jobs as an essential component still missing from Runway.

Goals

  • To let service owners define periodically scheduled jobs in Runway.
  • To let service owners define and trigger maintenance jobs in Runway. Maintenance jobs run as a separate process that does not handle user traffic.

Non-Goals

  • Not to provide support for jobs as part of a service deployment pipeline.

    Rationale: needlessly couples job to service deployment.

  • Not to provide service owners a means to trigger jobs outside of a workflow.

    Rationale: the job pipeline should be triggered through the service project’s CI pipeline.

  • Not to provide support for background job processing.

    Rationale: the functionality can be supported through Cloud Tasks or Cloud Run services.

Proposal

Currently, Runway supports deploying only a single type of workload, which we’ve colloquially referred to as a Runway service. Because this proposal adds support for jobs, and because service is an overloaded term, we would like to introduce and encourage the term Runway workload in place of Runway service.

We define a Runway workload as a deployable object defined in inventory.yml. A workload has its own set of infrastructure resources for deployment, such as:

  • deployment repository
  • service accounts for reconciler prefixed with rcr-* and Cloud Run prefixed with crun-*
  • artifact repository
  • vault path

Each workload runs in a deployment project and is deployed independently through an upstream service project trigger.

Currently, we only support a single Runway workload type (a service), but we are proposing to introduce a new class of workload type: job.

A job may be configured to run in either of two ways:

  1. on-demand: triggered by service owners via CI (see below)
  2. scheduled: triggered by Cloud Scheduler on a schedule defined in the manifest (see below)

Design and implementation details

Job architecture

Defining jobs in the service repository CI YAML

Each Runway job is defined as a single element in the include array. The runway_service_id would correspond to the deployment repository name.

For example, a service owner who wants to add two jobs can define them as follows:

include:
  # ...
  - project: 'gitlab-com/gl-infra/platform/runway/runwayctl'
    file: 'ci-tasks/service-project/runway.yml'
    inputs:
      runway_service_id: topo-svc-schema-change-job
      image: "$CI_REGISTRY_IMAGE/deploy:$CI_COMMIT_SHORT_SHA"
      runway_version: v2.21.0
  - project: 'gitlab-com/gl-infra/platform/runway/runwayctl'
    file: 'ci-tasks/service-project/runway.yml'
    inputs:
      runway_service_id: topo-svc-compactor-cronjob
      image: "$CI_REGISTRY_IMAGE/deploy:$CI_COMMIT_SHORT_SHA"
      runway_version: v2.21.0

Both of these workloads would need to be defined in the provisioner’s inventory.yml file.

Introducing RunwayJob

We will introduce a new kind: RunwayJob. A runway.yml config will look like this:

apiVersion: runway/v1
kind: RunwayJob
metadata:
  name: topology-service-compactor
spec:
  region: us-central1
  command: ["bundle"]
  args: ["exec", "rake", "compactor:execute"]
  resources:
    # omitted for brevity
  schedule: "0 * * * *" # cron format
  # other common keys like scalability, observability are omitted for brevity

command and args would map to the corresponding container arguments of the google_cloud_run_v2_job Terraform resource.

schedule (interpreted in UTC) will be used in a google_cloud_scheduler_job resource, which triggers the Cloud Run job through http_target as outlined in the Google Cloud guide. We will allow a single schedule per job. If we find that users would like to configure multiple schedules, we can iterate on this.

While the cron format is not the most human-friendly format to parse, it is fairly common and most developers are familiar with it. There are websites like crontab.guru that help with the generation of cron expressions.
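
To make the schedule semantics concrete, the following is a minimal sketch of how a five-field cron expression can be checked against a UTC timestamp. It is illustrative only: the helper names expand_field and matches are hypothetical, the sketch handles only `*`, single values, comma lists, ranges, and `/n` steps (no month or weekday names, no `7` for Sunday), and Cloud Scheduler's own parser remains the authority on what it accepts.

```python
from datetime import datetime, timezone

# Inclusive bounds for: minute, hour, day of month, month, day of week.
FIELD_BOUNDS = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 6)]


def expand_field(field, low, high):
    """Expand one cron field into the set of integer values it allows."""
    values = set()
    for part in field.split(","):
        base, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if step < 1:
            raise ValueError(f"invalid step in {field!r}")
        if base == "*":
            start, end = low, high
        elif "-" in base:
            start_text, end_text = base.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(base)
        if not low <= start <= end <= high:
            raise ValueError(f"{field!r} is outside the range {low}-{high}")
        values.update(range(start, end + 1, step))
    return values


def matches(schedule, moment):
    """Return True if the timezone-aware `moment` falls on the schedule (UTC)."""
    fields = schedule.split()
    if len(fields) != 5:
        raise ValueError("expected five space-separated cron fields")
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    minute, hour, dom, month, dow = (
        expand_field(f, lo, hi) for f, (lo, hi) in zip(fields, FIELD_BOUNDS)
    )
    moment = moment.astimezone(timezone.utc)
    cron_weekday = (moment.weekday() + 1) % 7  # Python: Monday=0, cron: Sunday=0
    dom_ok = moment.day in dom
    dow_ok = cron_weekday in dow
    # Classic cron rule: if both day fields are restricted, either may match.
    if fields[2] != "*" and fields[4] != "*":
        day_ok = dom_ok or dow_ok
    else:
        day_ok = dom_ok and dow_ok
    return (
        moment.minute in minute
        and moment.hour in hour
        and moment.month in month
        and day_ok
    )
```

With the example manifest's schedule, `matches("0 * * * *", ...)` holds for any timestamp at minute zero of an hour, which is the "top of every hour" behaviour the compactor example relies on.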

CI pipelines

Each workload type would run the following stages:

  • Service: preflight checks -> deploy -> monitor
  • Job: preflight checks -> deploy
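
The stage lists above can be expressed as a simple lookup. This is only a sketch of the idea: the workload type strings "service" and "job" and the function name stages_for are assumptions made for illustration, not names taken from runwayctl.

```python
# Stages per workload type, as listed in this proposal.
STAGES_BY_WORKLOAD = {
    "service": ["preflight checks", "deploy", "monitor"],
    "job": ["preflight checks", "deploy"],
}


def stages_for(workload_type):
    """Return the ordered pipeline stages for a workload type."""
    try:
        # Return a copy so callers cannot mutate the shared table.
        return list(STAGES_BY_WORKLOAD[workload_type])
    except KeyError:
        raise ValueError(f"unknown workload type: {workload_type!r}") from None
```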

To trigger a one-off RunwayJob, users can define a CI job that extends the pre-defined .execute-job, adding environment variables if required.

# runwayctl/ci-tasks/service-project/impl.yml
.execute-job:
  trigger:
    project: "gitlab-com/gl-infra/platform/runway/deployments/${RUNWAY_SERVICE_ID}"
    branch: main
    strategy: depend
  needs: 🛫 [$[[ inputs.runway_service_id ]]] Trigger runway deployment production
  variables:
    RUNWAY_EXECUTE_JOB: true # used in CI job rules to skip unrelated jobs

# service-project/.gitlab-ci.yml
Run compact job:
  extends: .execute-job
  variables:
    JOB: compact
Run schema change:
  extends: .execute-job
  variables:
    JOB: schema_change

Reconciler changes

The workload type can be passed into runwayctl as an argument, read from the environment using os.Getenv, or read from kind in runway.yml.
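
As a rough illustration of how these three sources could be combined, the sketch below resolves the workload type with an assumed precedence of argument, then environment variable, then manifest kind. The proposal does not fix a precedence, so that ordering is an assumption. The environment variable name RUNWAY_WORKLOAD_TYPE, the function name, and the RunwayService kind for existing services are also hypothetical, and a real implementation in runwayctl would be written in Go rather than Python.

```python
import os

# Assumed mapping from manifest kind to workload type.
KIND_TO_WORKLOAD = {"RunwayService": "service", "RunwayJob": "job"}
ENV_VAR = "RUNWAY_WORKLOAD_TYPE"  # hypothetical variable name


def resolve_workload_type(cli_value, manifest, env=None):
    """Pick the workload type from the CLI, the environment, or the manifest."""
    env = os.environ if env is None else env
    if cli_value:
        workload = cli_value
    elif env.get(ENV_VAR):
        workload = env[ENV_VAR]
    else:
        # Fall back to the kind declared in the parsed runway.yml.
        workload = KIND_TO_WORKLOAD.get(manifest.get("kind"))
    if workload not in KIND_TO_WORKLOAD.values():
        raise ValueError(f"cannot determine workload type (got {workload!r})")
    return workload
```

Reading kind from the manifest has the advantage that service owners declare the workload type once, while the argument and environment variable remain useful as explicit overrides in CI.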

On the reconciler’s end, we refactor service-related resources into a service module. A new job module would be added.

  • /reconciler
    • modules/
      • internal-loadbalancer/
      • external-loadbalancer/
      • service/ # refactored from reconciler
      • job/ # new module
    • main.tf

For the job workload, runwayctl would need a new command or subcommand to invoke RunJob, because the Terraform resource only creates the job and does not execute it.
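
To give a sense of what such a subcommand would send, the sketch below builds the URL and JSON body for the Cloud Run Admin API v2 jobs.run method, including optional environment variable overrides. It only constructs the request. Authentication and the HTTP call are omitted, the function name and example project ID are hypothetical, and the endpoint and field names should be checked against the Cloud Run Admin API reference. In practice runwayctl would more likely use the Go client library.

```python
import json

RUN_API = "https://run.googleapis.com/v2"


def build_run_job_request(project, region, job, env_overrides=None):
    """Return (url, json_body) for executing an existing Cloud Run job."""
    url = f"{RUN_API}/projects/{project}/locations/{region}/jobs/{job}:run"
    body = {}
    if env_overrides:
        # Per-execution overrides; the job definition itself is unchanged.
        body["overrides"] = {
            "containerOverrides": [
                {
                    "env": [
                        {"name": name, "value": value}
                        for name, value in sorted(env_overrides.items())
                    ]
                }
            ]
        }
    return url, json.dumps(body)


# Example with placeholder values (the project ID is made up).
url, body = build_run_job_request(
    "example-project", "us-central1", "topology-service-compactor", {"JOB": "compact"}
)
```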

Monitoring

The deltas of the completed_execution_count and completed_task_attempt_count metrics are useful for monitoring. They allow us to detect cron failures and notify service owners.
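
One way such a check could work is sketched below. It assumes the completed_execution_count deltas for a time window have already been fetched and grouped by a result label with values such as "succeeded" and "failed". Those label values, the function name, and the expected_runs parameter are assumptions for illustration and do not describe an existing Runway alert.

```python
def find_job_problems(completed_by_result, expected_runs):
    """Return human-readable problems for one job over one time window.

    completed_by_result maps a result label (assumed: "succeeded", "failed")
    to the number of executions that completed with that result.
    expected_runs is how many executions the schedule should have produced.
    """
    problems = []
    failed = completed_by_result.get("failed", 0)
    total = sum(completed_by_result.values())
    if failed > 0:
        problems.append(f"{failed} execution(s) failed")
    if total < expected_runs:
        # Fewer completions than scheduled runs suggests missed or stuck runs.
        problems.append(
            f"only {total} of {expected_runs} expected execution(s) completed"
        )
    return problems
```

For an hourly job evaluated over a 24-hour window, expected_runs would be 24, and an empty result list means there is nothing to notify the service owner about.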

Alternative Solutions

  1. Provision a service account for maintenance tasks and provide the credentials to service owners. This would allow them to build their own maintenance system.

    This would give service owners a higher degree of flexibility in running their own maintenance systems. However, this goes against Runway’s principles of building a PaaS for teams to deploy their workloads. Furthermore, this would require service owners to be well-versed in deploying such maintenance tasks.

  2. Allow service owners to specify arbitrary Terraform modules as add-ons, which they could then use to set up Cloud Run Jobs, databases, and anything else they wish.

    Instead of adding job functionality to Runway, we would give users the ability to extend the deployment process by hooking into Terraform. Like option (1) above, this gives service owners a high degree of flexibility. It also shares the same drawbacks: users would need to be well-versed in deploying such tasks, and the responsibility for maintenance would fall on them.

  3. Do nothing and declare services that require maintenance jobs to be out of scope for Runway.

    This may deter future users from onboarding services onto Runway. It also adds operational toil for existing Runway users, who would need to manage their maintenance jobs separately from their Runway services.