Runway Jobs
Summary
Runway currently only supports the deployment of stateless services. A popular feature request is the ability to run jobs (one-off or scheduled) on Runway. Stateful services may require maintenance tasks, such as compaction or pruning, that need to run periodically and should stay off the service's serving path.
The Cloud Run runtime supports such tasks via the Cloud Run Jobs feature. By building on this offering, Runway becomes a more viable option for teams deploying their workloads, as they will no longer be restricted to stateless services.
Some references:
- https://kubernetes.io/docs/concepts/workloads/controllers/job/
- https://cloud.google.com/run/docs/create-jobs
Motivation
Runway as a platform cannot fulfil several prospective use cases because it currently only supports stateless services.
For example, Topology Service needs to run jobs to apply schema changes, and the ai-framework group has raised jobs as an essential component still missing from Runway.
Goals
- To let service owners define periodically scheduled jobs in Runway.
- To let service owners define and trigger maintenance jobs in Runway. Maintenance jobs run as a separate process that does not handle user traffic.
Non-Goals
- Not to provide support for jobs as part of a service deployment pipeline.
  Rationale: this needlessly couples the job to the service deployment.
- Not to provide service owners a means to trigger jobs outside of a workflow.
  Rationale: the job pipeline should be triggered through the service project's CI pipeline.
- Not to provide support for background job processing.
  Rationale: this functionality can be supported through Cloud Tasks or Cloud Run services.
Proposal
Currently, we only support deploying a single type of workload via Runway, which we've colloquially referred to as Runway services. With this proposal, we are adding support for jobs, so we would like to introduce and encourage the use of the term "Runway workload" in place of "Runway service", as "service" is an overloaded term.
We define a Runway workload as a deployable object defined in inventory.yml. A workload has its own set of infrastructure resources for deployment, such as:
- deployment repository
- service accounts for the reconciler (prefixed with rcr-*) and for Cloud Run (prefixed with crun-*)
- artifact repository
- vault path
Each workload runs in a deployment project and is deployed independently through an upstream service project trigger.
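To illustrate per-workload resources, the Go sketch below derives the two service account IDs for a workload. It assumes, hypothetically, that the rcr- and crun- prefixes are followed directly by the workload ID; the real naming scheme may differ. It also checks the Google Cloud rule that service account IDs are 6 to 30 characters long, a limit that long workload names can exceed once a prefix is added.

```go
package main

import "fmt"

// Google Cloud service account IDs must be between 6 and 30 characters long.
const (
	minAccountIDLen = 6
	maxAccountIDLen = 30
)

// serviceAccountIDs derives the reconciler (rcr-) and Cloud Run (crun-)
// service account IDs for a workload. Appending the workload ID directly to
// the prefix is a hypothetical naming scheme used only for illustration.
func serviceAccountIDs(workloadID string) (reconciler, cloudRun string, err error) {
	reconciler = "rcr-" + workloadID
	cloudRun = "crun-" + workloadID
	for _, id := range []string{reconciler, cloudRun} {
		if len(id) < minAccountIDLen || len(id) > maxAccountIDLen {
			return "", "", fmt.Errorf("service account ID %q must be %d-%d characters",
				id, minAccountIDLen, maxAccountIDLen)
		}
	}
	return reconciler, cloudRun, nil
}

func main() {
	rcr, crun, err := serviceAccountIDs("example-job")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(rcr, crun) // rcr-example-job crun-example-job
}
```

Running such a check when a workload is added to the inventory, rather than when Terraform applies, would surface naming problems before any resources are provisioned.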
Currently, we only support a single Runway workload type (a service), but we are proposing to introduce a new workload type: job.
A job may be configured to run as either:
- on-demand: triggered by service owners via CI (see below)
- scheduled: triggered by Cloud Scheduler on a schedule defined in the manifest (see below)
Design and implementation details
Defining jobs in the service repository CI YAML
Each Runway job is defined as a single element in the include array. The runway_service_id would correspond to the deployment repository name.
For example, if a service owner wants to add 2 jobs, they can be defined as such:
Both of these workloads would need to be defined in the provisioner's inventory.yml file.
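Because every job in the include array must also exist in inventory.yml, a preflight check could compare the two. The Go sketch below assumes both files have already been parsed (Go's standard library has no YAML decoder) and models only the runway_service_id input; the IncludeEntry type and missingFromInventory function are hypothetical.

```go
package main

import "fmt"

// IncludeEntry is a hypothetical model of one element of the CI include array;
// only the runway_service_id input is represented.
type IncludeEntry struct {
	RunwayServiceID string
}

// missingFromInventory returns, in CI order, every runway_service_id that has
// no corresponding workload in the provisioner's inventory.yml.
func missingFromInventory(includes []IncludeEntry, inventory map[string]bool) []string {
	var missing []string
	for _, inc := range includes {
		if !inventory[inc.RunwayServiceID] {
			missing = append(missing, inc.RunwayServiceID)
		}
	}
	return missing
}

func main() {
	includes := []IncludeEntry{
		{RunwayServiceID: "example-compaction-job"},
		{RunwayServiceID: "example-pruning-job"},
	}
	// Workload IDs declared in inventory.yml (hypothetical values).
	inventory := map[string]bool{"example-compaction-job": true}
	fmt.Println(missingFromInventory(includes, inventory)) // [example-pruning-job]
}
```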
Introducing RunwayJob
We will introduce a new kind: RunwayJob. A runway.yml config will look like this:
The command and args fields would map to the container command and args of the google_cloud_run_v2_job resource.
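The Go sketch below uses a hypothetical manifest layout in which command, args and schedule sit under a spec key. It decodes the manifest as JSON (Go's standard library has no YAML decoder, and JSON is valid YAML), checks the kind, and maps command and args onto a container definition. Treating a job with a non-empty schedule as scheduled, and any other job as on-demand, is likewise an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RunwayJobManifest is a hypothetical layout for a RunwayJob runway.yml.
type RunwayJobManifest struct {
	Kind string  `json:"kind"`
	Spec JobSpec `json:"spec"`
}

// JobSpec holds the job-specific fields named in this proposal.
type JobSpec struct {
	Command  []string `json:"command"`
	Args     []string `json:"args"`
	Schedule string   `json:"schedule"` // cron expression in UTC; empty for on-demand jobs
}

// Container mirrors the command and args of a google_cloud_run_v2_job container.
type Container struct {
	Command []string
	Args    []string
}

// parseRunwayJob decodes a manifest and rejects any kind other than RunwayJob.
func parseRunwayJob(data []byte) (RunwayJobManifest, error) {
	var m RunwayJobManifest
	if err := json.Unmarshal(data, &m); err != nil {
		return RunwayJobManifest{}, err
	}
	if m.Kind != "RunwayJob" {
		return RunwayJobManifest{}, fmt.Errorf("unexpected kind %q", m.Kind)
	}
	return m, nil
}

// ToContainer returns the values to pass to the Cloud Run job's container.
func (m RunwayJobManifest) ToContainer() Container {
	return Container{Command: m.Spec.Command, Args: m.Spec.Args}
}

// Scheduled reports whether the job needs a Cloud Scheduler trigger.
func (m RunwayJobManifest) Scheduled() bool {
	return m.Spec.Schedule != ""
}

func main() {
	raw := []byte(`{"kind": "RunwayJob", "spec": {"command": ["/bin/compact"], "args": ["--verbose"], "schedule": "0 3 * * *"}}`)
	m, err := parseRunwayJob(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	c := m.ToContainer()
	fmt.Println(c.Command, c.Args, m.Scheduled()) // [/bin/compact] [--verbose] true
}
```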
The schedule field (in UTC) will be used in google_cloud_scheduler_job to trigger the Cloud Run job using http_target, as outlined in the Google Cloud guide. We will allow a single schedule per job. If we find that users would like to configure multiple schedules, we can iterate on it.
While the cron format is not the most human-readable, it is fairly common and most developers are familiar with it. Websites such as crontab.guru help with generating cron expressions.
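Since the schedule is a free-form string, runwayctl could reject obviously malformed expressions during preflight checks rather than at Terraform apply time. The Go sketch below is a structural check of the standard five-field format (minute, hour, day of month, month, day of week with 0 as Sunday). It supports *, single values, ranges, lists and /step suffixes, but deliberately rejects month and weekday names that Cloud Scheduler may accept, so it is stricter than the real parser. The validateCron name is hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// fieldBounds holds the allowed range of each cron field, in order: minute,
// hour, day of month, month and day of week (0 = Sunday).
var fieldBounds = [5][2]int{{0, 59}, {0, 23}, {1, 31}, {1, 12}, {0, 6}}

// validateCron structurally checks a five-field cron expression. It accepts
// "*", single values, ranges "a-b", comma-separated lists and "/step"
// suffixes, but not month or weekday names.
func validateCron(expr string) error {
	fields := strings.Fields(expr)
	if len(fields) != len(fieldBounds) {
		return fmt.Errorf("expected %d fields, got %d", len(fieldBounds), len(fields))
	}
	for i, field := range fields {
		for _, part := range strings.Split(field, ",") {
			if err := validatePart(part, fieldBounds[i][0], fieldBounds[i][1]); err != nil {
				return fmt.Errorf("field %d (%q): %w", i+1, field, err)
			}
		}
	}
	return nil
}

// validatePart checks one list element such as "*", "5", "1-5" or "*/15".
func validatePart(part string, lo, hi int) error {
	base, step, hasStep := strings.Cut(part, "/")
	if hasStep {
		if n, err := strconv.Atoi(step); err != nil || n < 1 {
			return fmt.Errorf("invalid step %q", step)
		}
	}
	if base == "*" {
		return nil
	}
	first, last, isRange := strings.Cut(base, "-")
	start, err := parseBounded(first, lo, hi)
	if err != nil {
		return err
	}
	if !isRange {
		return nil
	}
	end, err := parseBounded(last, lo, hi)
	if err != nil {
		return err
	}
	if start > end {
		return fmt.Errorf("range %q is reversed", base)
	}
	return nil
}

// parseBounded parses s as an integer within [lo, hi].
func parseBounded(s string, lo, hi int) (int, error) {
	n, err := strconv.Atoi(s)
	if err != nil || n < lo || n > hi {
		return 0, fmt.Errorf("value %q is not an integer in %d-%d", s, lo, hi)
	}
	return n, nil
}

func main() {
	for _, expr := range []string{"0 3 * * *", "*/15 9-17 * * 1-5", "61 * * * *"} {
		fmt.Printf("%q: %v\n", expr, validateCron(expr))
	}
}
```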
CI pipelines
Each workload type would run the following CI stages:
- Service: preflight checks -> deploy -> monitor
- Job: preflight checks -> deploy
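A pipeline generator could encode these stage lists per workload type, as in the Go sketch below; stagesFor is a hypothetical helper. Jobs presumably stop after deploy because deploying a job only creates it, while executions are started separately by CI or by the schedule, so there is no freshly serving revision to monitor at that point.

```go
package main

import (
	"fmt"
	"strings"
)

// stagesFor returns the CI stages this proposal lists for each workload type.
func stagesFor(workloadType string) ([]string, error) {
	switch workloadType {
	case "service":
		return []string{"preflight checks", "deploy", "monitor"}, nil
	case "job":
		// Deploying a job only creates it; executions are triggered separately.
		return []string{"preflight checks", "deploy"}, nil
	default:
		return nil, fmt.Errorf("unknown workload type %q", workloadType)
	}
}

func main() {
	for _, t := range []string{"service", "job"} {
		stages, err := stagesFor(t)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s: %s\n", t, strings.Join(stages, " -> "))
	}
}
```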
To trigger a one-off RunwayJob, users can define a CI job that extends the pre-defined .execute-job job, adding environment variables if required.
Reconciler changes
The workload information can be passed into runwayctl as an argument, read using os.Getenv, or read from the kind field in runway.yml.
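These three sources could be combined with a fixed precedence. The Go sketch below checks a hypothetical --workload-type flag first, then a hypothetical RUNWAY_WORKLOAD_TYPE environment variable read via os.Getenv, and finally falls back to the kind field of runway.yml. The precedence order, both names, and the RunwayService kind for existing service manifests are assumptions for illustration. Passing getenv in as a function keeps the logic testable without touching the real environment.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// resolveWorkloadType determines the workload type from, in order of precedence:
// a --workload-type flag, the RUNWAY_WORKLOAD_TYPE environment variable and the
// kind field of runway.yml. The flag, variable and kind names are hypothetical.
func resolveWorkloadType(args []string, getenv func(string) string, manifestKind string) (string, error) {
	fs := flag.NewFlagSet("runwayctl", flag.ContinueOnError)
	fromFlag := fs.String("workload-type", "", "workload type: service or job")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	if *fromFlag != "" {
		return *fromFlag, nil
	}
	if v := getenv("RUNWAY_WORKLOAD_TYPE"); v != "" {
		return v, nil
	}
	switch manifestKind {
	case "RunwayJob":
		return "job", nil
	case "RunwayService": // assumed kind for existing service manifests
		return "service", nil
	}
	return "", fmt.Errorf("cannot determine workload type from kind %q", manifestKind)
}

func main() {
	// In runwayctl, manifestKind would come from the parsed runway.yml.
	workloadType, err := resolveWorkloadType(os.Args[1:], os.Getenv, "RunwayJob")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("workload type:", workloadType)
}
```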
On the reconciler's end, we will refactor service-related resources into a service module and add a new job module:
- reconciler/
  - modules/
    - internal-loadbalancer/
      - …
    - external-loadbalancer/
      - …
    - service/ # refactored from reconciler
      - …
    - job/ # new module
      - …
  - main.tf
For the job workload, runwayctl would need a new command or subcommand to invoke RunJob, as the Terraform resource only creates the job and does not execute it.
Monitoring
Cloud Run exposes the completed_execution_count and completed_task_attempt_count metrics, whose deltas are useful for monitoring. They allow us to detect cron failures and notify service owners.
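The alerting decision for one evaluation window could look like the Go sketch below. It assumes the deltas have already been fetched from Cloud Monitoring and split by the metrics' result label, whose succeeded and failed values are assumptions here; the WindowDeltas type, the evaluate function and the alert levels are hypothetical. Failed task attempts only produce a warning because retries may still let the execution succeed, whereas a failed execution, or no successful execution in a window where the schedule should have fired, is treated as a cron failure.

```go
package main

import "fmt"

// WindowDeltas holds metric deltas for one job over one evaluation window.
type WindowDeltas struct {
	ExecutionsSucceeded int64 // completed_execution_count, result "succeeded" (assumed label value)
	ExecutionsFailed    int64 // completed_execution_count, result "failed" (assumed label value)
	TaskAttemptsFailed  int64 // completed_task_attempt_count, result "failed" (assumed label value)
}

// evaluate returns an alert level ("alert", "warn" or "ok") and a reason.
// expectRun reports whether the job's schedule should have fired in the window.
func evaluate(d WindowDeltas, expectRun bool) (string, string) {
	switch {
	case d.ExecutionsFailed > 0:
		return "alert", fmt.Sprintf("%d execution(s) failed", d.ExecutionsFailed)
	case expectRun && d.ExecutionsSucceeded == 0:
		return "alert", "no successful execution although the schedule should have fired"
	case d.TaskAttemptsFailed > 0:
		return "warn", fmt.Sprintf("%d task attempt(s) failed and may have been retried", d.TaskAttemptsFailed)
	default:
		return "ok", ""
	}
}

func main() {
	level, reason := evaluate(WindowDeltas{ExecutionsSucceeded: 1, TaskAttemptsFailed: 2}, true)
	fmt.Println(level, reason) // warn 2 task attempt(s) failed and may have been retried
}
```

The evaluation window would need to be longer than the job's expected run time; otherwise an execution that is still in progress would be reported as a missed run.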
Alternative Solutions
1. Provision a service account for maintenance tasks and provide the credentials to service owners, allowing them to build their own maintenance systems.
   This would give service owners a higher degree of flexibility in running their own maintenance systems. However, it goes against Runway's principle of building a PaaS for teams to deploy their workloads. Furthermore, it would require service owners to be well-versed in deploying such maintenance tasks.
2. Allow service owners to specify arbitrary Terraform modules as add-ons, which they could then use to set up Cloud Run Jobs, databases, and anything else they wish.
   Instead of adding job functionality to Runway, we would give users the ability to extend the deployment process by hooking into Terraform. This provides service owners with a high degree of flexibility, like option (1) above. Likewise, it shares similar drawbacks: it requires users to be well-versed in deploying such tasks and puts the responsibility of maintenance on them.
3. Do nothing and declare services that require maintenance jobs to be out of scope for Runway.
   This may deter future users from onboarding services onto Runway. It also adds operational toil for existing Runway users, who need to manage their maintenance jobs separately from their Runway services.