Skip to content

Cloud SQL for PostgreSQL

Summary

Runway currently supports stateless services and stateful services with a cache (GCP Memorystore for Redis). Candidate services have requested support for stateful services with a persistent database. While stateless services were ideal for the first iteration, limitations are now preventing new services from onboarding and existing services from increasing adoption.

As part of strategy to support stateful services, the next logical iteration is application database instances. Propose database instance capabilities using GCP Cloud SQL for PostgreSQL to allow service owners to self-serve fully-managed cloud infrastructure.

Motivation

The importance and urgency for PostgreSQL features is to remove key blockers for candidate services. Most notably, CustomersDot (note that Cloud SQL is only one of the few blocking requirements) and openbao.

Many competing PaaS offerings include features for PostgreSQL instances. As a result, Runway has opportunities to provide PostgreSQL instances as part of the workloads deployed using Runway.

Goals

  • Service owners can self-serve Cloud SQL for PostgreSQL instances through the provisioner
  • Service owners can securely connect Runway services to existing Cloud SQL instances
  • Compatibility with GCP Cloud Run Services
  • Runway Workloads deployed to multiple regions must be able to connect to Cloud SQL instances
  • Compatibility with GCP Cloud Run Jobs
  • Compatibility with GCP Google Kubernetes Engine (note: stretch goal, in progress)

Non-Goals

  • Support read replicas (reason: most potential customers and experimental services are not at the scale to need read replica)
  • Support GCP Cloud SQL for MySQL. GitLab uses PostgreSQL as it’s OLTP database. Runway offers only PostgreSQL as the paved-path.
  • Support disaster recovery against regional failures (reason: cross-region replica is overly complex and not used even for patroni-main)
  • Support AWS RDS (reason: not for the first iteration of the blueprint)
  • Support exporting database dumps to object storage (reason: rely on GCP Cloud SQL backups and restore as the paved road for starters, to re-evaluate in the future if there are strong use cases)
  • Long-term support for external Cloud SQL databases not managed by Runway. (reason: we will only support Runway workload’s secure connectivity with such instances to facilitate migrations into Runway-managed Cloud SQL instances)

Proposal

Propose managing the entire lifecycle of PostgreSQL instance capabilities, so service owners can self-serve using Runway.

Postgres architecture

Provision an instance

Service owners must be able to create and manage database instance. In Provisioner, inventory.yml allows users to configure their PostgreSQL instances which is used internally by Provisioner to manage the resources.

Configure an instance

Runway will be responsible for exposing configurable options in Provisioner. Configuration options will be cloud-agnostic and PostgreSQL-compaitble. Such options include instance type, disk size, additional databases, additional users.

Service owners must be able to configure database instances. Service owners should not necessarily know that their Runway service is running on CloudSQL, only that it is PostgreSQL compatible.

Connect to an instance

Service owners must be able to connect to database instance from their service. In Provisioner, database instance connection name is written into the service’s path using Vault, e.g. runway/reconciler/<service>/datastores/<env>/<region>. In Reconciler, make database instance credentials accessible to service using secrets available as environment variables.

For the initial offering of Runway-managed Cloud SQL, only built-in authentication is available. Secret rotation will only be carried out in the event of security incidents. Service owners will have to wait until IAM authentication is available after Runway Platform V2 is up which will enable us to embrace the password-less authentication approach.

Service owners must be able to connect to multiple instances from a service, which is a common scenario when decomposing instances.

Secure an instance

Service owners must be able to securely connect only to instances available in their service’s secrets management. In Provisioner, update service accounts to use predefined roles offered by GCP Cloud SQL.

Service owners must the Cloud SQL Auth Proxy which is the preferred way of securely connecting to a Cloud SQL instance. This will be enabled by default for all services and cannot be opted out.

When connecting to an unmanaged Cloud SQL instance (i.e., not provisioned by Runway), a Private Service Connect endpoint must be used to access the Cloud SQL instance in the external project’s VPC network.

Monitor an instance

Service owners must be able to monitor database instances. Runway will use runway_exporter for Stackdriver’s Cloud SQL metrics. In Runbooks, use monitoring and capacity planning for GCP Cloud SQL for PostgreSQL.

In the future (when the GKE cluster supports metric exporters), Runway will create new Runway postgres_exporter for each Runway database instance to scrape metrics using Prometheus.

Deprovision an instance

Service owners must be able to disconnect database instances from their service. Service owners must be able to destroy database instances.

Instances will have deletion protection enabled by default. Deletions will require a change management issue to be prepared and reviewed by SREs before deletion can happen.

Pricing for an instance

Service owners must be able to be attribute costs of database instances. In Provisioner, attach standard resource labels to Cloud SQL instances using GitLab Infrastructure standards.

Historically, Runway’s operating cost has been negligible due to GCP Cloud Run pricing. By introducing support for database instance capabilities, this is no longer the case. For contrast, an enterprise plus Cloud SQL instance on a db-perf-optimized-N-2 machine with HA and Data Cache enabled in us-central1 would cost ~$520/month (calculated using the GCP calculator).

Pricing is based on components for tier, capacity, region, and replicas. When accessing a Cloud SQL instance from a Cloud Run Service client in a different region, GCP charges you for network egress traffic from Cloud SQL instances to your client application for total GB transferred from one region to the other.

The xsmall and small machine types should be used for development and testing, and medium and above machine types instances should be used for GA features.

Backup for an instance

Runway must configure Cloud SQL instances to maintain regular backup and allow Service owners to override the default backup schedule.

Restoring an instance

Runway will be responsible for monitoring backups and providing tooling/documentation for performing restores. Runway will also periodically test backups by restoring to another pre-provisioned instance like how CustomerDot Cloud SQL is periodically verified with a scheduled pipeline.

Service owners need to configure the periodic backup testing by providing information such as table name, the table’s owner name, query string, etc.

Service owners must be able to restore their database instance in the case of an outage or an accidental deletion. They may require the help of SREs if they do not have sufficient permissions.

Performing database schema and data migration

Service owners must be able to use Runway Jobs to trigger database schema and data migrations. Runway Jobs (Cloud Run Jobs) must be able to connect to the Cloud SQL instance.

Design and implementation details

Fully-managed

As a fully-managed solution in GCP, scalability, availability, and maintenance are considered features.

Scalability

Enterprise instances can be vertically scaled up to a maximum of 96 vCPU and 638976 MB memory in the case of a db-custom-96-638976 machine. Enterprise Plus instances can be vertically scaled up to a maximum 128 vCPU and 864 GB memory for db-perf-optimized-N-128 machine type.

Based on self-managed Patroni instances for GitLab.com, CPU utilization saturation resource has been primary bottleneck. GCP Cloud SQL instances can be configured to automatically increase storage.

Availability

The Enterprise tier offers 99.95% uptime SLA and <30s maintenance downtimes. The Enterprise Plus tier offers 99.99% uptime SLA and near zero downtimes for several planned operations.

High availability can be configured by setting the availability_type as REGIONAL so that a primary instance and a standby instance are created in separate zones.

Maintenance

Instances can enable maintenance policy that is routinely scheduled.

Runway will be responsible for defining a maintenance window for all services by default. Service owners will be responsible for optionally overriding default maintenance policy depending on preference.

Runway will be responsible for using capacity planning process and maintenance notifications annotations to ensure system memory utilization is rightsized for maintenance windows depending on workload instance traffic.

Service owners will be responsible for exponential backoff to handle client re-connections after maintenance failover in non-LabKit supported programming languages. Runway will be responsible for implementing PostgresSQL client functionality in LabKit supported programming languages, on a just-in-time case-by-case basis.

Cloud SQL will upgrade the minor version of the database during maintenance windows. New versions are supported within 30 days of general availability. CloudSQL doesn’t automatically upgrade to a newer major version. Service owners should be able to perform In-Place Major Version upgrades by changing the version of their instance using a merge request on provisioner. SRE/DBRE support may be required. Important to notice that CloudSQL Major Version Upgrades require downtime.

Disaster Recovery

Unlike Redis, PostgreSQL database is likely to be storing data to be persisted. Runway-managed Cloud SQL instances will be configured with default backup configuration and point-in-time-recovery enabled. Deletion protection must be enabled for all database instances.

A runbook guide for restoring an instance must be available for Runway platform owners and SREs to use in the event of an outage. For planned high-risk CRs, steps for manual PITR snapshots should be available in the runbook guide and be referenced in the change issue.

Data Model

Workload is a workload in GCP Cloud Run (e.g. service, job). Instance is a PostgreSQL instance of GCP Cloud SQL for PostgreSQL. Workload must be able to connect to multiple Instances (e.g. like GitLab with main/ci/sec/embedding). Instance must be able to be provisioned and configured to one or more environments (e.g. staging, production).

To limit initial complexity, we only permit either a 1:1 workload-to-database or n:1 workloads-to-database relationships to allow for scenarios where a service read/writes to a database and a job handles schema migrations.

1:n workload-to-databases relationships will not be allowed initially to limit the maintenance toil on the Runway platform. Service owners should instead create additional logical databases within a physical instance. Database decomposition can be requested when there are actual scalability concerns.

Terraform

In Provisioner, Runway must manage IaC for google_sql_database_instance resource.

Infrastructure has an existing Terraform module to provision GCP Cloud SQL instances. Not all Cloud SQL instances in config-mgmt are using this module, with some instances like sentry, engineering productivity and ops using the Google Cloud SQL Terraform Module. Right now, the GitLab-managed module is based on an old fork so it does not make sense to extend it any further. The terraform-google-sql-db project also supports terraform modules for restore and backup.

JSON Schema

In Provisioner, Runway must include JSON Schema for database instances.

Below is an illustrative examples of provisioning and configuring an instance.

postgres_instances:
- name: runway-db-example
identifier: EXAMPLE
provider: GCP
db_version: "POSTGRES_16"
region: us-east1
zone: us-east1-b
database_configuration:
additional_databases:
- name: runway
additional_users:
- name: job
env_configuration:
staging:
instance_type: small
disk_size: 5
backup_configuration:
transaction_log_retention_days: 2
retained_backups_count: 3
production:
instance_type: medium # xsmall/small/medium/large/xlarge/2xlarge/3xlarge
disk_size: 10
backup_configuration:
transaction_log_retention_days: 2
retained_backups_count: 3
labels:
owner_email_handle: scalability-team
department: eng-infra
department_group: eng-infra-scalability
product_category: scalability

Additionally, service inventory will be extended to connect a service to instances. Here’s an illustrative example:

---
inventory:
- name: cloudrun-service
project_id: 123456789
postgres_instances:
- cloudrun-db
postgres_instances:
- name: cloudrun-db
provider: GCP
...

As the examples demonstrate, YAML Schema should be flexible enough to support provisioning and configuring any supported attributes, regardless of instance type.

On the Reconciler, the runway*.yml needs to be configured using spec.cloud_providers.gcp.cloudsql_instances like below:

# https://gitlab-com.gitlab.io/gl-infra/platform/runway/runwayctl/manifest.schema.html
apiVersion: runway/v1
kind: RunwayService # or RunwayJob
metadata:
# omitted
spec:
cloud_providers:
gcp:
cloudsql_instances:
- psc_enabled: false # Runway-managed Cloud SQL instance
instance_connection_name: gitlab-runway-staging:us-central1:runway-db-example
- psc_enabled: true # unmanaged Cloud SQL instance
instance_connection_name: gitlab-staging:us-central1:runway-db-external

Integration

Vault

In Provisioner, Runway must store database instance credentials, so it can be accessed by Reconciler.

Runway currently uses Vault for secrets management, which is workload and runtime agnostic. As a result, Vault will be the mechanism to connect to an instance.

Secrets for an instance will be stored under the following:

MountPath
runwayreconciler/$WORKLOAD/datastores/$ENV/$REGION/

Fields for an instance will include the following:

FieldDescription
Instance Connection NameThe connection name of the master instance to be used in connection strings.
DatabaseThe name of the PG database to connect to.

As part of a paved road approach, a default user, password, and database is created for each new Cloud SQL instances.

The service owner experience should be very similar to competing SaaS offerings, e.g. Heroku Redis add-ons. Secrets will be prefixed with RUNWAY_DB_$IDENTIFIER_<DEFAULT/additional identifier>_$FIELD, e.g. RUNWAY_DB_TEMP_DEFAULT_USER. Additional databases will be RUNWAY_DB_TEMP_YYY_DB_NAME where YYY is the database name. Additional users will be stored as RUNWAY_DB_TEMP_XXX_USER and RUNWAY_DB_TEMP_XXX_PWD where XXX is the username.

Workloads connected to external Cloud SQL instances can configure these fields using the env-$ENVIRONMENT.yml file for non-sensitive configurations and vault path env/$ENV/service/$WORKLOAD/ for secrets.

As Cloud Run instance connections to managed Cloud SQL instances will use IAM authentication, no passwords will be read by the Runway workload.

Cloud SQL Auth Proxy

When creating a GCP Cloud SQL instance, Runway will enable connection enforcement by default and cannot be opted out.

Both Cloud Run workloads and GKE applications can connect to Cloud SQL instances through an auth proxy sidecar.

VPC

To connect to GCP Cloud SQL from GCP Cloud Run service/job, Runway must prepare direct vpc egress. We can re-use prior art from Cloud Run connection to Memorystore instances as Cloud Run has vpc access configured. By configuring the Runway manifest for the Runway workloads with Cloud SQL instance connection details, the vpc access will be enabled automatically.

Note on future-proofing: for GKE, when connecting using private IP, the cluster VPC-native and peered with the same Virtual Private Cloud (VPC) network as the Cloud SQL instance.

IAM

In Provisioner, Cloud SQL Admin and Compute Viewer roles are required for managing instances and GCP Cloud Run service accounts should be updated with the roles/cloudsql.client role for the project containing the GCP Cloud SQL instance.

For Runway workloads with external Cloud SQL instances, the crun-$WORKLOAD@gitlab-runway-$ENV.iam.gserviceaccount.com service account must be granted the roles/cloudsql.client and roles/cloudsql.instanceUser role in the GCP project where the Cloud SQL instance resides.

Metrics

Runway must offer SLIs, SLOs, dashboards, and saturation montioring for Cloud SQL instances. In Runbooks, introduce a Runway Database archetype for Runway Cloud SQL instances.

Like with Runway Redis, we will scrape Stackdriver metrics using runway exporter until the GKE cluster is up with metric exporters.

Constraints

In addition to non-goals, the following is unsupported:

  • GCP Cloud SQL data migration
  • GCP Cloud SQL detailed performance analysis (e.g. eBPF probes, tcpdump traffic analysis, etc)

Alternative Solutions

  1. Directly connect to Cloud SQL without the auth proxy.
  2. Take a similar approach as Auto Devops by using a single DATABASE_URL string.
  3. Deploy postgres_exporter instead of the interim workaround to use stackdriver metrics.