Cloud SQL for PostgreSQL
Summary
Runway currently supports stateless services and stateful services with a cache (GCP Memorystore for Redis). Candidate services have requested support for stateful services with a persistent database. While stateless services were ideal for the first iteration, limitations are now preventing new services from onboarding and existing services from increasing adoption.
As part of strategy to support stateful services, the next logical iteration is application database instances. Propose database instance capabilities using GCP Cloud SQL for PostgreSQL to allow service owners to self-serve fully-managed cloud infrastructure.
Motivation
The importance and urgency for PostgreSQL features is to remove key blockers for candidate services. Most notably, CustomersDot (note that Cloud SQL is only one of the few blocking requirements) and openbao.
Many competing PaaS offerings include features for PostgreSQL instances. As a result, Runway has opportunities to provide PostgreSQL instances as part of the workloads deployed using Runway.
Goals
- Service owners can self-serve Cloud SQL for PostgreSQL instances through the provisioner
- Service owners can securely connect Runway services to existing Cloud SQL instances
- Compatibility with GCP Cloud Run Services
- Runway Workloads deployed to multiple regions must be able to connect to Cloud SQL instances
- Compatibility with GCP Cloud Run Jobs
- Compatibility with GCP Google Kubernetes Engine (note: stretch goal, in progress)
Non-Goals
- Support read replicas (reason: most potential customers and experimental services are not at the scale to need read replica)
- Support GCP Cloud SQL for MySQL. GitLab uses PostgreSQL as it’s OLTP database. Runway offers only PostgreSQL as the paved-path.
- Support disaster recovery against regional failures (reason: cross-region replica is overly complex and not used even for patroni-main)
- Support AWS RDS (reason: not for the first iteration of the blueprint)
- Support exporting database dumps to object storage (reason: rely on GCP Cloud SQL backups and restore as the paved road for starters, to re-evaluate in the future if there are strong use cases)
- Long-term support for external Cloud SQL databases not managed by Runway. (reason: we will only support Runway workload’s secure connectivity with such instances to facilitate migrations into Runway-managed Cloud SQL instances)
Proposal
Propose managing the entire lifecycle of PostgreSQL instance capabilities, so service owners can self-serve using Runway.
Provision an instance
Service owners must be able to create and manage database instance. In Provisioner, inventory.yml
allows users to configure
their PostgreSQL instances which is used internally by Provisioner to manage the resources.
Configure an instance
Runway will be responsible for exposing configurable options in Provisioner. Configuration options will be cloud-agnostic and PostgreSQL-compaitble. Such options include instance type, disk size, additional databases, additional users.
Service owners must be able to configure database instances. Service owners should not necessarily know that their Runway service is running on CloudSQL, only that it is PostgreSQL compatible.
Connect to an instance
Service owners must be able to connect to database instance from their service. In Provisioner, database instance
connection name is written into the service’s path using Vault, e.g. runway/reconciler/<service>/datastores/<env>/<region>
. In
Reconciler, make database instance credentials accessible to service using
secrets available as environment variables.
For the initial offering of Runway-managed Cloud SQL, only built-in authentication is available. Secret rotation will only be carried out in the event of security incidents. Service owners will have to wait until IAM authentication is available after Runway Platform V2 is up which will enable us to embrace the password-less authentication approach.
Service owners must be able to connect to multiple instances from a service, which is a common scenario when decomposing instances.
Secure an instance
Service owners must be able to securely connect only to instances available in their service’s secrets management. In Provisioner, update service accounts to use predefined roles offered by GCP Cloud SQL.
Service owners must the Cloud SQL Auth Proxy which is the preferred way of securely connecting to a Cloud SQL instance. This will be enabled by default for all services and cannot be opted out.
When connecting to an unmanaged Cloud SQL instance (i.e., not provisioned by Runway), a Private Service Connect endpoint must be used to access the Cloud SQL instance in the external project’s VPC network.
Monitor an instance
Service owners must be able to monitor database instances. Runway will use runway_exporter for Stackdriver’s Cloud SQL metrics. In Runbooks, use monitoring and capacity planning for GCP Cloud SQL for PostgreSQL.
In the future (when the GKE cluster supports metric exporters), Runway will create new Runway postgres_exporter
for each Runway database instance to scrape metrics using Prometheus.
Deprovision an instance
Service owners must be able to disconnect database instances from their service. Service owners must be able to destroy database instances.
Instances will have deletion protection enabled by default. Deletions will require a change management issue to be prepared and reviewed by SREs before deletion can happen.
Pricing for an instance
Service owners must be able to be attribute costs of database instances. In Provisioner, attach standard resource labels to Cloud SQL instances using GitLab Infrastructure standards.
Historically, Runway’s operating cost has been negligible due to GCP Cloud Run pricing. By introducing support for database instance capabilities, this is no longer the case. For contrast, an enterprise plus Cloud SQL instance on a db-perf-optimized-N-2
machine with HA and Data Cache enabled in us-central1
would cost ~$520/month (calculated using the GCP calculator).
Pricing is based on components for tier, capacity, region, and replicas. When accessing a Cloud SQL instance from a Cloud Run Service client in a different region, GCP charges you for network egress traffic from Cloud SQL instances to your client application for total GB transferred from one region to the other.
The xsmall
and small
machine types should be used for development and testing, and medium
and above machine types instances should be used for GA features.
Backup for an instance
Runway must configure Cloud SQL instances to maintain regular backup and allow Service owners to override the default backup schedule.
Restoring an instance
Runway will be responsible for monitoring backups and providing tooling/documentation for performing restores. Runway will also periodically test backups by restoring to another pre-provisioned instance like how CustomerDot Cloud SQL is periodically verified with a scheduled pipeline.
Service owners need to configure the periodic backup testing by providing information such as table name, the table’s owner name, query string, etc.
Service owners must be able to restore their database instance in the case of an outage or an accidental deletion. They may require the help of SREs if they do not have sufficient permissions.
Performing database schema and data migration
Service owners must be able to use Runway Jobs to trigger database schema and data migrations. Runway Jobs (Cloud Run Jobs) must be able to connect to the Cloud SQL instance.
Design and implementation details
Fully-managed
As a fully-managed solution in GCP, scalability, availability, and maintenance are considered features.
Scalability
Enterprise instances can be vertically scaled up to a maximum of 96 vCPU and 638976 MB memory in the case of a db-custom-96-638976
machine. Enterprise Plus instances can be
vertically scaled up to a maximum 128 vCPU and 864 GB memory for db-perf-optimized-N-128
machine type.
Based on self-managed Patroni instances for GitLab.com, CPU utilization saturation resource has been primary bottleneck. GCP Cloud SQL instances can be configured to automatically increase storage.
Availability
The Enterprise
tier offers 99.95% uptime SLA and <30s
maintenance downtimes.
The Enterprise Plus
tier offers 99.99% uptime SLA and near zero downtimes for
several planned operations.
High availability can be configured by setting the availability_type
as REGIONAL
so that a primary instance and a standby instance are created in separate zones.
Maintenance
Instances can enable maintenance policy that is routinely scheduled.
Runway will be responsible for defining a maintenance window for all services by default. Service owners will be responsible for optionally overriding default maintenance policy depending on preference.
Runway will be responsible for using capacity planning process and maintenance notifications annotations to ensure system memory utilization is rightsized for maintenance windows depending on workload instance traffic.
Service owners will be responsible for exponential backoff to handle client re-connections after maintenance failover in non-LabKit supported programming languages. Runway will be responsible for implementing PostgresSQL client functionality in LabKit supported programming languages, on a just-in-time case-by-case basis.
Cloud SQL will upgrade the minor version of the database during maintenance windows. New versions are supported within 30 days of general availability.
CloudSQL doesn’t automatically upgrade to a newer major version. Service owners should be able to perform In-Place Major Version upgrades by changing the version
of their instance using a merge request on provisioner. SRE/DBRE support may be required. Important to notice that CloudSQL Major Version Upgrades require downtime.
Disaster Recovery
Unlike Redis, PostgreSQL database is likely to be storing data to be persisted. Runway-managed Cloud SQL instances will be configured with default backup configuration and point-in-time-recovery enabled. Deletion protection must be enabled for all database instances.
A runbook guide for restoring an instance must be available for Runway platform owners and SREs to use in the event of an outage. For planned high-risk CRs, steps for manual PITR snapshots should be available in the runbook guide and be referenced in the change issue.
Data Model
Workload
is a workload in GCP Cloud Run (e.g. service, job). Instance
is a PostgreSQL instance of GCP Cloud SQL for PostgreSQL. Workload
must be able to connect to multiple Instance
s (e.g. like GitLab with main/ci/sec/embedding). Instance
must be able to be provisioned and configured to one or more environments (e.g. staging
, production
).
To limit initial complexity, we only permit either a 1:1 workload-to-database or n:1 workloads-to-database relationships to allow for scenarios where a service read/writes to a database and a job handles schema migrations.
1:n workload-to-databases relationships will not be allowed initially to limit the maintenance toil on the Runway platform. Service owners should instead create additional logical databases within a physical instance. Database decomposition can be requested when there are actual scalability concerns.
Terraform
In Provisioner, Runway must manage IaC for google_sql_database_instance
resource.
Infrastructure has an existing Terraform module to provision GCP Cloud SQL instances.
Not all Cloud SQL instances in config-mgmt
are using this module, with some instances like sentry, engineering productivity and ops using the
Google Cloud SQL Terraform Module. Right now, the GitLab-managed module is
based on an old fork so it does not make sense to extend it any further. The terraform-google-sql-db
project also supports terraform modules for
restore and
backup.
JSON Schema
In Provisioner, Runway must include JSON Schema for database instances.
Below is an illustrative examples of provisioning and configuring an instance.
postgres_instances: - name: runway-db-example identifier: EXAMPLE provider: GCP db_version: "POSTGRES_16" region: us-east1 zone: us-east1-b database_configuration: additional_databases: - name: runway additional_users: - name: job env_configuration: staging: instance_type: small disk_size: 5 backup_configuration: transaction_log_retention_days: 2 retained_backups_count: 3 production: instance_type: medium # xsmall/small/medium/large/xlarge/2xlarge/3xlarge disk_size: 10 backup_configuration: transaction_log_retention_days: 2 retained_backups_count: 3 labels: owner_email_handle: scalability-team department: eng-infra department_group: eng-infra-scalability product_category: scalability
Additionally, service inventory will be extended to connect a service to instances. Here’s an illustrative example:
---inventory:- name: cloudrun-service project_id: 123456789 postgres_instances: - cloudrun-db
postgres_instances:- name: cloudrun-db provider: GCP ...
As the examples demonstrate, YAML Schema should be flexible enough to support provisioning and configuring any supported attributes, regardless of instance type.
On the Reconciler, the runway*.yml
needs to be configured using spec.cloud_providers.gcp.cloudsql_instances
like below:
# https://gitlab-com.gitlab.io/gl-infra/platform/runway/runwayctl/manifest.schema.htmlapiVersion: runway/v1kind: RunwayService # or RunwayJobmetadata: # omittedspec: cloud_providers: gcp: cloudsql_instances: - psc_enabled: false # Runway-managed Cloud SQL instance instance_connection_name: gitlab-runway-staging:us-central1:runway-db-example - psc_enabled: true # unmanaged Cloud SQL instance instance_connection_name: gitlab-staging:us-central1:runway-db-external
Integration
Vault
In Provisioner, Runway must store database instance credentials, so it can be accessed by Reconciler.
Runway currently uses Vault for secrets management, which is workload and runtime agnostic. As a result, Vault will be the mechanism to connect to an instance.
Secrets for an instance will be stored under the following:
Mount | Path |
---|---|
runway | reconciler/$WORKLOAD/datastores/$ENV/$REGION/ |
Fields for an instance will include the following:
Field | Description |
---|---|
Instance Connection Name | The connection name of the master instance to be used in connection strings. |
Database | The name of the PG database to connect to. |
As part of a paved road approach, a default user, password, and database is created for each new Cloud SQL instances.
The service owner experience should be very similar to competing SaaS offerings, e.g. Heroku Redis add-ons. Secrets will be prefixed with RUNWAY_DB_$IDENTIFIER_<DEFAULT/additional identifier>_$FIELD
, e.g. RUNWAY_DB_TEMP_DEFAULT_USER
. Additional databases will be RUNWAY_DB_TEMP_YYY_DB_NAME
where YYY
is the database name. Additional users will be stored as RUNWAY_DB_TEMP_XXX_USER
and RUNWAY_DB_TEMP_XXX_PWD
where XXX
is the username.
Workloads connected to external Cloud SQL instances can configure these fields using the env-$ENVIRONMENT.yml
file for non-sensitive configurations and vault path env/$ENV/service/$WORKLOAD/
for secrets.
As Cloud Run instance connections to managed Cloud SQL instances will use IAM authentication, no passwords will be read by the Runway workload.
Cloud SQL Auth Proxy
When creating a GCP Cloud SQL instance, Runway will enable connection enforcement by default and cannot be opted out.
Both Cloud Run workloads and GKE applications can connect to Cloud SQL instances through an auth proxy sidecar.
VPC
To connect to GCP Cloud SQL from GCP Cloud Run service/job, Runway must prepare direct vpc egress. We can re-use prior art from Cloud Run connection to Memorystore instances as Cloud Run has vpc access configured. By configuring the Runway manifest for the Runway workloads with Cloud SQL instance connection details, the vpc access will be enabled automatically.
Note on future-proofing: for GKE, when connecting using private IP, the cluster VPC-native and peered with the same Virtual Private Cloud (VPC) network as the Cloud SQL instance.
IAM
In Provisioner, Cloud SQL Admin and Compute Viewer roles are required for managing instances and GCP Cloud Run service accounts should be updated with the roles/cloudsql.client
role for the project containing the GCP Cloud SQL instance.
For Runway workloads with external Cloud SQL instances, the crun-$WORKLOAD@gitlab-runway-$ENV.iam.gserviceaccount.com
service account must be granted the roles/cloudsql.client
and roles/cloudsql.instanceUser
role in the GCP project where the Cloud SQL instance resides.
Metrics
Runway must offer SLIs, SLOs, dashboards, and saturation montioring for Cloud SQL instances. In Runbooks, introduce a Runway Database archetype for Runway Cloud SQL instances.
Like with Runway Redis, we will scrape Stackdriver metrics using runway exporter until the GKE cluster is up with metric exporters.
Constraints
In addition to non-goals, the following is unsupported:
- GCP Cloud SQL data migration
- GCP Cloud SQL detailed performance analysis (e.g. eBPF probes, tcpdump traffic analysis, etc)
Alternative Solutions
- Directly connect to Cloud SQL without the auth proxy.
- Take a similar approach as Auto Devops by using a single
DATABASE_URL
string. - Deploy
postgres_exporter
instead of the interim workaround to use stackdriver metrics.