Memorystore for Redis
Summary
Runway currently supports stateless services. Candidate services have requested support for stateful services. While stateless services were ideal for the first iteration, limitations are now preventing new services from onboarding and existing services from increasing adoption.
As part of the strategy to support stateful services, the next logical iteration is application caching instances. This blueprint proposes caching instance capabilities using GCP Memorystore for Redis so service owners can self-serve fully-managed cloud infrastructure.
Motivation
Redis features such as caching and session management are important and urgent because they remove key blockers for candidate services, most notably CustomersDot and AI Gateway.
Many competing PaaS offerings include features for Redis instances. As a result, Runway has opportunities to provide Redis instances as part of the workloads deployed using Runway.
Goals
- Service owners can self-serve Redis caching instances
- Support GCP Memorystore for Redis
- Compatibility with GCP Cloud Run Services
- Compatibility with GCP Cloud Run Services Multi-Region
- Compatibility with GCP Cloud Run Jobs
- Compatibility with GCP Google Kubernetes Engine (note: stretch goal, in progress)
Non-Goals
- Service owners can self-serve Redis persistence instances (reason: no specific requirement, still feasible w/ proposal). As a result, Redis instances are not intended to withstand data loss. When persistence instances are eventually supported, DR/RTO guarantees will be made
- Support GCP Memorystore for Redis Cluster (reason: too complex for first iteration, refer to alternative solutions section)
- Support GCP Memorystore for Memcached (reason: not standardized at GitLab)
Proposal
This proposal covers managing the entire lifecycle of Redis caching instance capabilities, so service owners can self-serve using Runway.
Provision an instance
Service owners must be able to create and manage caching instances. In Provisioner, inventory is updated to support provisioning a GCP Memorystore for Redis instance using Terraform, e.g. google_redis_instance.
Configure an instance
Service owners must be able to configure caching instances. In Provisioner, support configuration for managing both Redis instance (e.g. memory size capacity) and Redis configuration (e.g. maxmemory policy, parameters).
Connect to an instance
Service owners must be able to connect to a caching instance from their service. In Provisioner, store the caching instance host and credentials under the service's path in Vault, e.g. runway/env/staging/example-service. In Reconciler, make caching instance credentials accessible to the service using GCP Cloud Run secrets management.
Service owners must be able to connect to multiple instances from a service, which is a common scenario for caching and persistence instances.
Secure an instance
Service owners must be able to securely connect only to instances available in their service’s secrets management. In Provisioner, update service accounts to use predefined roles offered by GCP Memorystore for Redis.
Service owners must use authentication for caching instances; it will be enabled by default for all services and cannot be opted out of.
Service owners must be able to enable in-transit encryption and configure clients.
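As an illustrative sketch (the helper name and the use of Python's redis-py option names are assumptions, not part of the proposal), a client configured for in-transit encryption might assemble its connection options like this:

```python
def tls_connection_kwargs(host, auth_string, ca_cert_path):
    """Connection options for a Redis client (e.g. redis-py's redis.Redis)
    when Memorystore in-transit encryption is enabled. Memorystore serves
    TLS-encrypted connections on port 6378 and provides a server CA
    certificate that clients must trust."""
    return {
        "host": host,
        "port": 6378,                  # TLS port for Memorystore for Redis
        "password": auth_string,       # AUTH string stored in Vault
        "ssl": True,                   # redis-py flag enabling TLS
        "ssl_ca_certs": ca_cert_path,  # downloaded server CA certificate
    }
```

The notable client-side differences from an unencrypted connection are the dedicated TLS port and the server CA certificate, which each client must be configured to verify.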
Monitor an instance
Service owners must be able to monitor caching instances. Create a new Runway redis_exporter for each Runway Redis instance to scrape metrics using Prometheus. In Runbooks, use monitoring and capacity planning for GCP Memorystore for Redis.
Deprovision an instance
Service owners must be able to disconnect caching instances from their service. Service owners must be able to destroy caching instances.
Pricing for an instance
Service owners must be able to attribute costs of caching instances. In Provisioner, attach standard resource labels to Redis instances following GitLab Infrastructure standards.
Historically, Runway’s operating cost has been negligible due to GCP Cloud Run pricing. By introducing support for caching instance capabilities, this is no longer the case. For example, a single Standard Tier GCP Memorystore for Redis instance can cost $805.92/month.
Pricing is based on components for tier, capacity, region, and replicas. When accessing a Redis instance from a Cloud Run Service client in a different region, Memorystore also charges for network egress from the Redis instance to the client application, based on the total GB transferred between regions.
Basic tier instances should be used for development and testing, and Standard tier instances should be used for GA features.
Design and implementation details
Fully-managed
As a fully-managed solution in GCP, scalability, availability, and maintenance are considered features.
Scalability
Instances can be vertically scaled up to a maximum of 300 GB and support up to 16 Gbps of network throughput. Instances can be horizontally scaled with read queries across up to five replicas. When scaling a Standard Tier instance, applications experience downtime of less than a minute.
Based on experience with self-managed Redis instances for GitLab.com, CPU saturation has been the primary bottleneck. Horizontal scaling with read replicas is only suited to specific workloads and requires trade-offs with availability. As a long-term horizontal scaling solution, Runway will eventually need to support GCP Memorystore for Redis Cluster.
Availability
Instances can be provisioned as Standard Tier types, which are replicated across zones, monitored for health and have fast automatic failover. According to GCP, Standard Tier instances also provide an SLA of 99.9%.
Maintenance
Instances can enable a maintenance policy with a routinely scheduled maintenance window.
Runway will be responsible for defining a maintenance window for all services by default. Service owners will be responsible for optionally overriding default maintenance policy depending on preference.
Runway will be responsible for using the capacity planning process and maintenance notification annotations to ensure system memory utilization is rightsized for maintenance windows depending on workload instance traffic.
Service owners will be responsible for implementing exponential backoff to handle client reconnections after maintenance failover in programming languages LabKit does not support. Runway will be responsible for implementing Redis client functionality in LabKit-supported programming languages, on a just-in-time, case-by-case basis.
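The reconnection responsibility above can be sketched as follows (a minimal illustration, not LabKit's implementation; the helper name and parameters are assumptions, and the builtin ConnectionError stands in for a Redis client library's connection error):

```python
import random
import time


def reconnect_with_backoff(connect, max_attempts=6, base_delay=0.2, cap=10.0):
    """Retry `connect` (a callable returning a live client) with jittered
    exponential backoff, so a service rides out the brief failover window
    during maintenance instead of failing fast."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # exhausted retries; surface the failure
            delay = min(cap, base_delay * (2 ** attempt))
            # Jitter spreads reconnections out to avoid a thundering herd
            time.sleep(delay * random.uniform(0.5, 1.0))
```

A real client would scope the retry to the connection errors its Redis library raises and emit metrics per attempt.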
Data Model
A Workload is a workload in GCP Cloud Run (e.g. service, job). An Instance is a Redis instance of GCP Memorystore for Redis that will initially support type CACHING and can be extended to support type PERSISTENCE. A Workload must be able to connect to multiple Instances (e.g. caching, persistence). An Instance must be able to be provisioned and configured for one or more environments (e.g. staging, production).
Assuming a 1:1 mapping may simplify the first iteration; however, it will not be easily extendable for common scenarios, such as an instance being accessed by both a service and a job, or a service connecting to multiple instances for caching and background job processing.
Terraform
In Provisioner, Runway must manage IaC for the google_redis_instance resource.
Infrastructure has an existing Terraform module to provision a GCP Memorystore for Redis instance. Right now, its outputs are limited to the private IP address of the Redis instance and the module is only used for CustomersDot, so it does not make sense to extend it any further.
Due to Runway’s multi-tenant PaaS use case, we can leverage the Google Cloud Memorystore Terraform Module, which already includes outputs for host, auth_string, read_endpoint, etc.
JSON Schema
In Provisioner, Runway must include a JSON Schema for caching instances. By introducing a type property, Runway can both provide preconfigured defaults for the CACHE type (e.g. maxmemory) and be easily extended to eventually support PERSISTENCE type configuration using conditional required properties, e.g. rdb_snapshot_period.
Below are a few illustrative examples.
Provisioning basic instance:
Provisioning standard high availability instance:
Provisioning regional instance:
Configuring instance:
Provisioning persistent instance:
Additionally, service inventory will be extended to connect a service to instances. Here’s an illustrative example:
As the examples demonstrate, the JSON Schema should be flexible enough to support provisioning and configuring any supported attributes, regardless of instance type.
Multi-Region
By default, provisioning a caching instance will use the same default region as the GCP Cloud Run Service, e.g. us-east1. Optionally, the regions attribute can be set to provision a cache instance per region:
Additionally, replication can be enabled for each regional cache instance:
Separate regional caches w/ regional read replication are supported, not a global cache replicated across multiple regions.
Integration
Vault
In Provisioner, Runway must store caching instance credentials so they can be accessed by Reconciler.
Runway currently uses Vault for secrets management, which is workload and runtime agnostic. As a result, Vault will be the mechanism to connect to an instance.
Secrets for an instance will be stored under the following:
| Mount | Path |
|---|---|
| runway | env/$RUNWAY_ENV/service/$RUNWAY_SERVICE_ID/runway_redis/$RUNWAY_REGION/* |
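For illustration (the helper name is assumed), the path template expands as:

```python
def instance_secrets_path(runway_env, service_id, region):
    """Path under the `runway` Vault mount where Provisioner stores a
    caching instance's secrets, following the template above."""
    return f"env/{runway_env}/service/{service_id}/runway_redis/{region}"
```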
Fields for an instance will include the following:
| Field | Description |
|---|---|
| Password | AUTH string for a Redis instance. |
| Host | Hostname or IP address of the exposed Redis endpoint used by clients to connect to the service. |
| Port | The port number of the exposed Redis endpoint. |
| Read Endpoint | Hostname or IP address of the exposed read-only Redis endpoint. |
| Read Port | The port number of the exposed read-only Redis endpoint. |
The service owner experience should be very similar to competing SaaS offerings, e.g. Heroku Redis add-ons. When a service is connected to a single instance, secrets will be prefixed with RUNWAY_REDIS_, e.g. RUNWAY_REDIS_HOST. When a service is connected to multiple instances, secrets will be suffixed with a unique identifier, e.g. RUNWAY_REDIS_HOST_$ID.
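A service client can resolve the injected secret names with a small helper (illustrative only; everything beyond the RUNWAY_REDIS_ naming convention, including the helper itself and the example identifier, is an assumption):

```python
import os


def redis_secret(field, instance_id=None, env=None):
    """Look up a RUNWAY_REDIS_* secret injected into the service environment.

    With a single connected instance the name is e.g. RUNWAY_REDIS_HOST;
    with multiple instances a unique suffix is appended, e.g.
    RUNWAY_REDIS_HOST_CACHE."""
    env = os.environ if env is None else env
    name = f"RUNWAY_REDIS_{field.upper()}"
    if instance_id is not None:
        name += f"_{instance_id.upper()}"
    return env[name]
```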
AUTH
When creating a Memorystore for Redis instance, Runway will enable authentication by default, and it cannot be opted out of.
The AUTH string is a UUID that Runway will retrieve using the Terraform module and store in Vault under the service's path, so service clients can retrieve it to authenticate when connecting.
To rotate the AUTH string, authentication must unfortunately be temporarily disabled and then re-enabled. Once a client authenticates with an AUTH string, it remains authenticated for the lifetime of that connection, even if the AUTH string changes. However, new GCP Cloud Run instances created by autoscaling, or new connections from application-side connection pooling, will still attempt to use the previous AUTH string. To reduce the impact on availability, service owners must handle failing connections as supported by their Redis client library and re-deploy the service immediately after rotation.
VPC
To connect to a GCP Memorystore for Redis instance from a GCP Cloud Run service, Runway must prepare VPC network egress.
Based on the constraints of Runway’s multi-tenant PaaS use case, using a Serverless VPC Access Connector should provide the simplest isolation, with each service having its own connector when using a caching instance. In Reconciler, configure vpc_access for the service.
Worth noting: VPC will likely need to be revisited for compatibility with GCP Google Kubernetes Engine once the runtime becomes more stable.
IAM
In Provisioner, roles/redis.admin is required for managing instances. GCP Cloud Run service accounts should not require additional roles, since instance access is controlled at the network level.
Metrics
Runway must offer SLIs, SLOs, dashboards, and saturation monitoring for Redis caching instances based on redis_exporter metrics. In Runbooks, use the Redis archetype for Runway Redis instances.
Runway must create a new Runway redis_exporter to scrape Runway Redis instances and remote write to Mimir using the existing runway tenant, similar to how the existing stackdriver_exporter scrapes Runway Cloud Run service metrics. VPC peering, firewall rules, and scrape configuration with multi-targets will need to be set up to access Runway Redis instances located in gitlab-runway-* GCP projects from the remote-write redis_exporter release located in gitlab-* GCP projects.
Constraints
In addition to non-goals, the following is unsupported:
- Runway connecting multiple services to single shared Redis instance
- Runway disaster recovery and recovery time objective for Redis caching instances
- GCP Memorystore data migration
- GCP Memorystore detailed performance analysis (e.g. CPU profiling, tcpdump traffic analysis, rdb memory space analysis, etc)
Alternative Solutions
- Cloud Run Integrations. Cloud Run Integrations is considered a Pre-GA feature, which is available “as-is” with limited support and not recommended for production usage.
- Direct VPC egress. Serverless VPC Access Connector provides service tenant isolation without complex network configuration.
- GCP Cloud Monitoring. Prometheus Redis Exporter provides prior art from GitLab.com and GitLab Dedicated for metrics, dashboards, alerts, and saturation monitoring.
- Memorystore for Redis Cluster. Redis Cluster will be reassessed before adding support for persistence instances.