GrandLine Architecture Intelligence: Operations & Reliability White Paper

Revision: 2026-Q2 · Audience: SRE, platform engineering, and customer operations teams evaluating how we run GrandLine Architecture Intelligence and how we expect you to run the self-hosted edition.

The goal of this document is to tell you how we run GrandLine in production so you can judge whether we meet your reliability bar, and to tell you how to run the self-hosted edition so you can hit the same bar yourself.

1. Summary

GrandLine Architecture Intelligence is a metadata platform. It is not a workload proxy, not on the request path for anything you run in production, and not a dependency of any system that handles customer-facing traffic. This shapes everything about how we operate it:

Availability target: 99.9% for the dashboard and API; best-effort for scheduled scans (retried with backoff).
Blast radius of an outage: loss of visibility, not loss of production. No customer workload goes down because GrandLine is down.
Recovery time objective: a 30-minute RTO is achievable from snapshots; we recommend operators rehearse it in DR drills.
Recovery point objective: 5 minutes for the primary datastore (Aurora PITR); 24 hours for cold-storage archives.

Deployment primitives are Kubernetes + Helm (primary) and Docker Compose (local dev and small PoCs only). Observability is OpenTelemetry + Prometheus + Loki + Tempo + Grafana. Upgrades are blue-green with schema migrations gated on a canary tenant.

2. Service-level objectives

These are the three SLOs we recommend self-hosted operators target.

SLO	Target	Measurement window	Error budget (per 30-day window)
Dashboard availability: `GET /api/v1/me` returns 2xx within 1s	99.9%	30-day rolling	43 min 12 sec
Scan completion: a connector scan started before the top of the hour finishes within 15 minutes (p95)	99.0%	30-day rolling	7.2 hours
Report generation: a requested PDF/DOCX is ready within 2 minutes (p95)	99.5%	30-day rolling	3.6 hours
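The error-budget column follows directly from the target and the window length. A minimal sketch of that arithmetic, assuming the budget is expressed as time over exactly 30 days (the function name is ours, for illustration only):

```python
from datetime import timedelta


def error_budget(target: float, window_days: int = 30) -> timedelta:
    """Return the allowed 'bad' time for an SLO target over a rolling window.

    target is a fraction, e.g. 0.999 for 99.9%.
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must be a fraction strictly between 0 and 1")
    window = timedelta(days=window_days)
    # The budget is the share of the window that is allowed to miss the SLO.
    return window * (1.0 - target)
```

Over a 30-day window, a 99.9% target allows 43 minutes 12 seconds, 99.0% allows 7.2 hours, and 99.5% allows 3.6 hours.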

Why not 99.99%? A four-nines target requires multi-region, active-active writes, and engineering investment we do not believe customers should pay for in this category. Read visibility that is down for 20 minutes a month is fine for architecture / FinOps use cases; request-path tools are a different product. We say so up front so customers can decide whether GrandLine is the right tool for them.

Error budget policy: when a 30-day window burns 50% of its budget, we halt all non-reliability work for the following sprint and run a post-incident review. When it burns 100%, the service owner is paged and we freeze deploys until the budget recovers.
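The policy reduces to two thresholds on the fraction of budget consumed. The sketch below is illustrative only; the enum and function names are hypothetical and not part of any GrandLine tooling.

```python
from enum import Enum


class BudgetAction(Enum):
    NONE = "continue normal work"
    HALT_FEATURE_WORK = "halt non-reliability work next sprint, run a review"
    FREEZE_DEPLOYS = "page the service owner, freeze deploys"


def budget_action(consumed_seconds: float, budget_seconds: float) -> BudgetAction:
    """Map error-budget consumption in the current 30-day window to an action."""
    if budget_seconds <= 0:
        raise ValueError("budget_seconds must be positive")
    burned = consumed_seconds / budget_seconds
    if burned >= 1.0:  # budget fully spent
        return BudgetAction.FREEZE_DEPLOYS
    if burned >= 0.5:  # half of the budget spent
        return BudgetAction.HALT_FEATURE_WORK
    return BudgetAction.NONE
```

For example, 30 minutes of dashboard downtime against a 43.2-minute budget is about 69% burned, which lands in the halt-feature-work band.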

3. Reference deployment topology

Per region, we run:

Component	Primitive	Sizing
API (`apps/api`, NestJS)	EKS deployment, 3+ replicas, HPA 3–20 on CPU+RPS	sized per workload
Worker (`apps/worker`, BullMQ)	EKS deployment, 6+ replicas, KEDA autoscaled on queue depth	sized per workload
Dashboard (`apps/dashboard`, Next.js)	EKS deployment behind CloudFront	sized per workload
Postgres	Aurora PostgreSQL 16, 1 writer + 2 readers	sized per workload, gp3 storage, PITR on
Redis	ElastiCache Redis 7.2, single-shard cluster mode	sized per workload
S3 (reports + audit archive)	1 bucket, Object Lock on `audit/` prefix	n/a
NAT	3× NAT GW, one per AZ	n/a
Load balancer	ALB + CloudFront for dashboard; ALB for API	n/a

Network layout: 3 AZs, one private subnet per AZ for compute, one data-tier subnet per AZ for Aurora/Redis. Aurora and Redis are unreachable from the internet. The ALB is internet-facing and terminates TLS with a certificate from ACM; nothing else is reachable from the internet.

A separate AWS account, grandline-observability-<region>, holds the Loki, Tempo, and Grafana Cloud exporters. IAM in the production AWS accounts allows them to write to the observability account but not read from it, so a compromised production role cannot tamper with its own audit trail or logs.

4. Deployment topology (self-hosted)

Self-hosted is shipped as a Helm chart: helm install grandline grandline/grandline. It does not require a specific cloud. We test on EKS, AKS, GKE, and vanilla kubeadm + MetalLB (for true air-gap installs). The chart expects:

Required:

A Kubernetes cluster ≥ 1.27.
An ingress controller (we default to ingress-nginx but any will do; the chart exposes ingress class).
A Postgres 16 database (we can provision one via the bitnami/postgresql subchart for evaluation; production should use managed Postgres).
A Redis 7+ cluster.
An object store that speaks the S3 API (S3, Azure Blob via MinIO gateway, Cloud Storage (GCS) via interop, MinIO self-hosted).

Optional:

cert-manager for automated TLS.
external-dns for route automation.
KEDA for queue-based worker autoscaling (chart ships a fallback HPA).

A small eval profile (values-eval.yaml) runs everything in one namespace with in-cluster Postgres and Redis on a single 4 vCPU / 8 GB node. Production profile (values-prod.yaml) assumes managed Postgres/Redis, sets resource requests / limits, and enables PodDisruptionBudgets.

4.1 Cloud-hosted self-hosted patterns

We do not assume a specific host cloud for the self-hosted edition. The chart works the same across all four patterns:

Pattern	What the customer runs	Data path
AWS-hosted self-hosted	EKS + Aurora Postgres + ElastiCache Redis + S3	Scans hit AWS/Azure/GCP from inside the EKS VPC
Azure-hosted self-hosted	AKS + Azure Database for PostgreSQL + Azure Cache for Redis + Azure Blob Storage (via S3 interop or MinIO gateway)	Scans hit AWS/Azure/GCP from inside the AKS VNet
GCP-hosted self-hosted	GKE + Cloud SQL Postgres + Memorystore Redis + Cloud Storage (GCS) (S3 interop)	Scans hit AWS/Azure/GCP from inside the GKE VPC
Customer-managed Kubernetes	Any CNCF-certified cluster + any Postgres 16 + any Redis 7 + any S3-compatible object store	Scans hit the customer's cloud APIs over egress they control

All four discover AWS, Azure, and GCP; the host cloud is orthogonal to the clouds you scan. You can run self-hosted on GKE and discover an AWS estate. Credentials for discovery are stored in the cluster's native secret store (Kubernetes Secrets managed by a sealed-secrets or external-secrets controller; the chart supports both).

4.2 Air-gapped install

For customers who cannot reach public registries:

Mirror ghcr.io/grandline/ and the bitnami/ dependencies into the customer's private registry.
Run helm pull grandline/grandline --untar and re-point image.registry in values.yaml to the mirror.
The chart has no runtime calls to the public internet. License validation is offline for air-gapped Enterprise installs (signed license file, validated against a public key baked into the image).

5. Observability

GrandLine is instrumented with OpenTelemetry SDKs across the API, worker, and dashboard. Traces, metrics, and logs carry a common install_id attribute (and tenant_id in multi-tenant installs).

5.1 Signals

Metrics: Prometheus format. The chart ships ServiceMonitor resources for each component. Key metrics:
grandline_api_request_duration_seconds (histogram, labelled by route + tenant + status)
grandline_scan_duration_seconds (histogram, labelled by provider + tenant)
grandline_queue_depth (gauge, labelled by queue name)
grandline_findings_open_total (gauge, labelled by severity)
Logs: structured JSON (pino in Node, structlog in Python), shipped to Loki or any JSON-capable log store. Every log line carries trace_id.
Traces: sent to an OTLP endpoint (Tempo by default), with ~5% head sampling plus 100% tail sampling for errors.
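The trace sampling rule combines two decisions: a trace is kept if it contained an error, or if it falls into the ~5% head sample. The sketch below shows that logic in isolation. It is not how the OpenTelemetry SDK or collector implements sampling; the hashing scheme and function names are assumptions chosen for illustration.

```python
import hashlib

HEAD_SAMPLE_RATE = 0.05  # ~5% of traces, per the sampling policy above


def head_sampled(trace_id: str, rate: float = HEAD_SAMPLE_RATE) -> bool:
    """Deterministically pick roughly `rate` of all trace IDs."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(rate * 2**64)


def keep_trace(trace_id: str, had_error: bool) -> bool:
    """Keep every errored trace; otherwise fall back to the head sample."""
    return had_error or head_sampled(trace_id)
```

Hashing the trace ID rather than drawing a random number means every component reaches the same decision for the same trace.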

5.2 Dashboards

The chart ships Grafana dashboards as ConfigMaps with a grafana_dashboard label. Dashboards:

Platform overview: request rate, latency, error rate, saturation, pod health.
Scan pipeline: per-provider scan duration, queue depth, failure rate, rate-limit 429s from cloud APIs.
Data plane: DB CPU / connections / replication lag, Redis ops/sec, S3 PUT errors.
Findings: open-findings-by-severity trendline, top noisy rules, suppression rate.
Cost of running GrandLine: infrastructure cost per install (and per tenant in multi-tenant installs).

5.3 Alerts

Alerts ship as PrometheusRule resources. The full list is in docs/16-observability.md; the ones tied to SLOs are:

API latency burn: fires on sustained error-budget burn against the 99.9% target.
Scan stuck: fires if any queue has depth > 500 for 30 minutes.
DB connection exhaustion: fires at 80% of max_connections sustained for 5 minutes.
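The "scan stuck" and "DB connection exhaustion" alerts share one shape: a threshold that must hold continuously for a minimum duration before firing. The sketch below mimics that behaviour over a list of samples. It is a simplified model for reasoning about the rules, not the shipped PrometheusRule definitions, and the function name is hypothetical.

```python
from typing import Sequence, Tuple


def sustained_above(
    samples: Sequence[Tuple[float, float]],
    threshold: float,
    hold_seconds: float,
) -> bool:
    """Return True if the value has stayed above threshold for hold_seconds.

    samples are (unix_timestamp, value) pairs sorted by ascending timestamp.
    Only the trailing unbroken run of breaching samples counts.
    """
    run_start = None
    for ts, value in samples:
        if value > threshold:
            if run_start is None:
                run_start = ts  # a new breaching run begins here
        else:
            run_start = None  # any healthy sample resets the timer
    if run_start is None:
        return False
    return samples[-1][0] - run_start >= hold_seconds
```

For the scan-stuck rule this would be called with `threshold=500` and `hold_seconds=1800`; a single sample at or below 500 resets the timer.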

6. Backup and restore

6.1 Postgres

PITR window: 7 days, Aurora continuous.
Snapshots: daily, retained 30 days, encrypted with a tenant-scoped customer-managed key.
Logical exports: weekly pg_dump piped to S3 with Object Lock, retained 1 year. Used as a last resort when Aurora itself is unreachable.

6.2 Redis

Not backed up. Redis holds only the BullMQ queues and a short-lived cache. On loss, pending jobs are re-enqueued from the Postgres source of truth on the next scheduled scan.

6.3 S3 (reports + audit)

Versioning on.
Cross-region replication off by default (regional pinning is a data-residency requirement) but available on Enterprise for customers with different DR needs.
Audit prefix is Object-Lock-protected (Governance, 7 years).

6.4 Restore drills

Every quarter we pick a random production tenant, restore its data to an isolated cluster, verify the dashboard, spot-check reports, then tear it down. The drill is tracked in a ticket, the outcome is published internally, and the summary is available to Enterprise customers under NDA.

6.5 Restore playbook (self-hosted)

The Helm chart includes a grandline-restore one-shot Job. Given a Postgres dump URL and an S3 archive URL, it:

Stops the API and worker deployments.
Restores the DB (pg_restore with --clean --if-exists).
Syncs the S3 archive into the target bucket.
Runs prisma migrate deploy to bring the schema to current.
Starts a single worker pod to replay any pending BullMQ jobs.
Scales API + worker back up.

End-to-end restore on a ~5 GB tenant dataset: ~25 minutes, measured.
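The order of the restore steps matters: the schema migration must not run against a half-restored database, and workloads must not come back before the data is in place. The sketch below models the playbook as an ordered, fail-fast sequence. It is a hypothetical illustration of that ordering, not the implementation of the grandline-restore Job; the step actions are placeholders you would supply.

```python
from typing import Callable, List, Tuple

# Step names mirror the playbook above, in order.
RESTORE_STEP_NAMES = [
    "stop API and worker deployments",
    "restore DB (pg_restore --clean --if-exists)",
    "sync S3 archive into target bucket",
    "prisma migrate deploy",
    "start one worker to replay pending jobs",
    "scale API and worker back up",
]


def run_restore(steps: List[Tuple[str, Callable[[], None]]]) -> List[str]:
    """Run steps in order and stop at the first failure.

    Returns the names of completed steps; raises RuntimeError naming the
    step that failed so the operator knows where to resume.
    """
    completed: List[str] = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            position = len(completed) + 1
            raise RuntimeError(f"restore halted at step {position}: {name}") from exc
        completed.append(name)
    return completed
```

Failing fast keeps the API and workers scaled down if any earlier step breaks, so a partial restore is never served to users.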

7. Upgrades

We recommend operators schedule upgrades on a regular cadence inside their own change window. A typical process:

Canary: the new image ships to a single internal tenant and a volunteer cohort (3 customers) with a 24-hour soak.
Blue-green: the new image is rolled to a parallel deployment, and traffic is shifted 10%/50%/100% with automatic rollback if the SLO burn rate exceeds 2×.
Schema migration: runs before the API swap and is always backward-compatible for one release (expand → contract pattern). A migration that drops a column lands one release after the code that stopped reading it.
Feature flags: every user-visible behaviour change is behind a feature flag. Flags are rolled per tenant.
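The blue-green step can be read as a loop over traffic stages with a burn-rate gate at each one. The sketch below uses the common definition of burn rate (observed error ratio divided by the error ratio the SLO allows). The `observe` callback, the default target, and the function names are assumptions for illustration; the real rollout controller is not described in this paper.

```python
from typing import Callable, Sequence, Tuple


def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    if total == 0:
        return 0.0
    return (bad / total) / (1.0 - slo_target)


def shift_traffic(
    observe: Callable[[int], Tuple[int, int]],
    slo_target: float = 0.999,
    stages: Sequence[int] = (10, 50, 100),
    max_burn: float = 2.0,
) -> int:
    """Walk the traffic stages; return 0 (rolled back) or the final percent.

    observe(percent) returns (bad_requests, total_requests) measured on the
    new deployment while it serves that share of traffic.
    """
    for percent in stages:
        bad, total = observe(percent)
        if burn_rate(bad, total, slo_target) > max_burn:
            return 0  # automatic rollback: all traffic back to the old version
    return stages[-1]
```

With a 99.9% target, 5 failures in 1,000 requests is a burn rate of about 5×, which trips the 2× gate.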

For self-hosted:

Semver discipline: major.minor.patch. Minor = backward-compatible schema + code. Major = potentially breaking.
Upgrade command: helm upgrade grandline grandline/grandline --version x.y.z.
Skipping versions: you may skip minor versions within the same major; you may not skip majors. Customers on an older major run the release-note-documented migration before upgrading.
Rollback: helm rollback grandline is safe for any minor upgrade; major upgrades require a DB restore and are documented per-release.
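The version-skipping and rollback rules can be checked mechanically before running `helm upgrade`. A minimal sketch, assuming plain `major.minor.patch` version strings with no pre-release suffix, and assuming patch upgrades are treated like minor upgrades for rollback (function names are hypothetical):

```python
from typing import Tuple

Version = Tuple[int, int, int]


def parse(version: str) -> Version:
    """Parse 'major.minor.patch' into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def upgrade_allowed(current: str, target: str) -> bool:
    """Minors may be skipped within a major; majors may not be skipped."""
    cur, tgt = parse(current), parse(target)
    if tgt <= cur:
        return False  # not an upgrade
    return tgt[0] - cur[0] <= 1


def helm_rollback_is_enough(previous: str, current: str) -> bool:
    """True when both versions share a major, so no DB restore is needed."""
    return parse(previous)[0] == parse(current)[0]
```

For example, 2.3.1 to 2.7.0 is allowed and can be undone with `helm rollback`, while 1.9.0 to 3.0.0 is rejected because it skips major version 2.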

8. Incident response

We run a written incident-response process. Severity levels and response targets apply to Enterprise contracts. Post-mortems for incidents with customer impact are blameless and written within 5 business days, and are shared with affected Enterprise customers on request.

9. Capacity and scale

9.1 What "scale" looks like in practice

GrandLine is not on the request path. The scaling dimensions we care about are:

Tenants: currently designed for tens of thousands per region. Tenant bootstrapping is a ~30-second job (provision customer-managed key, create roles, seed rule catalogue).
Resources per tenant: individual tenants with 500k+ resources have been exercised in load tests. The graph schema and cost-daily table scale with horizontal shards on tenantId hash; sharding is not needed before ~2M resources per tenant.
Scans per hour: the worker fleet autoscales on queue depth (KEDA). A single worker handles ~3 concurrent tenant scans; more workers = more throughput.
Reports per hour: report rendering is CPU-bound (ReportLab, python-docx). A separate report-worker pool scales independently.
Diagram rendering: the layout (ELK) and the render (Cytoscape server-side or client-side, see the Diagram Quality technical note) are O(N log N) where N is nodes + edges. Views with more than 1000 nodes are auto-split by account / VPC / tag, and each resulting view is kept to ~800 nodes or fewer.
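The diagram auto-split can be pictured as a recursive partition: split an oversized view by account, then any still-oversized piece by VPC, then by tag. The sketch below is an illustrative model only; the node representation (a dict with `account`, `vpc`, and `tag` keys) and the function name are assumptions, not the product's data model.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

Node = Dict[str, str]
SPLIT_KEYS = ("account", "vpc", "tag")  # split order described above


def split_view(
    nodes: List[Node], max_nodes: int, keys: Sequence[str] = SPLIT_KEYS
) -> List[List[Node]]:
    """Partition nodes into views, splitting only views that exceed max_nodes.

    If every key is exhausted and a group is still too large, it is
    returned as-is; a real renderer would need a further fallback.
    """
    if len(nodes) <= max_nodes or not keys:
        return [nodes]
    groups: Dict[str, List[Node]] = defaultdict(list)
    for node in nodes:
        groups[node.get(keys[0], "")].append(node)
    views: List[List[Node]] = []
    for group in groups.values():
        views.extend(split_view(group, max_nodes, keys[1:]))
    return views
```

Calling `split_view(nodes, max_nodes=1000)` on 1,500 nodes spread evenly over two accounts yields two views of 750 nodes each.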

9.2 Capacity planning

Each component has a published "unit of capacity" table:

API: 1 replica = 200 req/s (p95 < 200 ms) on 1 vCPU / 1 Gi.
Worker (scan): 1 replica = 3 concurrent provider scans.
Worker (report): 1 replica = 4 concurrent report renders.
Postgres: sized by resources × 1 kB + findings × 0.5 kB + cost_daily × 0.2 kB. A tenant with 50k resources, 20k open findings, 13 months of daily cost ≈ 1.1 GB.

These numbers are regenerated every major release from load tests. They land in docs/16-observability.md.
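The unit-of-capacity figures translate into simple sizing arithmetic. The sketch below encodes them as published above; the function names are ours. Note one assumption: the paper does not state how many cost_daily rows the example tenant has, so the usage below uses 5.2 million rows, which is the count that makes the formula land on the quoted ≈ 1.1 GB (with 1 GB = 1,000,000 kB).

```python
import math


def api_replicas(peak_rps: float) -> int:
    """1 API replica = 200 req/s."""
    return max(1, math.ceil(peak_rps / 200))


def scan_worker_replicas(concurrent_scans: int) -> int:
    """1 scan worker = 3 concurrent provider scans."""
    return max(1, math.ceil(concurrent_scans / 3))


def report_worker_replicas(concurrent_renders: int) -> int:
    """1 report worker = 4 concurrent report renders."""
    return max(1, math.ceil(concurrent_renders / 4))


def postgres_kb(resources: int, findings: int, cost_daily_rows: int) -> float:
    """resources x 1 kB + findings x 0.5 kB + cost_daily rows x 0.2 kB."""
    return resources * 1.0 + findings * 0.5 + cost_daily_rows * 0.2


# Example tenant from the text; the cost_daily row count is an assumption.
size_gb = postgres_kb(50_000, 20_000, 5_200_000) / 1_000_000
print(f"{size_gb:.2f} GB")
```

The results are lower bounds per component; the reference topology still starts at 3 API replicas and 6 worker replicas regardless of what the arithmetic returns.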

10. Known growth risks

We keep a living list. Today:

Diagram rendering at very large tenants. Tenants with > 1M resources can hit memory pressure in layout. Mitigation: auto-split by account, then by VPC, then by tag, progressive rendering; roadmap item to offload layout to a GPU-backed ELK server.
Report generation bursts. Hundreds of PDFs at once saturate CPU. Mitigation: separate report-worker pool with its own HPA, queue priority for interactive users over scheduled exports.
Tenant isolation under compromise. A single compromised operator session could read multiple tenants within the 60-minute break-glass window. Mitigation: per-tenant customer-managed keys mean the operator must call Decrypt N times, each call logged and visible to the tenant's audit stream.
Noisy neighbours on Aurora. A single tenant running very expensive custom-rule queries can starve others. Mitigation: per-tenant statement timeouts, per-tenant connection-pool quotas, and 50% headroom on Aurora db.r7g.2xlarge instances.
Billing and metering complexity. Resources-based metering is simple to sell but complicated at edge cases (transient resources, decommissioned accounts). Mitigation: we meter on the high-water mark of distinct Resource.id seen in the period, documented clearly; customers can query their own counter via the API.
Support burden as tenant count grows. Mitigation: Free tier is self-serve only, Pro tier is email, Enterprise tier is dedicated channel; internal tooling for operators scales linearly with headcount.
Supply-chain visibility. CycloneDX SBOMs are published but not easily diff-able. Roadmap: in-product SBOM diff and CVE alert subscription.
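The metering rule in the billing item above (high-water mark of distinct Resource.id seen in the period) can be made concrete. The sketch below takes one plausible reading: each scan yields the set of resource IDs present at that moment, and the billable count is the largest such set in the period. That interpretation, the input shape, and the function name are assumptions; the alternative reading (all distinct IDs ever seen in the period) would count transient resources cumulatively and give a higher number.

```python
from typing import Iterable


def high_water_mark(scans: Iterable[Iterable[str]]) -> int:
    """Largest number of distinct resource IDs observed in any single scan."""
    peak = 0
    for resource_ids in scans:
        # Duplicates within a scan count once.
        peak = max(peak, len(set(resource_ids)))
    return peak
```

Under this reading, a tenant whose scans saw 3, then 5, then 2 distinct resources in a period is metered at 5, even if the IDs changed between scans.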

11. Support model

Free: community forum, docs, GitHub issues on the public repo. No SLA.
Pro: email support at [email protected]. First response within 1 business day.
Enterprise: dedicated Slack Connect or Teams channel, 24×7 for SEV1/SEV2, named CSM, quarterly architecture review.

Self-hosted customers get the same support tiers scoped to the control plane; we do not debug customers' Kubernetes clusters but we do give you the tools and logs to do it yourself.

12. Contact

For operations questions, the trust pack (SOC 2 summary, last penetration test), or anything else: [email protected]. Put [ops] or [security] in the subject line so we route it correctly.