GrandLine Architecture Intelligence. Operations & Reliability White Paper
Revision: 2026-Q2 · Audience: SRE, platform engineering, and customer operations teams evaluating how we run GrandLine Architecture Intelligence and how we expect you to run the self-hosted edition.
The goal of this document is to tell you how we run GrandLine in production so you can judge whether we meet your reliability bar, and to tell you how to run the self-hosted edition so you can hit the same bar yourself.
1. Summary
GrandLine Architecture Intelligence is a metadata platform. It is not a workload proxy, not on the request path for anything you run in production, and not a dependency of any system that handles customer-facing traffic. This shapes everything about how we operate it:
- Availability target: 99.9% for the dashboard and API; best-effort for scheduled scans (retried with backoff).
- Blast radius of an outage: loss of visibility, not loss of production. No customer workload goes down because GrandLine is down.
- Recovery time objective: 30 minutes for the SaaS control plane; we regularly restore from snapshots in DR drills.
- Recovery point objective: 5 minutes for the primary datastore (Aurora PITR); 24 hours for cold-storage archives.
Deployment primitives are Kubernetes + Helm (primary) and Docker Compose (local dev and small PoCs only). Observability is OpenTelemetry + Prometheus + Loki + Tempo + Grafana. Upgrades are blue-green with schema migrations gated on a canary tenant.
2. Service-level objectives
We publish three SLOs for the SaaS. They are also the SLOs we recommend for self-hosted operators.
| SLO | Target | Measurement window | Error budget |
|---|---|---|---|
| Dashboard availability. GET /api/v1/me returns 2xx within 1s | 99.9% | 30-day rolling | 43 min 12 sec / 30 days |
| Scan completion. a connector scan started before the top of the hour finishes within 15 minutes (p95) | 99.0% | 30-day rolling | 7.2 hours / 30 days |
| Report generation. a requested PDF/DOCX is ready within 2 minutes (p95) | 99.5% | 30-day rolling | 3.6 hours / 30 days |
Why not 99.99%? A four-nines target requires multi-region, active-active writes, and engineering investment we do not believe customers should pay for in this category. Read visibility that is down for 20 minutes a month is fine for architecture / FinOps use cases; request-path tools are a different product. We say so up front so customers can decide whether GrandLine is the right tool for them.
Error budget policy: when a 30-day window burns 50% of its budget, we halt all non-reliability work for the following sprint and run a post-incident review. When it burns 100%, the service owner is paged and we freeze deploys until the budget recovers.
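The budget arithmetic and the two policy thresholds can be sketched in a few lines. This is a minimal illustration, not our internal tooling; the function names and the action labels are made up for the example.

```python
from datetime import timedelta


def error_budget(target: float, window: timedelta = timedelta(days=30)) -> timedelta:
    """Allowed 'bad' time in the window for a target such as 0.999."""
    return window * (1.0 - target)


def policy_action(budget_consumed: float) -> str:
    """Map the fraction of budget burned (0.0 to 1.0+) to a policy step.

    The returned labels are hypothetical names for the steps in the policy.
    """
    if budget_consumed >= 1.0:
        # Budget exhausted: page the service owner and freeze deploys.
        return "freeze-deploys"
    if budget_consumed >= 0.5:
        # Half the budget gone: reliability-only sprint plus a review.
        return "halt-non-reliability-work"
    return "normal"


if __name__ == "__main__":
    for target in (0.999, 0.99, 0.995):
        print(f"{target:.3%} over 30 days -> {error_budget(target)}")
```

For a 30-day window, a 99.9% target leaves 0.1% of 43,200 minutes, which is 43.2 minutes of budget.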
3. Deployment topology (SaaS)
Per region, we run:
| Component | Primitive | Sizing |
|---|---|---|
| API (apps/api, NestJS) | EKS deployment, 3+ replicas, HPA 3–20 on CPU+RPS | sized per workload |
| Worker (apps/worker, BullMQ) | EKS deployment, 6+ replicas, KEDA autoscaled on queue depth | sized per workload |
| Dashboard (apps/dashboard, Next.js) | EKS deployment behind CloudFront | sized per workload |
| Postgres | Aurora PostgreSQL 16, 1 writer + 2 readers | sized per workload, gp3 storage, PITR on |
| Redis | ElastiCache Redis 7.2, single-shard cluster mode | sized per workload |
| S3 (reports + audit archive) | 1 bucket, Object Lock on audit/ prefix | . |
| NAT | 3× NAT GW, one per AZ | . |
| Load balancer | ALB + CloudFront for dashboard; ALB for API | . |
Network layout: 3 AZs, one private subnet per AZ for compute, one data-tier subnet per AZ for Aurora/Redis. Aurora and Redis are unreachable from the internet. The ALB is internet-facing and terminates TLS with a certificate from ACM; nothing else is reachable from the internet.
A separate AWS account, grandline-observability-<region>, holds the Loki, Tempo, and Grafana Cloud exporters. Production AWS accounts have IAM policies that allow them to write to the observability account but not read from it, so a compromised production role cannot tamper with its own audit records or logs.
4. Deployment topology (self-hosted)
Self-hosted is shipped as a Helm chart: helm install grandline grandline/grandline. It does not require a specific cloud. We test on EKS, AKS, GKE, and vanilla kubeadm + MetalLB (for true air-gap installs). The chart expects:
Required:
- A Kubernetes cluster ≥ 1.27.
- An ingress controller (we default to ingress-nginx, but any will do; the chart exposes the ingress class).
- A Postgres 16 database (we can provision one via the bitnami/postgresql subchart for evaluation; production should use managed Postgres).
- A Redis 7+ cluster.
- An object store that speaks the S3 API (S3, Azure Blob via MinIO gateway, Cloud Storage (GCS) via interop, MinIO self-hosted).
Optional:
- cert-manager for automated TLS.
- external-dns for route automation.
- KEDA for queue-based worker autoscaling (chart ships a fallback HPA).
A small eval profile (values-eval.yaml) runs everything in one namespace with in-cluster Postgres and Redis on a single 4 vCPU / 8 GB node. Production profile (values-prod.yaml) assumes managed Postgres/Redis, sets resource requests / limits, and enables PodDisruptionBudgets.
4.1 Cloud-hosted self-hosted patterns
We do not assume a specific host cloud for the self-hosted edition. The chart works the same in all four patterns:
| Pattern | What the customer runs | Data path |
|---|---|---|
| AWS-hosted self-hosted | EKS + Aurora Postgres + ElastiCache Redis + S3 | Scans hit AWS/Azure/GCP from inside the EKS VPC |
| Azure-hosted self-hosted | AKS + Azure Database for PostgreSQL + Azure Cache for Redis + Azure Blob Storage (via S3 interop or MinIO gateway) | Scans hit AWS/Azure/GCP from inside the AKS VNet |
| GCP-hosted self-hosted | GKE + Cloud SQL Postgres + Memorystore Redis + Cloud Storage (GCS) (S3 interop) | Scans hit AWS/Azure/GCP from inside the GKE VPC |
| Customer-managed Kubernetes | Any CNCF-certified cluster + any Postgres 16 + any Redis 7 + any S3-compatible object store | Scans hit the customer's cloud APIs over egress they control |
All four discover AWS, Azure, and GCP; the host cloud is orthogonal to the clouds you scan. You can run self-hosted on GKE and discover an AWS estate. Credentials for discovery are stored in the cluster's native secret store (K8s Secrets with a sealed-secrets or external-secrets controller; the chart supports both).
4.2 Air-gapped install
For customers who cannot reach public registries:
- Mirror ghcr.io/grandline/ and the bitnami/ dependencies into the customer's private registry.
- Run helm pull grandline/grandline --untar and re-point image.registry in values.yaml to the mirror.
- The chart has no runtime calls to the public internet. License validation is offline for air-gapped Enterprise installs (signed license file, validated against a public key baked into the image).
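When building the mirror list, each upstream image reference has to be mapped to its location in the private registry. The sketch below shows one way to do that mapping. The mirror hostname and the image names and tags are placeholders, not the chart's actual image list.

```python
def to_mirror(image: str, mirror: str) -> str:
    """Rewrite an image reference so it points at a private mirror.

    'ghcr.io/grandline/api:1.0.0' keeps its path ('grandline/api:1.0.0');
    only the registry host is swapped. References without an explicit
    registry host (for example 'bitnami/postgresql:16') are prefixed as-is.
    """
    first, sep, rest = image.partition("/")
    # Heuristic: the first component is a registry host if it contains a dot
    # or a port, or is "localhost".
    has_registry = bool(sep) and ("." in first or ":" in first or first == "localhost")
    path = rest if has_registry else image
    return f"{mirror.rstrip('/')}/{path}"


if __name__ == "__main__":
    # Hypothetical image names and mirror host, for illustration only.
    for ref in ("ghcr.io/grandline/api:1.0.0", "bitnami/postgresql:16"):
        print(ref, "->", to_mirror(ref, "registry.internal.example"))
```

The resulting pairs can feed whatever copy tool the registry team already uses; the chart itself only needs image.registry to point at the mirror.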
5. Observability
GrandLine is instrumented with OpenTelemetry SDKs across the API, worker, and dashboard. Traces, metrics, and logs carry a common tenant_id (SaaS) or install_id (self-hosted) attribute.
5.1 Signals
- Metrics. Prometheus format. The chart ships ServiceMonitor CRDs for each component. Key metrics:
  - grandline_api_request_duration_seconds (histogram, labelled by route + tenant + status)
  - grandline_scan_duration_seconds (histogram, labelled by provider + tenant)
  - grandline_queue_depth (gauge, labelled by queue name)
  - grandline_findings_open_total (gauge, labelled by severity)
- Logs. structured JSON (pino in Node, structlog in Python). Shipped to Loki or any JSON-capable log store. Every log line carries trace_id.
- Traces. sent to an OTLP endpoint (Tempo default). ~5% head sampling plus 100% tail sampling for errors.
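The trace sampling rule (roughly 5% of all traces, plus every trace that contains an error) can be illustrated as follows. This is a conceptual sketch of the decision logic only, assuming a hash-based head decision; it is not the collector configuration we ship, and the function names are invented for the example.

```python
import hashlib


def head_sampled(trace_id: str, rate: float = 0.05) -> bool:
    """Deterministic head decision: hash the trace id into [0, 1) and
    keep the trace when the value falls below the sampling rate."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def keep_trace(trace_id: str, has_error: bool, rate: float = 0.05) -> bool:
    """Tail decision, made once the trace is complete: every trace with an
    error is kept; otherwise the head decision stands."""
    return has_error or head_sampled(trace_id, rate)
```

Hashing the trace id, rather than drawing a random number, means every service that sees the same trace reaches the same head decision.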
5.2 Dashboards
The chart ships Grafana dashboards as ConfigMaps with a grafana_dashboard label. Dashboards:
- Platform overview. request rate, latency, error rate, saturation, pod health.
- Scan pipeline. per-provider scan duration, queue depth, failure rate, rate-limit 429s from cloud APIs.
- Data-plane. DB CPU / connections / replication lag, Redis ops/sec, S3 PUT errors.
- Findings. open-findings-by-severity trendline, top noisy rules, suppression rate.
- Cost of running GrandLine. infrastructure cost per tenant (SaaS) or per install (self-hosted).
5.3 Alerts
Alerts ship as PrometheusRule CRDs. The full list is in docs/16-observability.md; the ones tied to SLOs are:
- API latency burn. fires on sustained SLO burn against the 99.9% target.
- Scan stuck. fires if any queue has depth > 500 for 30 minutes.
- DB connection exhaustion. fires at 80% of max_connections sustained for 5 minutes.
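Both the "Scan stuck" and "DB connection exhaustion" alerts depend on a condition holding continuously for a period, not on a single sample. The sketch below shows that "sustained for" logic in plain Python, assuming time-ordered samples; the shipped alerts are PrometheusRule resources, and this function is only an illustration.

```python
from typing import Iterable, Optional, Tuple


def sustained_above(samples: Iterable[Tuple[float, float]],
                    threshold: float,
                    hold_seconds: float) -> bool:
    """Return True if the value stays above `threshold` for at least
    `hold_seconds` without interruption.

    `samples` are (unix_timestamp, value) pairs in ascending time order.
    """
    breach_start: Optional[float] = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of a new breach
            if ts - breach_start >= hold_seconds:
                return True
        else:
            breach_start = None  # any dip resets the timer
    return False


if __name__ == "__main__":
    # Queue depth of 600 sampled once a minute for 30 minutes.
    stuck = [(i * 60.0, 600.0) for i in range(31)]
    print(sustained_above(stuck, threshold=500, hold_seconds=30 * 60))
```

For the connection alert the same function applies with the threshold set to 80% of max_connections and a 5-minute hold.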
6. Backup and restore
6.1 Postgres
- PITR window: 7 days, Aurora continuous.
- Snapshots: daily, retained 30 days, encrypted with a tenant-scoped customer-managed key.
- Logical exports: weekly pg_dump piped to S3 with Object Lock, retained 1 year. Used as a last resort when Aurora itself is unreachable.
6.2 Redis
Not backed up. Redis is BullMQ queue + short-lived cache. On loss, pending jobs are re-enqueued from the Postgres source of truth on the next scheduled scan.
6.3 S3 (reports + audit)
- Versioning on.
- Cross-region replication off by default (regional pinning is a data-residency requirement) but available on Enterprise for customers with different DR needs.
- Audit prefix is Object-Lock-protected (Governance, 7 years).
6.4 Restore drills
Every quarter we pick a random production tenant, restore its data to an isolated cluster, verify the dashboard, spot-check reports, then tear it down. The drill is tracked in a ticket, the outcome is published internally, and the summary is available to Enterprise customers under NDA.
6.5 Restore playbook (self-hosted)
The Helm chart includes a grandline-restore one-shot Job. Given a Postgres dump URL and an S3 archive URL, it:
- Stops the API and worker deployments.
- Restores the DB (pg_restore with --clean --if-exists).
- Syncs the S3 archive into the target bucket.
- Runs prisma migrate deploy to bring the schema to current.
- Starts a single worker pod to replay any pending BullMQ jobs.
- Scales API + worker back up.
- Scales API + worker back up.
End-to-end restore on a ~5 GB tenant dataset: ~25 minutes, measured.
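The ordering of the restore steps matters more than any single command, so it helps to see it as an explicit plan. The sketch below builds that plan as a list of commands without executing anything. It is not the Job's actual implementation. The deployment names, namespace, and replica counts are assumptions, and it assumes the dump has already been downloaded to a local path and that the archive lives in an S3 bucket reachable by the aws CLI.

```python
from typing import List


def restore_plan(dump_path: str, archive_url: str, bucket_url: str, db_url: str,
                 namespace: str = "grandline",
                 api_replicas: int = 3, worker_replicas: int = 6) -> List[List[str]]:
    """Return the restore steps, in order, as argv lists (nothing is run here)."""
    kubectl = ["kubectl", "-n", namespace]
    api = "deployment/grandline-api"        # hypothetical deployment name
    worker = "deployment/grandline-worker"  # hypothetical deployment name
    return [
        # 1. Stop API and workers so nothing writes during the restore.
        kubectl + ["scale", api, worker, "--replicas=0"],
        # 2. Restore the database, dropping existing objects first.
        ["pg_restore", "--clean", "--if-exists", "--dbname", db_url, dump_path],
        # 3. Sync the archived objects into the live bucket.
        ["aws", "s3", "sync", archive_url, bucket_url],
        # 4. Bring the schema up to the current release.
        ["npx", "prisma", "migrate", "deploy"],
        # 5. One worker replays pending jobs before full traffic returns.
        kubectl + ["scale", worker, "--replicas=1"],
        # 6. Scale back up.
        kubectl + ["scale", api, f"--replicas={api_replicas}"],
        kubectl + ["scale", worker, f"--replicas={worker_replicas}"],
    ]


if __name__ == "__main__":
    for step in restore_plan("/tmp/db.dump", "s3://backup/archive",
                             "s3://live/reports", "postgresql://user@host/db"):
        print(" ".join(step))
```

Printing the plan before running it is a cheap way to review a restore during a drill.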
7. Upgrades
SaaS upgrades happen weekly on Tuesdays at 14:00 UTC. The process:
- Canary. the new image ships to a single internal tenant and a volunteer cohort (3 customers) with a 24-hour soak.
- Blue-green. the new image is rolled to a parallel deployment, traffic is shifted 10%/50%/100% with automatic rollback if SLO burn rate exceeds 2×.
- Schema migration. run before API swap, always backward-compatible for one release (expand → contract pattern). A migration that drops a column lands one release after the code that stopped reading it.
- Feature flags. every user-visible behaviour change is behind a feature flag. Flags are rolled per tenant.
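The blue-green step above amounts to a small control loop: shift a slice of traffic, observe the SLO burn rate, and either continue or roll back. A minimal sketch of that loop follows. The burn_rate_at callback is hypothetical and stands in for "apply this traffic weight, wait for the observation period, and return the measured burn rate".

```python
from typing import Callable, Sequence


def shift_traffic(burn_rate_at: Callable[[int], float],
                  stages: Sequence[int] = (10, 50, 100),
                  max_burn_rate: float = 2.0) -> int:
    """Walk the traffic stages in order.

    Returns the percentage of traffic left on the new deployment:
    the last stage on success, or 0 if any stage exceeded the burn-rate
    limit and the rollout was rolled back.
    """
    for percent in stages:
        if burn_rate_at(percent) > max_burn_rate:
            return 0  # automatic rollback: all traffic back to the old deployment
    return stages[-1] if stages else 0


if __name__ == "__main__":
    healthy = shift_traffic(lambda percent: 0.8)
    regressed = shift_traffic(lambda percent: 3.5 if percent == 50 else 0.8)
    print(healthy, regressed)
```

Because schema migrations are backward-compatible for one release, returning all traffic to the old deployment does not require a database change.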
For self-hosted:
- Semver discipline: major.minor.patch. Minor = backward-compatible schema + code. Major = potentially breaking.
- Upgrade command: helm upgrade grandline grandline/grandline --version x.y.z.
- Skipping versions: you may skip minor versions within the same major; you may not skip majors. Customers on an older major run the release-note-documented migration before upgrading.
- Rollback: helm rollback grandline is safe for any minor upgrade; major upgrades require a DB restore and are documented per-release.
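The version-skipping and rollback rules above can be encoded as a pre-flight check in an upgrade pipeline. The sketch below is one reading of those rules, assuming plain major.minor.patch strings: moving to the next major is allowed (with the documented migration), jumping over a major is not. The function names are illustrative.

```python
from typing import Tuple

Version = Tuple[int, int, int]


def parse(version: str) -> Version:
    """Parse 'major.minor.patch' into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def upgrade_allowed(current: str, target: str) -> bool:
    """Minor versions may be skipped within a major; majors may not be skipped."""
    cur, tgt = parse(current), parse(target)
    if tgt <= cur:
        return False  # same version or a downgrade is not an upgrade
    return tgt[0] - cur[0] <= 1


def rollback_is_helm_safe(current: str, previous: str) -> bool:
    """helm rollback alone suffices only when both versions share a major;
    crossing a major boundary also needs a DB restore."""
    return parse(current)[0] == parse(previous)[0]
```

Under this reading, going from 2.1.0 to 2.7.3 passes, while going from 2.1.0 straight to 4.0.0 is rejected until the 3.x step has been taken.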
8. Incident response
We run a written incident-response process. Severity levels and response targets apply to Enterprise contracts. Post-mortems for incidents with customer impact are blameless and written within 5 business days, and are shared with affected Enterprise customers on request.
9. Capacity and scale
9.1 What "scale" looks like in practice
GrandLine is not on the request path. The scaling dimensions we care about are:
- Tenants. currently designed for tens of thousands per region. Tenant bootstrapping is a ~30-second job (provision customer managed key, create roles, seed rule catalogue).
- Resources per tenant. individual tenants with 500k+ resources have been exercised in load tests. The graph schema and cost-daily table scale with horizontal shards on tenantId hash; sharding is not needed before ~2M resources per tenant.
- Scans per hour. the worker fleet autoscales on queue depth (KEDA). A single worker handles ~3 concurrent tenant scans; more workers = more throughput.
- Reports per hour. report rendering is CPU-bound (ReportLab, python-docx). A separate report-worker pool scales independently.
- Diagram rendering. the layout (ELK) and the render (Cytoscape server-side or client-side, see the Diagram Quality technical note) are O(N log N) where N is nodes + edges. For > 1000 nodes in a single view we auto-split by account / VPC / tag. No single view is rendered with more than ~800 nodes.
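The auto-split for large diagrams can be pictured as a recursive grouping: split by account, then by VPC, then by tag, stopping as soon as a view is small enough. The sketch below illustrates the idea under simplifying assumptions. It counts nodes only (not edges), represents each node as a dict with hypothetical account, vpc, and tag keys, and is not the product's layout code.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

Node = Dict[str, str]


def split_view(nodes: List[Node],
               keys: Sequence[str] = ("account", "vpc", "tag"),
               limit: int = 1000) -> List[List[Node]]:
    """Split a view by successive keys until each view has at most `limit` nodes.

    A group that is still over the limit after the last key is returned
    unchanged; a real implementation would need a further fallback.
    """
    if len(nodes) <= limit or not keys:
        return [nodes]
    groups: Dict[str, List[Node]] = defaultdict(list)
    for node in nodes:
        groups[node.get(keys[0], "")].append(node)
    views: List[List[Node]] = []
    for group in groups.values():
        # Only groups that are still too large are split by the next key.
        views.extend(split_view(group, keys[1:], limit))
    return views
```

Small accounts stay in one view, and only the oversized ones are broken down further, which keeps related resources together where possible.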
9.2 Capacity planning
Each component has a published "unit of capacity" table:
- API. 1 replica = 200 req/s (p95 < 200 ms) on 1 vCPU / 1 Gi.
- Worker (scan). 1 replica = 3 concurrent provider scans.
- Worker (report). 1 replica = 4 concurrent report renders.
- Postgres. sized by resources × 1 kB + findings × 0.5 kB + cost_daily × 0.2 kB. A tenant with 50k resources, 20k open findings, 13 months of daily cost ≈ 1.1 GB.
These numbers are regenerated every major release from load tests. They land in docs/16-observability.md.
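The unit-of-capacity figures above translate directly into a sizing calculator. The sketch below is an illustration with invented function names. The number of cost_daily rows is not stated in the text; the roughly 1.1 GB example is consistent with about 5.2 million rows (decimal units), and that figure is used here as an assumption. The minimum of 3 API replicas is taken from the SaaS topology and may not suit every install.

```python
import math


def postgres_kb(resources: int, findings: int, cost_daily_rows: int) -> float:
    """Estimated Postgres footprint in kB, using the per-row sizes above."""
    return resources * 1.0 + findings * 0.5 + cost_daily_rows * 0.2


def api_replicas(peak_rps: float, per_replica_rps: float = 200.0,
                 minimum: int = 3) -> int:
    """API replicas needed for a peak request rate (1 replica = 200 req/s)."""
    return max(minimum, math.ceil(peak_rps / per_replica_rps))


def scan_workers(concurrent_scans: int, per_worker: int = 3) -> int:
    """Scan workers needed (1 replica = 3 concurrent provider scans)."""
    return math.ceil(concurrent_scans / per_worker)


if __name__ == "__main__":
    # 50k resources, 20k findings, and an ASSUMED 5.2M cost_daily rows.
    kb = postgres_kb(50_000, 20_000, 5_200_000)
    print(f"Postgres: {kb / 1_000_000:.2f} GB")
    print("API replicas for 1500 req/s:", api_replicas(1500))
    print("Scan workers for 10 concurrent scans:", scan_workers(10))
```

Treat the outputs as a starting point and validate them against your own load tests.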
10. Known growth risks
We keep a living list. Today:
- Diagram rendering at very large tenants. Tenants with > 1M resources can hit memory pressure in layout. Mitigation: auto-split by account, then by VPC, then by tag, progressive rendering; roadmap item to offload layout to a GPU-backed ELK server.
- Report generation bursts. Hundreds of PDFs at once saturate CPU. Mitigation: separate report-worker pool with its own HPA, queue priority for interactive users over scheduled exports.
- Tenant isolation under compromise. A single compromised operator session could read multiple tenants within the 60-minute break-glass window. Mitigation: per-tenant customer managed keys mean the operator has to Decrypt N times, each logged, each visible to the tenant's audit stream.
- Noisy neighbours on Aurora. A single tenant running very expensive custom-rule queries can starve others. Mitigation: per-tenant statement timeouts, per-tenant connection-pool quotas, Aurora db.r7g.2xlarge headroom of 50%.
- Billing and metering complexity. Resources-based metering is simple to sell but complicated at edge cases (transient resources, decommissioned accounts). Mitigation: we meter on the high-water mark of distinct Resource.id seen in the period, documented clearly; customers can query their own counter via the API.
- Support burden as tenant count grows. Mitigation: Free tier is self-serve only, Pro tier is email, Enterprise tier is dedicated channel; internal tooling for operators scales linearly with headcount.
- Supply-chain visibility. CycloneDX SBOMs are published but not easily diff-able. Roadmap: in-product SBOM diff and CVE alert subscription.
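The high-water-mark metering mentioned under billing can be read in more than one way. The sketch below takes one plausible reading as an assumption: count the distinct Resource.id values seen on each day and bill on the largest daily count in the period. The input shape and function name are hypothetical, and the actual metering rule is whatever the billing documentation states.

```python
from typing import Dict, Iterable, Set, Tuple


def billable_resources(observations: Iterable[Tuple[str, str]]) -> int:
    """Return the peak daily count of distinct resource ids.

    `observations` are (day, resource_id) pairs; repeated sightings of the
    same id on the same day count once.
    """
    per_day: Dict[str, Set[str]] = {}
    for day, resource_id in observations:
        per_day.setdefault(day, set()).add(resource_id)
    return max((len(ids) for ids in per_day.values()), default=0)
```

Under this reading, a transient resource that exists for a few hours raises the bill only if it pushes that day's distinct count above every other day in the period.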
11. Support model
- Free. community forum, docs, GitHub issues on the public repo. No SLA.
- Pro. email support at [email protected]. First response within 1 business day.
- Enterprise. dedicated Slack Connect or Teams channel, 24×7 for SEV1/SEV2, named CSM, quarterly architecture review.
Self-hosted customers get the same support tiers scoped to the control plane; we do not debug customers' Kubernetes clusters but we do give you the tools and logs to do it yourself.
12. Contact
For operations questions, the trust pack (SOC 2 summary, last penetration test), or anything else: [email protected]. Put [ops] or [security] in the subject line so we route it correctly.