
GrandLine Architecture Intelligence: Operations & Reliability White Paper

Revision: 2026-Q2 · Audience: SRE, platform engineering, and customer operations teams evaluating how we run GrandLine Architecture Intelligence and how we expect you to run the self-hosted edition.

The goal of this document is to tell you how we run GrandLine in production so you can judge whether we meet your reliability bar, and to tell you how to run the self-hosted edition so you can hit the same bar yourself.

1. Summary

GrandLine Architecture Intelligence is a metadata platform. It is not a workload proxy, not on the request path for anything you run in production, and not a dependency of any system that handles customer-facing traffic. This shapes everything about how we operate it:

Deployment primitives are Kubernetes + Helm (primary) and Docker Compose (local dev and small PoCs only). Observability is OpenTelemetry + Prometheus + Loki + Tempo + Grafana. Upgrades are blue-green with schema migrations gated on a canary tenant.


2. Service-level objectives

We publish three SLOs for the SaaS. They are also the SLOs we recommend for self-hosted operators.

| SLO | Target | Measurement window | Error budget |
| --- | --- | --- | --- |
| Dashboard availability: GET /api/v1/me returns 2xx within 1 s | 99.9% | 30-day rolling | 43 min 49 sec / month |
| Scan completion: a connector scan started before the top of the hour finishes within 15 minutes (p95) | 99.0% | 30-day rolling | 7.3 hours / month |
| Report generation: a requested PDF/DOCX is ready within 2 minutes (p95) | 99.5% | 30-day rolling | 3.6 hours / month |
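The error budgets above follow from the target and the window length. A minimal sketch of the arithmetic, assuming the budget is simply (1 - target) × window; the function and constant names are illustrative and not part of GrandLine. The published dashboard figure (43 min 49 sec) appears to correspond to an average calendar month of 365.25 / 12 days; a strict 30-day window gives 43 min 12 sec for the same target.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Return the time an SLO may be violated within the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return window * (1.0 - slo_target)


# Two window conventions (assumption: the table uses the second one).
WINDOW_30D = timedelta(days=30)
AVG_MONTH = timedelta(days=365.25 / 12)

if __name__ == "__main__":
    for name, target in [("dashboard", 0.999), ("scan", 0.990), ("report", 0.995)]:
        print(name, error_budget(target, WINDOW_30D), error_budget(target, AVG_MONTH))
```

Whichever convention you pick for a self-hosted install, use the same one in your alert rules and your reporting so the numbers agree.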

Why not 99.99%? A four-nines target requires multi-region, active-active writes, and engineering investment we do not believe customers should pay for in this category. Read visibility that is down for 20 minutes a month is fine for architecture / FinOps use cases; request-path tools are a different product. We say so up front so customers can decide whether GrandLine is the right tool for them.

Error budget policy: when a 30-day window burns 50% of its budget, we halt all non-reliability work for the following sprint and run a post-incident review. When it burns 100%, the service owner is paged and we freeze deploys until the budget recovers.
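The policy can be expressed as a small decision function. This is a sketch only: the enum and function names are hypothetical, and treating the 50% and 100% thresholds as inclusive is an assumption.

```python
from enum import Enum


class BudgetAction(Enum):
    NORMAL = "normal operations"
    HALT_FEATURE_WORK = "halt non-reliability work next sprint, run review"
    FREEZE_DEPLOYS = "page service owner, freeze deploys"


def budget_action(budget_burned: float) -> BudgetAction:
    """Map the burned fraction of a 30-day error budget to a policy action.

    budget_burned is 0.0 for an untouched budget and 1.0 for a fully
    spent one; values above 1.0 mean the SLO has been missed.
    """
    if budget_burned < 0.0:
        raise ValueError("budget_burned cannot be negative")
    if budget_burned >= 1.0:  # assumption: threshold is inclusive
        return BudgetAction.FREEZE_DEPLOYS
    if budget_burned >= 0.5:  # assumption: threshold is inclusive
        return BudgetAction.HALT_FEATURE_WORK
    return BudgetAction.NORMAL
```

Self-hosted operators can wire the same thresholds into whatever ticketing or paging system they already use.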


3. Deployment topology (SaaS)

Per region, we run:

| Component | Primitive | Sizing |
| --- | --- | --- |
| API (apps/api, NestJS) | EKS deployment, 3+ replicas, HPA 3–20 on CPU+RPS | sized per workload |
| Worker (apps/worker, BullMQ) | EKS deployment, 6+ replicas, KEDA autoscaled on queue depth | sized per workload |
| Dashboard (apps/dashboard, Next.js) | EKS deployment behind CloudFront | sized per workload |
| Postgres | Aurora PostgreSQL 16, 1 writer + 2 readers | sized per workload, gp3 storage, PITR on |
| Redis | ElastiCache Redis 7.2, single-shard cluster mode | sized per workload |
| S3 (reports + audit archive) | 1 bucket, Object Lock on audit/ prefix | - |
| NAT | 3× NAT GW, one per AZ | - |
| Load balancer | ALB + CloudFront for dashboard; ALB for API | - |
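Queue-depth autoscaling for the worker row reduces to a ceiling division with a floor and a cap. A rough sketch, not KEDA's actual implementation: the figure of ~3 concurrent scans per worker comes from section 9.1, the minimum of 6 from the table above, and the maximum of 60 is a made-up placeholder.

```python
import math


def desired_workers(
    queue_depth: int,
    scans_per_worker: int = 3,  # from section 9.1 (approximate)
    min_replicas: int = 6,      # from the topology table
    max_replicas: int = 60,     # hypothetical cap, not a documented value
) -> int:
    """Return the worker replica count for a given number of queued scans."""
    if queue_depth < 0 or scans_per_worker <= 0:
        raise ValueError("queue_depth must be >= 0 and scans_per_worker > 0")
    needed = math.ceil(queue_depth / scans_per_worker)
    return max(min_replicas, min(max_replicas, needed))
```

The floor keeps capacity warm for the hourly scan burst; the cap bounds cost and database connection count.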

Network layout: 3 AZs, one private subnet per AZ for compute, one data-tier subnet per AZ for Aurora/Redis. Aurora and Redis are unreachable from the internet. The ALB is internet-facing and terminates TLS with a certificate from ACM; nothing else is reachable from the internet.

A separate AWS account, grandline-observability-<region>, holds the Loki, Tempo, and Grafana Cloud exporters. IAM policies let production accounts write to the observability account but not read from it, so a compromised production role cannot tamper with its own audit trail or logs.


4. Deployment topology (self-hosted)

Self-hosted is shipped as a Helm chart: helm install grandline grandline/grandline. It does not require a specific cloud. We test on EKS, AKS, GKE, and vanilla kubeadm + MetalLB (for true air-gap installs). The chart expects:

Required:

Optional:

A small eval profile (values-eval.yaml) runs everything in one namespace with in-cluster Postgres and Redis on a single 4 vCPU / 8 GB node. The production profile (values-prod.yaml) assumes managed Postgres/Redis, sets resource requests and limits, and enables PodDisruptionBudgets.

4.1 Cloud-hosted self-hosted patterns

We do not assume a specific host cloud for the self-hosted edition. The chart works the same across all four patterns below:

| Pattern | What the customer runs | Data path |
| --- | --- | --- |
| AWS-hosted self-hosted | EKS + Aurora Postgres + ElastiCache Redis + S3 | Scans hit AWS/Azure/GCP from inside the EKS VPC |
| Azure-hosted self-hosted | AKS + Azure Database for PostgreSQL + Azure Cache for Redis + Azure Blob Storage (via S3 interop or MinIO gateway) | Scans hit AWS/Azure/GCP from inside the AKS VNet |
| GCP-hosted self-hosted | GKE + Cloud SQL Postgres + Memorystore Redis + Cloud Storage (GCS) (S3 interop) | Scans hit AWS/Azure/GCP from inside the GKE VPC |
| Customer-managed Kubernetes | Any CNCF-certified cluster + any Postgres 16 + any Redis 7 + any S3-compatible object store | Scans hit the customer's cloud APIs over egress they control |

All four patterns discover AWS, Azure, and GCP; the host cloud is orthogonal to the clouds you scan. You can run self-hosted on GKE and discover an AWS estate. Credentials for discovery are stored in the cluster's native secret store (Kubernetes Secrets with a sealed-secrets or external-secrets controller; the chart supports both).

4.2 Air-gapped install

For customers who cannot reach public registries:

  1. Mirror ghcr.io/grandline/ and the bitnami/ dependencies into the customer's private registry.
  2. Run helm pull grandline/grandline --untar and re-point image.registry in values.yaml to the mirror.
  3. The chart has no runtime calls to the public internet. License validation is offline for air-gapped Enterprise installs (signed license file, validated against a public key baked into the image).

5. Observability

GrandLine is instrumented with OpenTelemetry SDKs across the API, worker, and dashboard. Traces, metrics, and logs carry a common tenant_id (SaaS) or install_id (self-hosted) attribute.

5.1 Signals

5.2 Dashboards

The chart ships Grafana dashboards as ConfigMaps with a grafana_dashboard label. Dashboards:

5.3 Alerts

Alerts ship as PrometheusRule CRDs. The full list is in docs/16-observability.md; the ones tied to SLOs are:


6. Backup and restore

6.1 Postgres

6.2 Redis

Not backed up. Redis holds only the BullMQ queue and a short-lived cache. On loss, pending jobs are re-enqueued from the Postgres source of truth on the next scheduled scan.

6.3 S3 (reports + audit)

6.4 Restore drills

Every quarter we pick a random production tenant, restore its data to an isolated cluster, verify the dashboard, spot-check reports, then tear it down. The drill is tracked in a ticket, the outcome is published internally, and the summary is available to Enterprise customers under NDA.

6.5 Restore playbook (self-hosted)

The Helm chart includes a grandline-restore one-shot Job. Given a Postgres dump URL and an S3 archive URL, it:

  1. Stops the API and worker deployments.
  2. Restores the DB (pg_restore with --clean --if-exists).
  3. Syncs the S3 archive into the target bucket.
  4. Runs prisma migrate deploy to bring the schema to current.
  5. Starts a single worker pod to replay any pending BullMQ jobs.
  6. Scales API + worker back up.

End-to-end restore on a ~5 GB tenant dataset: ~25 minutes, measured.
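The six playbook steps can be written down as an ordered command plan. This is an illustrative sketch, not the contents of the grandline-restore Job: the deployment names, the use of `aws s3 sync`, the replica counts, and the assumption that the dump has already been downloaded to a local path are all hypothetical. Only the `pg_restore --clean --if-exists` flags and `prisma migrate deploy` come from the steps above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    number: int             # playbook step this command belongs to
    argv: tuple[str, ...]   # command line, ready for subprocess.run


def build_restore_plan(
    dump_path: str,
    archive_url: str,
    bucket_url: str,
    database_url: str,
    api_replicas: int = 3,     # assumption: matches the SaaS minimum
    worker_replicas: int = 6,  # assumption: matches the SaaS minimum
) -> list[Step]:
    """Return the restore commands in the order the playbook lists them."""
    api = "deployment/grandline-api"        # hypothetical name
    worker = "deployment/grandline-worker"  # hypothetical name
    return [
        Step(1, ("kubectl", "scale", api, worker, "--replicas=0")),
        Step(2, ("pg_restore", "--clean", "--if-exists",
                 f"--dbname={database_url}", dump_path)),
        Step(3, ("aws", "s3", "sync", archive_url, bucket_url)),
        Step(4, ("npx", "prisma", "migrate", "deploy")),
        # A single worker replays pending jobs before the fleet returns.
        Step(5, ("kubectl", "scale", worker, "--replicas=1")),
        Step(6, ("kubectl", "scale", api, f"--replicas={api_replicas}")),
        Step(6, ("kubectl", "scale", worker, f"--replicas={worker_replicas}")),
    ]
```

Building the plan as data first makes it easy to print and review before anything destructive runs; an executor would stop at the first command that exits non-zero.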


7. Upgrades

SaaS upgrades happen weekly on Tuesdays at 14:00 UTC. The process:

  1. Canary: the new image ships to a single internal tenant and a volunteer cohort (3 customers) with a 24-hour soak.
  2. Blue-green: the new image is rolled out to a parallel deployment, and traffic is shifted in 10% / 50% / 100% steps, with automatic rollback if the SLO burn rate exceeds 2×.
  3. Schema migration: runs before the API swap and is always backward-compatible for one release (expand → contract pattern). A migration that drops a column lands one release after the code that stopped reading it.
  4. Feature flags: every user-visible behaviour change is behind a feature flag. Flags are rolled out per tenant.
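The blue-green traffic shift in step 2 can be sketched as a loop that checks burn rate at each stage. This assumes the common definition of burn rate (observed error rate divided by the error rate the SLO allows); the function names and the `observe_error_rate` callback are hypothetical, and the real rollout controller is not described in this paper.

```python
from typing import Callable, Sequence


def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate relative to the error rate the SLO allows."""
    return error_rate / (1.0 - slo_target)


def shift_traffic(
    observe_error_rate: Callable[[int], float],
    slo_target: float = 0.999,
    steps: Sequence[int] = (10, 50, 100),
    max_burn_rate: float = 2.0,
) -> tuple[str, int]:
    """Shift traffic stage by stage; roll back if burn rate exceeds the limit.

    observe_error_rate(percent) returns the error rate measured while
    `percent` of traffic is on the new deployment.
    Returns the outcome and the final traffic share on the new deployment.
    """
    for percent in steps:
        rate = observe_error_rate(percent)
        if burn_rate(rate, slo_target) > max_burn_rate:
            return ("rolled_back", 0)
    return ("promoted", 100)
```

For example, `shift_traffic(lambda percent: 0.0005)` promotes the release, while a callback that reports a 0.5% error rate at the 50% stage triggers a rollback.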

For self-hosted:


8. Incident response

We run a written incident-response process. Severity levels and response targets apply to Enterprise contracts. Post-mortems for incidents with customer impact are blameless and written within 5 business days, and are shared with affected Enterprise customers on request.


9. Capacity and scale

9.1 What "scale" looks like in practice

GrandLine is not on the request path. The scaling dimensions we care about are:

  1. Tenants: currently designed for tens of thousands per region. Tenant bootstrapping is a ~30-second job (provision customer managed key, create roles, seed rule catalogue).
  2. Resources per tenant: individual tenants with 500k+ resources have been exercised in load tests. The graph schema and cost-daily table scale with horizontal shards on tenantId hash; sharding is not needed before ~2M resources per tenant.
  3. Scans per hour: the worker fleet autoscales on queue depth (KEDA). A single worker handles ~3 concurrent tenant scans; more workers = more throughput.
  4. Reports per hour: report rendering is CPU-bound (ReportLab, python-docx). A separate report-worker pool scales independently.
  5. Diagram rendering: the layout (ELK) and the render (Cytoscape server-side or client-side, see the Diagram Quality technical note) are O(N log N) where N is nodes + edges. For > 1000 nodes in a single view we auto-split by account / VPC / tag. No single view is rendered with more than ~800 nodes.
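Sharding on a tenantId hash, as mentioned in item 2, can be illustrated with a stable hash and a modulo. The hash function (SHA-256) and the modulo placement are assumptions made for this sketch; the paper does not state which scheme GrandLine uses.

```python
import hashlib


def shard_for_tenant(tenant_id: str, shard_count: int) -> int:
    """Map a tenant ID to a shard index in [0, shard_count).

    Uses a cryptographic digest rather than Python's built-in hash(),
    which is salted per process and would not be stable across pods.
    """
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count
```

Plain modulo placement moves most tenants when `shard_count` changes; a consistent-hashing ring or a lookup table avoids that at the cost of extra bookkeeping.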

9.2 Capacity planning

Each component has a published "unit of capacity" table:

These numbers are regenerated every major release from load tests. They land in docs/16-observability.md.


10. Known growth risks

We keep a living list. Today:

  1. Diagram rendering at very large tenants. Tenants with > 1M resources can hit memory pressure in layout. Mitigation: auto-split by account, then by VPC, then by tag, progressive rendering; roadmap item to offload layout to a GPU-backed ELK server.
  2. Report generation bursts. Hundreds of PDFs at once saturate CPU. Mitigation: separate report-worker pool with its own HPA, queue priority for interactive users over scheduled exports.
  3. Tenant isolation under compromise. A single compromised operator session could read multiple tenants within the 60-minute break-glass window. Mitigation: per-tenant customer managed keys mean the operator has to Decrypt N times, each logged, each visible to the tenant's audit stream.
  4. Noisy neighbours on Aurora. A single tenant running very expensive custom-rule queries can starve others. Mitigation: per-tenant statement timeouts, per-tenant connection-pool quotas, Aurora db.r7g.2xlarge headroom of 50%.
  5. Billing and metering complexity. Resource-based metering is simple to sell but complicated at edge cases (transient resources, decommissioned accounts). Mitigation: we meter on the high-water mark of distinct Resource.id seen in the period, documented clearly; customers can query their own counter via the API.
  6. Support burden as tenant count grows. Mitigation: Free tier is self-serve only, Pro tier is email, Enterprise tier is dedicated channel; internal tooling for operators scales linearly with headcount.
  7. Supply-chain visibility. CycloneDX SBOMs are published but not easily diff-able. Roadmap: in-product SBOM diff and CVE alert subscription.
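The auto-split mitigation in item 1 (by account, then by VPC, then by tag) can be sketched as a recursive grouping. This is a simplified illustration: the `Node` fields, the single 800-node threshold, and the fixed-size chunking used once all three keys are exhausted are assumptions, and edges are ignored.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    id: str
    account: str
    vpc: str
    tag: str


SPLIT_KEYS = ("account", "vpc", "tag")


def auto_split(
    nodes: list[Node],
    max_nodes: int = 800,
    keys: tuple[str, ...] = SPLIT_KEYS,
) -> list[list[Node]]:
    """Split nodes into views of at most max_nodes, one key at a time."""
    if len(nodes) <= max_nodes:
        return [nodes]
    if not keys:
        # Assumption: fall back to fixed-size chunks when no key is left.
        return [nodes[i:i + max_nodes] for i in range(0, len(nodes), max_nodes)]
    groups: dict[str, list[Node]] = defaultdict(list)
    for node in nodes:
        groups[getattr(node, keys[0])].append(node)
    views: list[list[Node]] = []
    for group in groups.values():
        # Only groups that are still too large get split by the next key.
        views.extend(auto_split(group, max_nodes, keys[1:]))
    return views
```

Splitting lazily, only where a group is still too large, keeps small accounts in a single view instead of fragmenting them by VPC or tag.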

11. Support model

Self-hosted customers get the same support tiers scoped to the control plane; we do not debug customers' Kubernetes clusters but we do give you the tools and logs to do it yourself.


13. Contact

For operations questions, the trust pack (SOC 2 summary, last penetration test), or anything else: [email protected]. Put [ops] or [security] in the subject line so we route it correctly.