Architecture & Engineering Deliverables
The target cloud-native platform and the consolidated v1 actually shipped, with the migration path — plus the data model, ETL, ML/forecasting, API, security, and roadmap.
System Architecture
CognitiveCoefficient — System Architecture
> The Bloomberg Terminal / OECD.AI / World Bank Observatory for the intelligence economy.
>
> CognitiveCoefficient (CC) measures, indexes, and forecasts the diffusion of machine intelligence across countries, subnational regions, cities, companies, industries, and occupations, and lets users assess their own exposure and simulate scenarios.
This document describes (A) the full-scale target architecture, (B) the pragmatic v1 consolidated architecture actually shipped, and (C) the migration path between them.
---
A. Target Architecture (full scale, cloud-native)
A.0 Architectural principles
- Read-mostly, analytics-heavy. The product is a data observatory; 95%+ of traffic is reads over precomputed indices, time series, and forecasts. Architecture optimizes for cheap fan-out reads and batch recompute, not OLTP write throughput.
- Warehouse-of-record, serving-cache-of-convenience. A columnar analytical warehouse is the source of analytical truth; Postgres holds only relational/operational state (users, keys, billing, saved scenarios, metadata catalog).
- Reproducible indices. Every published index value is a pure function of (raw indicator values × methodology version × weights). Recompute is deterministic and versioned; a value can always be re-derived and audited.
- Methodology as code + as data. Index definitions, weights, normalization rules, and forecast model configs are versioned artifacts, not ad-hoc SQL.
- Tiered access from day one. Free/Pro/Enterprise/API/Government/University tiers are enforced at the gateway, not bolted on later.
A.1 Logical layers
```
                      ┌──────────────────────────────────────────────┐
Browser / Terminal ──▶│ Edge: CDN + WAF + TLS (Cloudflare/Fastly)    │
                      └───────────────┬──────────────────────────────┘
                                      ▼
            ┌──────────────────────────────────────────────────────────┐
Frontend    │ Next.js / React / TypeScript (App Router, RSC, SSR/ISR)  │
(Vercel or  │ - Terminal UI, map/geo explorer, charting (visx/ECharts) │
K8s pods)   │ - Career/assessment wizard, scenario simulator UI        │
            └───────────────┬──────────────────────────┬───────────────┘
                            ▼                          ▼
            ┌───────────────────────────┐  ┌────────────────────────────┐
API Gateway │ Kong / APISIX / Envoy     │  │ GraphQL Gateway (Apollo/   │
            │ - authN (JWT, API keys)   │  │ Strawberry) federated      │
            │ - rate limit, quotas,     │  └────────────────────────────┘
            │   tier routing, billing   │
            └─────────────┬─────────────┘
                          ▼
┌────────────────────────────────────────────────────────────────────┐
│ Service mesh (Python / FastAPI microservices, async, Pydantic v2)  │
│                                                                    │
│ geo-svc         company-svc     industry-svc    occupation-svc     │
│ index-svc       forecast-svc    scenario-svc    assessment-svc     │
│ career-svc      export-svc      catalog-svc     auth/billing-svc   │
└───────┬────────────────────────┬───────────────────────────┬───────┘
        ▼                        ▼                           ▼
┌───────────────────┐ ┌────────────────────┐ ┌────────────────────────────┐
│ Postgres (OLTP)   │ │ ClickHouse (OLAP   │ │ DuckDB (embedded analytics │
│ users, api_keys,  │ │ serving): indicator│ │ + per-request slices,      │
│ billing, scenarios│ │ _values, indices,  │ │ notebook/export jobs,      │
│ metadata catalog  │ │ forecasts, panels  │ │ small-table joins)         │
└───────────────────┘ └────────────────────┘ └────────────────────────────┘
        ▲                        ▲                           ▲
        │             ┌──────────┴───────────────────────────┴──────────┐
        │             │ Lakehouse (Parquet on S3/R2, Iceberg)           │
        │             └────────────────────────┬────────────────────────┘
        │                                      ▼
┌────────────────────────────────────────────────────────────────────┐
│ ETL / orchestration: Apache Airflow (KubernetesExecutor)           │
│ - 40+ source DAGs (World Bank, IMF, OECD, BLS, OECD.AI, AI Index,  │
│   HuggingFace, GitHub, arXiv, O*NET, WIPO, Top500…)                │
│ - dbt (warehouse transforms), Great Expectations (validation)      │
│ - model training/scoring DAGs (Prophet/XGBoost/PyMC)               │
└────────────────────────────────────────────────────────────────────┘
```
A.2 Frontend
- Next.js 14+ (App Router) in TypeScript. React Server Components for data-dense pages (country/company/industry profiles render server-side against the API), ISR for index/leaderboard pages (revalidate on each ETL publish), client components for interactive widgets.
- Visualization: ECharts + visx for time series/forecast fans; deck.gl/MapLibre for the geo explorer (countries → subnationals → cities choropleths); a "terminal" command palette (cmd-k) for power users à la Bloomberg.
- State/data fetching: TanStack Query against the REST API; SWR for the GraphQL gateway. Auth via NextAuth/Auth.js with the platform's own OIDC.
- Design system: shared component library, dark "terminal" theme primary.
A.3 Backend services (FastAPI/Python)
Each domain is an independently deployable FastAPI service (Python 3.12, async, Pydantic v2 models, Uvicorn/Gunicorn workers, served behind the gateway):
| Service | Responsibility |
|---|---|
| geo-svc | country / state-province / city entities, profiles, rankings |
| company-svc | company profiles, AI-adoption indicators, peer sets |
| industry-svc | industry (NAICS/ISIC) panels, exposure scores |
| occupation-svc | occupation (O\*NET/SOC) automation/augmentation exposure |
| index-svc | index definitions, weights, on-the-fly + materialized index values |
| forecast-svc | serves model forecasts + uncertainty bands; triggers retrains |
| scenario-svc | parametric + Monte-Carlo scenario simulation engine |
| assessment-svc | personal CC-exposure assessments (PII-bearing) |
| career-svc | career path / reskilling recommendations |
| export-svc | bulk CSV/Parquet/JSON exports, async export jobs |
| catalog-svc | data dictionary, source provenance, methodology registry |
| auth/billing-svc | tenants, API keys, tiers, quotas, Stripe metering |
Cross-cutting: shared cc-core Python package (Pydantic schemas, warehouse client, auth middleware, telemetry).
A.4 Data stores
- Postgres 16 (managed, e.g. RDS/Cloud SQL/Neon), OLTP: users, api_keys, subscriptions, scenarios (saved), assessments metadata, audit_log, and the metadata catalog (indicator definitions, sources, methodology versions). Logical replication to the warehouse for catalog joins.
- ClickHouse, the primary OLAP serving store: indicator_values, indices, index_values, forecasts, wide country/industry/occupation panels. MergeTree tables partitioned by (entity_level, year), matching the data model; materialized views precompute rankings and YoY deltas. Powers sub-100ms reads for the terminal.
- DuckDB: embedded engine inside export-svc and analyst/notebook workloads; reads Parquet directly from the lakehouse for ad-hoc joins and per-request scenario slices without hitting ClickHouse.
- Lakehouse: Parquet on S3/R2 with Apache Iceberg table format; raw → staged → curated zones. The immutable raw zone is the audit substrate for reproducibility.
- Redis: gateway rate-limit counters, hot-response cache, Celery/RQ broker for async jobs.
A.5 Warehouse abstraction
A single WarehouseClient interface (in cc-core) abstracts the analytical backend so services are storage-agnostic:
```python
from typing import Protocol
import pyarrow

class WarehouseClient(Protocol):
    def query(self, sql: str, params: dict) -> pyarrow.Table: ...
    def panel(self, entity, indicators, years) -> pyarrow.Table: ...
```
Implementations: DuckDBClient (v1, embedded), ClickHouseClient (scale), LakehouseClient (Parquet/Iceberg). The same SQL dialect surface (ANSI subset + standard window funcs) runs across DuckDB and ClickHouse, so the migration is a config flip, not a rewrite.
A.6 ETL / orchestration
- Apache Airflow (KubernetesExecutor) runs one DAG per source plus transform/index/forecast DAGs (see etl_architecture_md).
- dbt for warehouse transforms (raw → staging → marts), Great Expectations for validation gates, Soda/anomaly jobs for drift.
A.7 Infrastructure & platform
- Containers: every service is a Docker image (distroless/slim Python base), built in CI, scanned (Trivy), signed (cosign).
- Orchestration: Kubernetes (EKS/GKE or k3s on the homelab tier). HPA on the read services; KEDA scales Airflow workers on queue depth.
- IaC: Terraform for cloud (VPC, managed PG, object storage, ClickHouse Cloud/Altinity, DNS, secrets) + Helm charts per service. GitOps via ArgoCD.
- CI/CD: GitHub Actions → build/test/scan → push → ArgoCD sync. Blue/green for the gateway, rolling for services.
- Observability: OpenTelemetry traces → Tempo/Jaeger; Prometheus + Grafana metrics; Loki logs; Sentry for app errors; index-svc/forecast-svc emit data-freshness and model-drift metrics.
- Environments: dev (homelab/k3s), staging, prod.
---
B. v1 Consolidated Architecture (what actually shipped)
The deployed v1 is a single consolidated FastAPI application — deliberately one process, one repo, one deploy. It is the pragmatic compression of the target diagram into a monolith that preserves the seams needed to split later.
```
cognitivecoefficient/
└── app/
    ├── main.py        # FastAPI app: mounts routers, static, pages
    ├── pages/         # server-rendered Jinja/HTML pages (the "terminal")
    ├── static/
    │   └── css/ js/ vendor/   # frontend assets, charting vendored client-side
    ├── data/          # DuckDB file(s) + curated Parquet/CSV snapshots
    ├── routers/       # /country /state /city /company /industry
    │                  # /occupation /index /forecast /scenario
    │                  # /assess /career /simulate
    ├── services/      # geo, index, forecast, scenario, assessment logic
    │                  # = the future microservice boundaries, as modules
    ├── models/        # Pydantic schemas (shared with future cc-core)
    └── ml/            # fitted forecast models + scenario engine
```
Key v1 properties:
- One FastAPI process serves both the HTML pages (app/pages, app/static) and the JSON API (app/routers/*). No separate Next.js frontend yet: the "terminal" is server-rendered HTML + vendored JS charts. This is the app/{pages,static} layout actually on disk.
- DuckDB as the only analytical engine, embedded, reading the curated Parquet/CSV in app/data/. No ClickHouse, no separate warehouse cluster. Indices and forecasts are precomputed offline and shipped as data files; the API serves slices of them.
- SQLite/Postgres-lite for operational state (users, api_keys, saved scenarios): a single relational DB; in the smallest deploy this can be a DuckDB/SQLite file, with a clean swap to Postgres.
- ETL is offline batch scripts, not Airflow: Python collectors run on a schedule (cron/ctask) producing the Parquet snapshots committed/synced into app/data/. The DAG logic exists as ordered functions, ready to lift into Airflow tasks.
- ML is "train offline, serve artifacts." Prophet/XGBoost/PyMC models are fit in notebooks/batch, serialized, and loaded by app/ml; the /forecast and /simulate endpoints serve precomputed + light on-request computation (Monte-Carlo runs bounded for latency).
- Deploy: single container (or static-export + FastAPI) behind Caddy on netops01 at www.cognitivecoefficient.com (apex → www redirect, TLS terminated by Caddy). One image, docker compose up.
- Auth/tiers: API-key middleware in-process; tier limits enforced with Redis or an in-memory/SQLite token bucket.
This is enough to be live, correct, and credibly "Bloomberg-for-AI" on day one, at a fraction of the operational cost.
---
C. Migration Path (v1 → target)
The consolidation was designed to decompose along pre-cut seams, in dependency order:
| Step | Move | Trigger | How (because seams already exist) |
|---|---|---|---|
| 1 | Postgres for OLTP | first paying tenants / multi-instance | Swap the relational DB URL; models/ Pydantic + repo layer already DB-agnostic. |
| 2 | Split the frontend to Next.js | need rich interactive terminal/maps | Stand up Next.js consuming the same JSON routers; retire app/pages. API contract unchanged. |
| 3 | Airflow for ETL | sources/schedules outgrow cron | Wrap existing collector functions as Airflow tasks; output still Parquet → lakehouse. |
| 4 | Lakehouse (Parquet/Iceberg on object store) | data volume / reproducibility/audit needs | Point the warehouse client's raw/curated paths at S3/R2 instead of app/data/. |
| 5 | ClickHouse serving | read latency/concurrency at scale | Implement ClickHouseClient behind the existing WarehouseClient interface; flip config per env. DuckDB stays for exports. |
| 6 | Carve out services (forecast-svc, scenario-svc, assessment-svc first — the compute/PII-heavy ones) | independent scaling / isolation (esp. PII) | Each services/<x> module + its router becomes a FastAPI service importing cc-core. Gateway routes by path. |
| 7 | API Gateway + GraphQL | external API customers, federation | Move key/tier/rate-limit out of the monolith into Kong/APISIX; add Apollo/Strawberry GraphQL over the REST services. |
| 8 | K8s + Terraform + ArgoCD | reliability/SLA commitments | Containers already exist; add Helm/Terraform/GitOps. |
Invariant across the whole path: the public API contract (/country, /index, /forecast, …), the Pydantic schemas, and the WarehouseClient interface never change. v1 monolith and target mesh are the same API over a swappable substrate — which is what makes "consolidated FastAPI now, cloud-native later" a refactor rather than a rebuild.
Data Model
CognitiveCoefficient — Data Model
The data model has three concerns: (1) the analytical core (geographies, organizations, work, indicators, indices, forecasts, scenarios) — read-mostly, columnar-friendly; (2) operational state (users, api_keys, billing, saved scenarios) — relational/OLTP; (3) the metadata catalog (sources, methodology, provenance) — the glue that makes every value auditable.
Naming: snake_case; surrogate id keys (ULID/UUIDv7) for OLTP, stable natural codes (ISO/SOC/NAICS) for analytical dimensions. All fact tables carry source_id, methodology_version, and ingested_at for provenance.
---
1. Geographic dimension
countries
| field | type | notes |
|---|---|---|
| country_id | text PK | ISO 3166-1 alpha-3 (USA, DEU) |
| iso2 | text | ISO 3166-1 alpha-2 |
| name | text | canonical English name |
| region | text | UN/World Bank region |
| subregion | text | |
| income_group | text | World Bank: Low/Lower-mid/Upper-mid/High |
| oecd_member | bool | |
| population | bigint | latest |
| gdp_usd | numeric | latest, nominal |
| gdp_per_capita_ppp | numeric | |
| centroid_geom | geometry | for map rendering |
| cc_rank | int | current CognitiveCoefficient global rank |
subnationals (states/provinces)
| field | type | notes |
|---|---|---|
| subnational_id | text PK | ISO 3166-2 (US-CA, DE-BY) |
| country_id | text FK → countries | |
| name, type | text | "state","province","region" |
| population, gdp_usd | numeric | |
| cc_rank_national | int | rank within country |
cities
| field | type | notes |
|---|---|---|
| city_id | text PK | GeoNames id or country-slug |
| subnational_id | text FK | nullable |
| country_id | text FK | |
| name, metro_name | text | |
| population, lat, lon | numeric | |
| is_tech_hub | bool | |
| cc_rank_global | int | city-level CC rank |
---
2. Organization & work dimensions
companies
| field | type | notes |
|---|---|---|
| company_id | text PK | ULID; ticker/lei as natural keys |
| name, legal_name | text | |
| ticker, lei, domain | text | identifiers |
| hq_country_id, hq_city_id | FK | |
| industry_id | FK → industries | primary |
| employee_count, market_cap_usd | numeric | |
| founded_year | int | |
| is_ai_native | bool | |
| ai_adoption_score | numeric | 0–100 composite |
| cc_rank_industry | int | |
industries
| field | type | notes |
|---|---|---|
| industry_id | text PK | NAICS or ISIC code |
| code_system | text | NAICS \| ISIC |
| name, sector | text | |
| parent_industry_id | text FK | self-ref hierarchy |
| level | int | 2/3/4-digit depth |
| automation_exposure | numeric | 0–1 |
| augmentation_potential | numeric | 0–1 |
| cc_industry_index | numeric | |
occupations
| field | type | notes |
|---|---|---|
| occupation_id | text PK | O\*NET-SOC code (15-1252.00) |
| soc_code | text | 6-digit SOC |
| title, description | text | |
| job_zone | int | O\*NET 1–5 |
| median_wage_usd | numeric | BLS OEWS |
| employment | bigint | BLS |
| automation_exposure | numeric | 0–1 (task-level rollup) |
| augmentation_exposure | numeric | 0–1 |
| ai_complementarity | numeric | -1..1 (substitute↔complement) |
| cc_exposure_index | numeric | headline per-occupation CC |
| growth_outlook | text | BLS projection band |
occupation_tasks (bridge): occupation_id, task_id, task_desc, importance, ai_exposure — the task-level substrate that rolls up into occupation exposure.
---
3. Indicators (the raw measurement layer)
indicators (definition / dictionary)
| field | type | notes |
|---|---|---|
| indicator_id | text PK | e.g. ai_patents_per_capita |
| name, description | text | |
| unit | text | %, USD, count, index |
| entity_level | text | country \| subnational \| city \| company \| industry \| occupation |
| source_id | FK → sources | primary source |
| direction | text | higher_better \| lower_better |
| frequency | text | annual \| quarterly \| monthly |
| min_year, max_year | int | coverage |
| methodology_url | text | |
indicator_values (the fact table — largest)
| field | type | notes |
|---|---|---|
| indicator_id | FK | |
| entity_level | text | partition key |
| entity_id | text | resolves to the right dimension by level |
| year | int | (or period for sub-annual) |
| value | numeric | raw measured value |
| value_normalized | numeric | 0–100 min-max/z-score (per methodology) |
| source_id | FK | |
| methodology_version | text | |
| confidence | numeric | 0–1 data-quality score |
| is_imputed | bool | flag for filled gaps |
| ingested_at | timestamp | |
PK: (indicator_id, entity_level, entity_id, year, methodology_version). In ClickHouse: MergeTree PARTITION BY (entity_level, year) ORDER BY (indicator_id, entity_id).
---
4. Indices (the published products)
indices (index definitions)
| field | type | notes |
|---|---|---|
| index_id | text PK | cc_global, cc_industry, cc_occupation |
| name, description | text | |
| entity_level | text | what it ranks |
| methodology_version | text | current published version |
| pillars | jsonb | named sub-pillars |
| weights | jsonb | {indicator_id: weight} |
| normalization | text | minmax \| zscore \| rank |
| aggregation | text | weighted_mean \| geometric |
| published_at | timestamp | |
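To make the weights and aggregation fields concrete, here is a minimal sketch of how a headline score might be derived from already-normalized indicator values. The function name aggregate_score and the sample indicator ids are hypothetical; the document does not give the actual implementation, and details such as how missing indicators are handled are assumptions.

```python
import math


def aggregate_score(
    normalized: dict[str, float],
    weights: dict[str, float],
    aggregation: str = "weighted_mean",
) -> float:
    """Combine 0-100 normalized indicator values into one score.

    The result depends only on its arguments, which is what makes a
    published value re-derivable from (values, weights, method).
    """
    missing = set(weights) - set(normalized)
    if missing:
        raise KeyError(f"missing indicators: {sorted(missing)}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Rescale so the weights sum to 1 regardless of how they were stored.
    shares = {k: w / total for k, w in weights.items()}

    if aggregation == "weighted_mean":
        return sum(shares[k] * normalized[k] for k in shares)
    if aggregation == "geometric":
        # Weighted geometric mean; only defined for strictly positive inputs.
        if any(normalized[k] <= 0 for k in shares):
            raise ValueError("geometric aggregation needs positive values")
        return math.exp(sum(shares[k] * math.log(normalized[k]) for k in shares))
    raise ValueError(f"unknown aggregation: {aggregation}")


# Hypothetical indicator ids, for illustration only.
example = aggregate_score(
    {"ai_patents_per_capita": 40.0, "compute_density": 80.0},
    {"ai_patents_per_capita": 1.0, "compute_density": 3.0},
)
```

The geometric option penalizes an entity that is very weak on one indicator more than the weighted mean does, which is the usual reason to offer both.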
index_values
| field | type | notes |
|---|---|---|
| index_id | FK | |
| entity_level, entity_id | text | |
| year | int | |
| score | numeric | 0–100 headline |
| rank_global, rank_peer | int | |
| percentile | numeric | |
| pillar_scores | jsonb | breakdown |
| methodology_version | text | reproducibility |
---
5. Forecasts & scenarios
forecasts
| field | type | notes |
|---|---|---|
| forecast_id | text PK | |
| target_type | text | indicator \| index |
| target_id | text | which indicator/index |
| entity_level, entity_id | text | |
| model | text | prophet \| xgboost \| bass \| logistic \| gompertz \| pymc |
| model_version | text | |
| horizon_years | int | |
| trained_at | timestamp | |
| backtest_mape | numeric | accuracy metadata |
forecast_values
| field | type | notes |
|---|---|---|
| forecast_id | FK | |
| year | int | future period |
| point | numeric | median/expected |
| p05,p25,p50,p75,p95 | numeric | uncertainty bands |
| is_historical | bool | fitted vs projected |
scenarios
| field | type | notes |
|---|---|---|
| scenario_id | text PK | ULID |
| owner_user_id | FK → users | null for system presets |
| name, description | text | |
| base_target_id | text | indicator/index forecast it perturbs |
| assumptions | jsonb | levers: adoption_rate, policy_shock, compute_growth, shock_year… |
| method | text | parametric \| monte_carlo \| bayesian |
| n_draws | int | MC sample size |
| is_public | bool | |
| created_at | timestamp | |
scenario_results (cached runs): scenario_id, year, p05..p95, point, run_id, computed_at.
---
6. Operational state (OLTP / Postgres)
users
| field | type | notes |
|---|---|---|
| user_id | text PK | UUIDv7 |
| email | citext UNIQUE | |
| password_hash | text | argon2id (null if SSO-only) |
| display_name | text | |
| org_id | FK → orgs | tenant |
| tier | text | free \| pro \| enterprise \| gov \| university |
| role | text | member \| admin \| owner |
| email_verified_at, created_at, last_login_at | timestamp | |
| mfa_enabled | bool | |
orgs (tenants)
org_id PK, name, tier, seat_limit, data_residency, stripe_customer_id, created_at.
api_keys
| field | type | notes |
|---|---|---|
| api_key_id | text PK | |
| org_id, created_by_user_id | FK | |
| key_prefix | text | shown in UI (cc_live_ab12…) |
| key_hash | text | sha256/argon2 of full key; the secret itself is never stored |
| scopes | text[] | read:index, read:forecast, run:simulate, read:assess… |
| tier | text | for rate-limit class |
| rate_limit_rpm | int | override |
| last_used_at, expires_at, revoked_at | timestamp | |
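The key_prefix / key_hash split can be illustrated with a short sketch: the full key is shown once, and only a display prefix and a digest are persisted. The function names, the 4-character prefix tail, and the choice of SHA-256 over argon2 here are assumptions for illustration.

```python
import hashlib
import hmac
import secrets

KEY_PREFIX = "cc_live_"  # prefix style taken from the table above


def issue_api_key() -> tuple[str, dict[str, str]]:
    """Return (full key to show the user once, row to persist)."""
    full_key = KEY_PREFIX + secrets.token_urlsafe(32)
    row = {
        # Short, non-secret prefix so users can recognize a key in the UI.
        "key_prefix": full_key[: len(KEY_PREFIX) + 4],
        # Only the digest is stored; the full key cannot be recovered from it.
        "key_hash": hashlib.sha256(full_key.encode()).hexdigest(),
    }
    return full_key, row


def verify_api_key(presented: str, stored_hash: str) -> bool:
    """Hash the presented key and compare in constant time."""
    digest = hashlib.sha256(presented.encode()).hexdigest()
    return hmac.compare_digest(digest, stored_hash)
```

A fast hash is generally considered acceptable for high-entropy random keys like these; a slow hash such as argon2 matters more for human-chosen passwords.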
subscriptions
subscription_id, org_id, plan, status, stripe_subscription_id, current_period_end, metered_usage jsonb.
assessments (personal — PII-bearing, isolated)
| field | type | notes |
|---|---|---|
| assessment_id | text PK | |
| user_id | FK (nullable for anon) | |
| occupation_id | FK | self-reported role |
| inputs | jsonb (encrypted at rest) | skills, tasks, region (PII) |
| cc_personal_score | numeric | their exposure result |
| recommendations | jsonb | career-svc output |
| created_at | timestamp | |
| retention_expires_at | timestamp | TTL for auto-purge |
audit_log
audit_id, actor_user_id, org_id, action, target, ip, user_agent, at, metadata jsonb — append-only.
---
7. Metadata catalog (provenance)
sources
| field | type | notes |
|---|---|---|
| source_id | text PK | world_bank, oecd_ai, bls, onet, wipo, top500, hf, github, arxiv |
| name, org, url | text | |
| license | text | redistribution terms |
| update_frequency | text | |
| last_ingested_at | timestamp | freshness |
| reliability_tier | int | 1 (official stats) → 3 (scraped) |
methodology_versions
methodology_version PK, index_id, changelog, effective_date, weights snapshot, author — every published number references the exact methodology that produced it, enabling deterministic re-derivation and audit.
---
Entity-relationship summary
```
countries 1─* subnationals 1─* cities
countries 1─* companies *─1 industries (self-hierarchy)
occupations *─* occupation_tasks
indicators 1─* indicator_values ──┐
indices    1─* index_values       ├─ all FK source_id → sources
forecasts  1─* forecast_values    │    + methodology_version
scenarios  1─* scenario_results ──┘
orgs 1─* users 1─* api_keys ; orgs 1─* subscriptions
users 1─* scenarios ; users 1─* assessments(PII)
everything analytical → sources, methodology_versions (provenance)
```
Data-Collection (ETL)
CognitiveCoefficient — ETL / Data-Collection Engine
The ETL engine is the moat: a continuously-updated, validated, normalized, fully-provenanced panel of the intelligence economy assembled from ~13 authoritative sources. The mantra: never trust a number without a source, a method, and a freshness stamp.
---
1. Sources
| Source | What it provides | Entity level | Cadence | Access |
|---|---|---|---|---|
| World Bank (WDI/Indicators API) | GDP, population, R&D %GDP, education, internet/infra | country/subnational | annual | REST/JSON |
| IMF (SDMX / IFS, WEO) | macro, productivity, GDP forecasts | country | quarterly/annual | SDMX |
| OECD (SDMX) | productivity, ICT, skills, R&D, business demography | country | annual | SDMX |
| OECD.AI Policy Observatory | AI policies, AI investment, AI jobs, compute | country | irregular | API/CSV |
| Stanford AI Index | model counts, training compute, papers, $ investment, benchmarks | country/global | annual | CSV/report |
| BLS (OEWS, Employment Projections) | wages, employment, occupation projections | occupation | annual | API/flat files |
| US Census / ACS | industry employment, business formation, demographics | subnational/city/industry | annual | API |
| **O*NET** | task/skill/ability content per occupation | occupation/task | semi-annual | DB download |
| WIPO (PATENTSCOPE / IP statistics) | AI patent filings/grants | country/company | annual | bulk/API |
| HuggingFace Hub | model/dataset counts, downloads, org activity | company/country (by org) | continuous | API |
| GitHub | AI-repo activity, stars, contributors, language trends | company/country | continuous | GraphQL API |
| arXiv | AI/ML paper volume, authorship, affiliations | country/company | continuous | OAI-PMH/API |
| Top500 / Green500 | supercomputer/compute capacity by country/org | country/company | semi-annual | scrape/CSV |
(Extensible: Crunchbase/PitchBook for funding, LinkedIn-style talent feeds, Epoch AI compute, etc. as licensing allows.)
---
2. Orchestration — Airflow (target)
One source DAG per provider, plus transform/index/forecast DAGs, on a global schedule:
```
world_bank_ingest          @yearly + monthly freshness check
imf_ingest                 @quarterly
oecd_sdmx_ingest           @yearly
oecd_ai_ingest             @weekly
stanford_ai_index_ingest   @yearly (+ manual on release)
bls_ingest                 @yearly
census_acs_ingest          @yearly
onet_ingest                @monthly
wipo_ingest                @yearly
huggingface_ingest         @daily
github_ingest              @daily
arxiv_ingest               @daily
top500_ingest              @monthly
        │
        ▼  (all land raw → lakehouse raw zone)
transform_normalize_dag    @daily (dbt: raw→staging→marts, normalization)
build_indices_dag          triggered on new marts (index-svc methodology)
train_forecasts_dag        triggered on new indices (ML DAGs)
publish_dag                loads ClickHouse serving tables + invalidates ISR cache
data_quality_dag           Great Expectations gates + anomaly scan (every run)
```
Each source DAG: extract → land_raw → validate → normalize → upsert_staging. KubernetesExecutor scales workers; KEDA on queue depth. Backfills are date-partitioned and idempotent. Secrets (API tokens) from the secrets manager, never in code.
v1 shipped: these are ordered Python collector scripts run by cron/ctask, each emitting curated Parquet/CSV into app/data/; the DAG topology above is the lift-and-shift target. Same extract/validate/normalize functions, wrapped as Airflow tasks later.
---
3. Pipeline stages
3.1 Extract & land (raw zone)
Pull via each source's native protocol (REST/SDMX/OAI-PMH/GraphQL/bulk). Land raw, immutable, partitioned by source/ingest_date in object storage (Parquet) — this raw zone is the audit substrate and enables full re-derivation. Capture HTTP metadata, response hash, and row counts.
3.2 Validate (gate)
Great Expectations suites per source: schema/type conformance, value ranges (e.g. shares ∈ [0,1], wages > 0), referential integrity (entity codes resolve to dimensions), expected row-count/coverage. A failed suite quarantines the batch (does not publish) and alerts — bad data never reaches the warehouse.
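The gate logic can be sketched in plain Python to show the quarantine decision. This is not the Great Expectations API; the names GateResult and validate_batch, the row shape, and the "share" unit label are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    passed: bool
    failures: list[str]


def validate_batch(
    rows: list[dict], known_entities: set[str], min_rows: int
) -> GateResult:
    """Run range, referential-integrity, and coverage checks on one batch."""
    failures: list[str] = []

    # Coverage: an unexpectedly small batch usually means a broken extract.
    if len(rows) < min_rows:
        failures.append(f"row count {len(rows)} below expected minimum {min_rows}")

    for i, row in enumerate(rows):
        value = row.get("value")
        # Type conformance (bool is excluded even though it subclasses int).
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            failures.append(f"row {i}: value is not numeric")
            continue
        # Range check: shares must lie in [0, 1].
        if row.get("unit") == "share" and not 0.0 <= value <= 1.0:
            failures.append(f"row {i}: share {value} outside [0, 1]")
        # Referential integrity: the entity code must resolve to a dimension.
        if row.get("entity_id") not in known_entities:
            failures.append(f"row {i}: unknown entity {row.get('entity_id')!r}")

    # Any failure quarantines the whole batch; nothing is published.
    return GateResult(passed=not failures, failures=failures)
```

Returning every failure instead of stopping at the first one gives the alert enough detail to triage a quarantined batch in one pass.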
3.3 Normalize & conform
- Entity resolution: map every source's geo/company/industry/occupation identifiers to canonical keys (ISO-3166, NAICS/ISIC, O\*NET-SOC, internal company_id). Crosswalk tables maintained in the catalog (e.g. SOC↔O\*NET, NAICS↔ISIC, GitHub-org↔company).
- Unit & currency harmonization: convert to common units, deflate/PPP-adjust where needed.
- Indicator normalization: min-max → 0–100 and/or z-score per methodology, with direction (higher/lower-better) applied. Output indicator_values.value + value_normalized.
- Temporal alignment: annualize/interpolate sub-annual series to the common panel grid; flag interpolated points.
- Imputation: model-based gap-fill (XGBoost) for sparse-but-needed indicators, always written with is_imputed = true and a confidence score.
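The indicator-normalization step above can be sketched as follows. This is a minimal illustration of min-max scaling with the direction flag; the function name and the choice to map a flat series to 50 are assumptions, since the document does not say how degenerate cases are handled.

```python
def minmax_normalize(
    values: dict[str, float], direction: str = "higher_better"
) -> dict[str, float]:
    """Scale raw values to 0-100 across entities, flipping for lower_better."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # Assumption: with no spread, every entity gets the midpoint.
        return {entity: 50.0 for entity in values}
    normalized = {}
    for entity, raw in values.items():
        scaled = 100.0 * (raw - lo) / (hi - lo)
        # For lower_better indicators the best (lowest) raw value scores 100.
        normalized[entity] = scaled if direction == "higher_better" else 100.0 - scaled
    return normalized


# Hypothetical raw values for three entities.
scores = minmax_normalize({"A": 0.0, "B": 5.0, "C": 10.0})
```

Applying direction at this stage means every value_normalized reads "higher is better", so index aggregation does not need to know each indicator's polarity.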
3.4 Load & publish
dbt builds staging→marts (the conformed panel); build_indices_dag computes index_values from current methodology_version; serving tables loaded to ClickHouse (v1: DuckDB files in app/data/); CDN/ISR cache invalidated so the terminal shows fresh numbers.
---
4. Data quality & anomaly detection
- Validation gates (above) block structurally-bad batches.
- Statistical anomaly detection on every refresh: year-over-year jumps beyond entity-specific thresholds, outliers vs. peer distribution (robust z / IQR), break-point detection on long series, and cross-source disagreement checks (e.g. World Bank vs. IMF GDP must agree within tolerance). Anomalies are flagged, not silently dropped: they are surfaced to a review queue and lower the confidence score on affected values.
- Confidence scoring: every indicator_value carries confidence (0–1) derived from source reliability_tier, recency, imputation flag, and anomaly status; this propagates into index uncertainty.
- Freshness SLAs: sources.last_ingested_at monitored; stale sources alert and the catalog/UI badge the data age. Lineage (raw → staging → mart → index → forecast) is tracked end-to-end so any published number traces to its raw rows, source, and methodology version.
- Observability: each DAG emits row counts, null rates, validation pass/fail, and runtime to Prometheus/Grafana; failures page via the alerting channel.
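The peer-distribution outlier check could be sketched with a robust z-score based on the median absolute deviation (MAD). The 3.5 threshold is a common rule of thumb, not a value from this document, and the function names are hypothetical.

```python
import statistics


def robust_z_scores(values: dict[str, float]) -> dict[str, float]:
    """Robust z per entity: 0.6745 * (x - median) / MAD."""
    median = statistics.median(values.values())
    mad = statistics.median(abs(v - median) for v in values.values())
    if mad == 0:
        # No spread around the median, so nothing can be ranked as an outlier.
        return {entity: 0.0 for entity in values}
    # 0.6745 makes the score comparable to a standard z under normality.
    return {entity: 0.6745 * (v - median) / mad for entity, v in values.items()}


def flag_anomalies(values: dict[str, float], threshold: float = 3.5) -> list[str]:
    """Return entities whose robust z exceeds the threshold, for review."""
    scores = robust_z_scores(values)
    return sorted(entity for entity, z in scores.items() if abs(z) > threshold)


# Hypothetical peer group with one suspicious value.
flagged = flag_anomalies({"A": 10.0, "B": 11.0, "C": 9.0, "D": 10.5, "E": 50.0})
```

Median and MAD are used instead of mean and standard deviation because a single extreme value inflates the standard deviation and can mask itself.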
Forecasting & ML
CognitiveCoefficient — Forecasting & ML Architecture
The ML layer answers two questions the indices alone cannot: "where is each entity headed?" (forecasting) and "what if?" (scenario simulation). Both are first-class, uncertainty-quantified products, not decorations on a chart.
Design stance: technology diffusion is the right prior. AI adoption across countries/industries/occupations follows S-curves, so the model library is anchored on diffusion theory (Bass, logistic, Gompertz) and augmented with statistical/ML models (Prophet, XGBoost) and a Bayesian core (PyMC) for principled uncertainty.
---
1. Model library & roles
1.1 Diffusion / adoption curves (structural backbone)
- Bass diffusion: the workhorse for adoption of a technology in a population (share of firms/workers using AI, share of tasks automated). Parameters: p (innovation/external influence), q (imitation/word-of-mouth), m (market potential / ceiling). Fit by NLS or Bayesian; gives interpretable, bounded, theory-grounded trajectories.
- Logistic (Verhulst): symmetric S-curve for saturating indicators (compute density, model-capability proxies) where inflection is near the midpoint.
- Gompertz: asymmetric S-curve (slow start, fast middle, long tail), better for capability/penetration metrics that decelerate late. We fit all three and select by backtest + information criteria per series.
- Role: primary for bounded, saturating indicators and indices (adoption %, exposure share). Their parameters are also the levers the scenario engine perturbs.
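For reference, the three curve families can be written as closed-form functions. These are the standard textbook forms; the parameter names other than p, q, and m (rate, midpoint, displacement, ceiling) are illustrative choices rather than names from this document, and no fitting is shown.

```python
import math


def bass_cumulative(t: float, p: float, q: float, m: float) -> float:
    """Cumulative adopters at time t under the Bass model.

    p: innovation coefficient, q: imitation coefficient, m: market potential.
    """
    decay = math.exp(-(p + q) * t)
    return m * (1.0 - decay) / (1.0 + (q / p) * decay)


def logistic(t: float, ceiling: float, rate: float, midpoint: float) -> float:
    """Symmetric S-curve; reaches half the ceiling at t = midpoint."""
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))


def gompertz(t: float, ceiling: float, displacement: float, rate: float) -> float:
    """Asymmetric S-curve approaching the ceiling from below."""
    return ceiling * math.exp(-displacement * math.exp(-rate * t))


# Hypothetical parameters, chosen only to show the shapes.
trajectory = [round(bass_cumulative(t, p=0.03, q=0.38, m=100.0), 1) for t in range(0, 21, 5)]
```

All three are bounded above by their ceiling (m for Bass), which is why they suit adoption shares and other saturating indicators.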
1.2 Prophet
- Role: fast, robust forecasting for indicators with trend + seasonality + holidays/regime shifts and irregular history (many official-stat series are annual/quarterly with gaps). Used as a strong baseline and for series without a clear saturation ceiling. Native uncertainty intervals; decomposable (trend/seasonal components shown in the terminal).
1.3 XGBoost (gradient-boosted trees)
- Role: cross-sectional + panel forecasting where an entity's trajectory depends on covariates, e.g. predict a country's next-year CC pillar from its GDP, R&D spend, patent flow (WIPO), GitHub/HF activity, talent metrics. Captures nonlinear interactions across the rich indicator feature space. Trained on the panel (entity × year × indicators); quantile objective (reg:quantileerror) yields p05/p50/p95 directly. Also powers gap-filling/imputation of sparse indicators (with is_imputed flagged).
1.4 PyMC (Bayesian core)
- Role: the principled-uncertainty engine. Three uses:
  - Bayesian Bass/logistic: priors on p, q, m (informed by peer entities / hierarchical pooling), full posterior over curve parameters → posterior predictive trajectories with honest bands, even with short history.
  - Hierarchical models: partial pooling across entities (countries within a region, occupations within a family) so data-poor entities borrow strength from data-rich peers.
  - Scenario priors: analyst assumptions enter as priors/interventions, propagated to the posterior.
- MCMC (NUTS) for inference; convergence reported (r_hat, ESS) and surfaced in /simulate responses.
1.5 Model selection
Per (target, entity) we maintain a small ensemble/championship: fit Bass/logistic/Gompertz/Prophet/XGBoost, backtest (rolling-origin), and pick by out-of-sample error (MAPE/MASE) with a structural-model tie-break (prefer diffusion models when fit is comparable, for interpretability and lever-ability). The chosen model + model_version is stored on every forecast row.
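The rolling-origin backtest and champion pick can be sketched with two toy forecasters standing in for the real model families. The names and the one-step-ahead setup are assumptions; the structural tie-break described above is not implemented here, and MAPE as written requires nonzero actuals.

```python
from typing import Callable, Sequence

# A forecaster maps the history seen so far to a one-step-ahead prediction.
Forecaster = Callable[[Sequence[float]], float]


def naive_last(history: Sequence[float]) -> float:
    """Toy baseline: repeat the last observed value."""
    return history[-1]


def linear_drift(history: Sequence[float]) -> float:
    """Toy baseline: extend the average historical step by one period."""
    return history[-1] + (history[-1] - history[0]) / (len(history) - 1)


def rolling_origin_mape(
    series: Sequence[float], forecaster: Forecaster, min_train: int = 3
) -> float:
    """Refit at each origin, forecast one step ahead, average the % errors."""
    errors = []
    for origin in range(min_train, len(series)):
        predicted = forecaster(series[:origin])  # only data before the origin
        actual = series[origin]
        errors.append(abs(actual - predicted) / abs(actual))
    return 100.0 * sum(errors) / len(errors)


def pick_champion(
    series: Sequence[float], candidates: dict[str, Forecaster]
) -> tuple[str, float]:
    """Return the candidate with the lowest out-of-sample MAPE."""
    scores = {name: rolling_origin_mape(series, f) for name, f in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]


champion, mape = pick_champion(
    [10.0, 12.0, 14.0, 16.0, 18.0, 20.0],
    {"naive_last": naive_last, "linear_drift": linear_drift},
)
```

Each forecast uses only observations before its origin, so the error is genuinely out-of-sample, which is the property backtest_mape is meant to report.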
---
2. Scenario simulation
Three modes, all exposed via /scenario and /simulate:
2.1 Parametric / deterministic
Analyst sets levers (adoption rate p,q, ceiling m, a policy_shock at year t, a compute_growth multiplier). Engine perturbs the fitted diffusion parameters and re-projects a single trajectory + analytically-propagated band. Fast, used for the interactive sliders in the terminal.
2.2 Monte-Carlo
Sample parameters/shocks from their distributions (n_draws 1k–50k by tier), propagate each through the curve, and aggregate to a fan (p05…p95). Captures interaction of multiple uncertain levers (e.g. uncertain q × stochastic policy shock). Vectorized (NumPy) for latency; large runs go async via the job queue and cache to scenario_results.
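A minimal sketch of the Monte-Carlo mode follows. It uses the standard library instead of vectorized NumPy so that it stays self-contained, a discrete-time (annual-step) Bass recursion so the shock can change q mid-trajectory, and assumed distributions for q and the policy shock. None of the names or numbers are the production engine's.

```python
import math
import random
from typing import Dict, List


def simulate_path(p: float, q: float, m: float, horizon: int,
                  shock_year: int, shock_delta: float) -> List[float]:
    """Discrete-time Bass recursion; the shock shifts q from shock_year onward."""
    level, path = 0.0, []
    for year in range(1, horizon + 1):
        q_eff = q + shock_delta if year >= shock_year else q
        q_eff = min(max(q_eff, 0.0), 0.9)  # keep the step factor below 1
        level += (p + q_eff * level / m) * (m - level)
        path.append(level)
    return path


def quantile(sorted_vals: List[float], prob: float) -> float:
    """Nearest-rank quantile of an already sorted list."""
    idx = min(len(sorted_vals) - 1, max(0, math.ceil(prob * len(sorted_vals)) - 1))
    return sorted_vals[idx]


def monte_carlo_fan(n_draws: int, horizon: int, seed: int = 0) -> List[Dict[str, float]]:
    rng = random.Random(seed)  # seeded so a run can be reproduced
    paths = [
        simulate_path(p=0.03, q=rng.gauss(0.45, 0.05), m=100.0, horizon=horizon,
                      shock_year=2, shock_delta=rng.gauss(-0.10, 0.03))
        for _ in range(n_draws)
    ]
    fan = []
    for step in range(horizon):
        values = sorted(path[step] for path in paths)
        fan.append({"step": step + 1, "p05": quantile(values, 0.05),
                    "p50": quantile(values, 0.50), "p95": quantile(values, 0.95)})
    return fan


fan = monte_carlo_fan(n_draws=2000, horizon=8)
```

Each draw carries its own q and shock through the whole path, which is what lets the fan reflect the interaction between levers instead of summing independent bands.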
2.3 Bayesian scenarios
The PyMC posterior is the scenario distribution: interventions are applied as do-operations (pm.do), and the posterior predictive gives the counterfactual fan with full parameter uncertainty. This is the rigorous tier (Enterprise/Government) for policy analysis.
---
3. Uncertainty quantification (non-negotiable)
Every forecast/scenario ships bands, not just points:
- Bayesian models → posterior predictive quantiles.
- XGBoost → quantile regression (p05/p25/p50/p75/p95).
- Prophet → built-in intervals (with mcmc_samples for full Bayesian intervals on key series).
- Diffusion NLS → bootstrap/delta-method bands.
Bands are stored (forecast_values.p05..p95), rendered as fans in the UI, and returned in the API. We also publish backtest accuracy (backtest_mape) alongside each forecast so users can weight it — honesty about uncertainty is a product differentiator vs. point-estimate competitors.
---
4. Training, serving, and the v1 path
- Target: training runs as Airflow DAGs (one per model-family/target) on a schedule after ETL publishes new data; artifacts (fitted params, serialized models, posterior summaries) are versioned in a model registry (MLflow). forecast-svc loads artifacts and serves; scenario-svc runs Monte-Carlo/Bayesian on demand. Heavy training runs on GPU/CPU workers (the homelab 6×3090 tier is ideal for PyMC/XGBoost sweeps).
- v1 shipped: train offline, serve artifacts. Models are fit in batch/notebooks, serialized into app/ml/, and the consolidated FastAPI app loads them at startup. /forecast serves precomputed forecast_values; /simulate runs bounded Monte-Carlo in-process (draw caps + async fallback) so latency stays sane without a separate service. Same model code, same artifacts; only the orchestration and serving topology change on the way to target.
---
5. Model monitoring & governance
- Drift detection: PSI/KS tests on input indicator distributions per ETL run; when inputs drift beyond threshold, flag for retrain.
- Forecast accuracy tracking: as actuals arrive, compute realized error vs. prior forecasts → rolling MAPE/MASE dashboards per model/entity; auto-demote a champion if a challenger beats it.
- Anomaly guardrails: reject/quarantine forecasts that violate structural constraints (e.g. adoption > ceiling, negative counts).
- Reproducibility: model_version + input methodology_version + raw-data snapshot pin every forecast; any published number can be re-derived.
- Reporting: convergence diagnostics (r_hat, ESS) surfaced to users for Bayesian runs; backtest cards published in the catalog. Metrics flow to Prometheus/Grafana; retraining and demotions are logged to audit_log.
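The drift check listed above can be illustrated with a small Population Stability Index (PSI) function. This is a sketch under assumptions: equal-width bins taken from the reference sample, a small floor to avoid log(0), and a 0.25 retrain threshold, which is a common rule of thumb rather than a value stated in this document.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        n_bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a reference sample and a new sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant sample

    def shares(sample: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for x in sample:
            idx = int((x - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clamp out-of-range values
        return [max(c / len(sample), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


RETRAIN_THRESHOLD = 0.25  # assumed rule-of-thumb cutoff

reference = [i / 100 for i in range(1000)]  # last ETL run (toy data)
shifted = [x + 3.0 for x in reference]      # new run with a level shift
needs_retrain = psi(reference, shifted) > RETRAIN_THRESHOLD
```

Running this per indicator on each ETL run gives a single number to compare against the threshold; a KS test could be run alongside it for a second opinion.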
API Specification
Cognitive Capital API — Specification
Base URL: https://api.cognitivecoefficient.com/v1
Content type: application/json. All timestamps are ISO-8601 UTC. All money is in USD unless noted. The same API backs the web terminal (v1: served by the consolidated FastAPI app) and external customers.
---
Authentication
Two mechanisms:
- API key (machine): header Authorization: Bearer cc_live_<key> or X-API-Key: cc_live_<key>. Keys are tier-scoped and scope-limited (read:index, read:forecast, run:simulate, read:assess…).
- OIDC/JWT (browser/session): short-lived access token from the platform's auth service; refresh via cookie. Used by the web terminal.
Keys are shown once on creation; only a hash is stored. Test keys use the cc_test_ prefix and read frozen fixture data.
401 unauthenticated · 403 scope/tier denied · 429 rate-limited
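One way to realize "shown once, only a hash is stored" is sketched below. The choice of SHA-256 and the function names are assumptions; the specification only states that a hash is stored and that keys use the cc_live_ / cc_test_ prefixes.

```python
import hashlib
import hmac
import secrets
from typing import Tuple


def hash_key(plaintext: str) -> str:
    # SHA-256 is an assumption; it is adequate for high-entropy random keys.
    return hashlib.sha256(plaintext.encode("utf-8")).hexdigest()


def create_api_key(live: bool = True) -> Tuple[str, str]:
    """Return (plaintext_key, key_hash). Only key_hash should be persisted."""
    prefix = "cc_live_" if live else "cc_test_"
    plaintext = prefix + secrets.token_urlsafe(32)  # 32 random bytes
    return plaintext, hash_key(plaintext)


def verify_key(presented: str, stored_hash: str) -> bool:
    """Hash the presented key and compare in constant time."""
    return hmac.compare_digest(hash_key(presented), stored_hash)


plaintext_key, key_hash = create_api_key()
# plaintext_key is returned to the caller once; key_hash goes to the database.
```

A fast hash is reasonable here because the key itself is random and long; low-entropy secrets such as passwords need a slow password hash instead.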
---
Tiers & rate limits
| Tier | Requests/min | Daily cap | Concurrency | Forecast horizon | Bulk export | Scopes |
|---|---|---|---|---|---|---|
| Free | 30 | 1,000 | 2 | ≤3 yrs | — | read:* (current year + 5y history) |
| Professional | 120 | 50,000 | 10 | ≤10 yrs | CSV ≤1M rows | + run:simulate, full history |
| API (metered) | 600 | per-plan | 50 | ≤15 yrs | Parquet/JSON | all read + simulate |
| Enterprise | 2,000 | custom | 200 | ≤25 yrs | full lakehouse export | all + read:assess (own tenant) |
| Government | custom | custom | custom | full | full | all + subnational deep-dives |
| University | 300 | 100,000 | 20 | ≤15 yrs | CSV/Parquet (research) | all read + simulate |
Rate-limit headers on every response:
X-RateLimit-Limit: 120
X-RateLimit-Remaining: 117
X-RateLimit-Reset: 1750700000
Retry-After: 12 # only on 429
Limits enforced at the gateway with a Redis token bucket; metered tiers also emit usage events to billing.
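The token-bucket idea can be sketched in-process as follows. This is not the Redis implementation: a shared gateway would need the refill-and-take step to be atomic in Redis, and the class name, the `capacity` parameter, and the header handling here are illustrative assumptions.

```python
import math
import time
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class TokenBucket:
    rate_per_min: int  # sustained refill rate, e.g. 120 for Professional
    capacity: int      # burst size (an assumed parameter)
    tokens: float = field(init=False)
    updated_at: float = 0.0

    def __post_init__(self) -> None:
        self.tokens = float(self.capacity)  # start with a full bucket

    def allow(self, now: Optional[float] = None) -> Tuple[bool, Dict[str, str]]:
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated_at)
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_min / 60.0)
        self.updated_at = now
        allowed = self.tokens >= 1.0
        if allowed:
            self.tokens -= 1.0
        headers = {
            "X-RateLimit-Limit": str(self.rate_per_min),
            "X-RateLimit-Remaining": str(int(self.tokens)),
        }
        if not allowed:
            wait_s = (1.0 - self.tokens) * 60.0 / self.rate_per_min
            headers["Retry-After"] = str(math.ceil(wait_s))  # sent with the 429
        return allowed, headers


bucket = TokenBucket(rate_per_min=120, capacity=2)
first, _ = bucket.allow(now=100.0)
second, _ = bucket.allow(now=100.0)
third, third_headers = bucket.allow(now=100.0)  # bucket is empty here
```

One bucket per API key (keyed by the key hash) gives the per-key quota; the daily cap would be a separate counter.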
Pagination: ?limit= (default 50, max 1000) + ?cursor=. Filtering: ?year=, ?from=&to=, ?fields=. Standard error envelope:
{ "error": { "code": "tier_forbidden", "message": "Forecast horizon >10y is not available on the Professional tier.", "request_id": "req_01J..." } }
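As an illustration of how a tier check might produce this envelope, the sketch below maps each tier to the forecast-horizon cap from the tier table and returns either nothing or a tier_forbidden error. The function name, the lowercase tier identifiers, and the message wording are assumptions.

```python
from typing import Dict, Optional

# Maximum forecast horizon in years per tier, copied from the tier table.
# None means the table lists no fixed cap ("full").
MAX_HORIZON_YEARS: Dict[str, Optional[int]] = {
    "free": 3, "professional": 10, "api": 15,
    "enterprise": 25, "government": None, "university": 15,
}


def check_horizon(tier: str, horizon_years: int, request_id: str) -> Optional[dict]:
    """Return None if the request is allowed, else an error envelope for HTTP 403."""
    if tier not in MAX_HORIZON_YEARS:
        # Deny by default: an unknown tier gets no access.
        message = f"Unknown tier '{tier}'."
    else:
        cap = MAX_HORIZON_YEARS[tier]
        if cap is None or horizon_years <= cap:
            return None
        message = f"Forecast horizon >{cap}y is not available on the {tier} tier."
    return {"error": {"code": "tier_forbidden", "message": message,
                      "request_id": request_id}}


denied = check_horizon("professional", 12, request_id="req_example")
```

Keeping the caps in one table-shaped mapping makes the gateway rule easy to audit against the published tier table.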
---
Endpoints
Geographies
GET /country list + rank, filter by region/income_group
GET /country/{iso3} profile + headline CC + pillars
GET /country/{iso3}/indicators all indicator_values (year filters)
GET /country/{iso3}/forecast forecast for cc_global (horizon by tier)
GET /state/{iso3166_2} subnational profile (e.g. US-CA)
GET /state?country={iso3} subnationals of a country, ranked
GET /city/{city_id} city profile + rank
GET /city?country=&tech_hub=true city search/leaderboard
Organizations & work
GET /company/{company_id} company profile + ai_adoption_score
GET /company?industry=&country= search / peer leaderboard
GET /industry/{code} industry panel + exposure scores
GET /industry/{code}/occupations occupations weighted into the industry
GET /occupation/{soc} occupation exposure + tasks + wage/employment
GET /occupation?exposure_gt=0.6 screen occupations by exposure
Indices, forecasts, scenarios
GET /index list published indices + methodology_version
GET /index/{index_id} definition, weights, pillars
GET /index/{index_id}/values ranked index_values (entity_level, year)
GET /forecast/{target_id} point + p05..p95 bands, model metadata
POST /scenario create a scenario (assumptions → run)
GET /scenario/{scenario_id} fetch a saved/preset scenario + results
POST /simulate run an ad-hoc scenario (no persistence)
Personal (PII-bearing, scope read:assess)
POST /assess personal CC-exposure assessment
GET /assess/{assessment_id} retrieve (owner only)
GET /career?occupation=&region= career/reskilling recommendations
Meta
GET /catalog/indicators data dictionary
GET /catalog/sources provenance + freshness
GET /export create async bulk export job
GET /export/{job_id} poll export job → signed URL
GET /healthz /version /openapi.json
---
Example responses
GET /country/USA
{
"country_id": "USA", "name": "United States", "region": "North America",
"income_group": "High", "cc_global": {
"score": 87.4, "rank_global": 1, "percentile": 99.6, "year": 2025,
"methodology_version": "cc_global@2025.2",
"pillars": { "research": 91.2, "talent": 84.0, "infrastructure": 88.7,
"adoption": 85.1, "policy": 79.3 }
},
"trend": { "2021": 79.1, "2022": 81.6, "2023": 83.9, "2024": 85.8, "2025": 87.4 },
"sources": ["oecd_ai","stanford_ai_index","world_bank","wipo","github"]
}
GET /occupation/15-1252.00 (Software Developers)
{
"occupation_id": "15-1252.00", "title": "Software Developers",
"median_wage_usd": 132270, "employment": 1692100, "job_zone": 4,
"cc_exposure": { "automation_exposure": 0.31, "augmentation_exposure": 0.78,
"ai_complementarity": 0.62, "cc_exposure_index": 58.3 },
"interpretation": "High augmentation, low displacement — AI complements core tasks.",
"growth_outlook": "Much faster than average",
"top_exposed_tasks": [
{ "task": "Write/test/debug code", "ai_exposure": 0.84 },
{ "task": "Design software architecture", "ai_exposure": 0.41 }
]
}
POST /simulate
// request
{ "target_id": "cc_industry:5415", "method": "monte_carlo", "n_draws": 5000,
"horizon_years": 8,
"assumptions": { "adoption_curve": "bass", "p": 0.03, "q": 0.45,
"policy_shock": { "year": 2027, "delta": -0.1 } } }
// response
{ "target_id": "cc_industry:5415", "method": "monte_carlo", "n_draws": 5000,
"trajectory": [
{ "year": 2026, "p50": 61.2, "p05": 58.1, "p95": 64.0 },
{ "year": 2030, "p50": 78.9, "p05": 70.4, "p95": 86.1 },
{ "year": 2033, "p50": 88.3, "p05": 79.0, "p95": 94.7 }
],
"convergence": { "rhat_max": 1.01, "ess_min": 1820 },
"model_version": "scenario_engine@1.4" }
POST /assess (PII)
// request
{ "occupation_soc": "43-3031.00", "region": "US-TX",
"skills": ["bookkeeping","quickbooks","reconciliation"], "tenure_years": 8 }
// response
{ "assessment_id": "asm_01J...", "cc_personal_score": 71.5,
"exposure_band": "high-automation",
"drivers": [ {"factor":"routine data entry","weight":0.4},
{"factor":"rules-based reconciliation","weight":0.3} ],
"recommendations": [
{ "path": "FP&A / financial analyst", "skill_gap": ["modeling","sql"],
"transition_difficulty": "moderate", "wage_delta_usd": 24000 }
],
"retention_expires_at": "2026-12-23T00:00:00Z" }
---
GraphQL
A federated GraphQL gateway (https://api.cognitivecoefficient.com/graphql) sits over the same services for clients that need precise field selection and cross-entity joins in one round-trip — e.g. "countries in the OECD with their top-3 most AI-exposed industries and 2030 forecast bands." Same auth/tiers/rate-limits as REST; query cost analysis caps complexity per tier (Free 1k points, Enterprise 50k). Persisted queries required for Free tier.
query { country(iso3:"DEU") {
name ccGlobal { score rank }
industries(first:3, orderBy: EXPOSURE_DESC) {
name automationExposure forecast(year:2030){ p50 p05 p95 } } } }
Bulk export
GET /export creates an async job (large extracts never block a request). Tiers gate format and volume: Free none; Pro CSV ≤1M rows; API/University Parquet+JSON; Enterprise/Government signed access to curated lakehouse partitions (Parquet/Iceberg) with row/column scoping by tenant. Jobs poll to a time-limited signed URL; every export is recorded in audit_log with the requesting key and filter set.
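A time-limited, tenant-bound signed URL can be sketched with an HMAC over the job, tenant, and expiry. In practice an object store's own presigned URLs would likely be used; the host, parameter names, TTL, and signing key below are all hypothetical.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

SIGNING_KEY = b"demo-only-signing-key"  # hypothetical; load from the secrets manager


def _signature(job_id: str, org_id: str, expires: int) -> str:
    payload = f"{job_id}:{org_id}:{expires}".encode("utf-8")
    return hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()


def sign_export_url(base_url: str, job_id: str, org_id: str,
                    ttl_s: int = 900, now: Optional[float] = None) -> str:
    expires = int((time.time() if now is None else now) + ttl_s)
    query = urlencode({"org": org_id, "exp": expires,
                       "sig": _signature(job_id, org_id, expires)})
    return f"{base_url}/{job_id}?{query}"


def verify_export_url(url: str, now: Optional[float] = None) -> bool:
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        org_id, expires, sig = params["org"][0], int(params["exp"][0]), params["sig"][0]
    except (KeyError, ValueError):
        return False  # missing or malformed parameters are rejected
    job_id = parsed.path.rsplit("/", 1)[-1]
    expected = _signature(job_id, org_id, expires)
    fresh = (time.time() if now is None else now) < expires
    return fresh and hmac.compare_digest(sig.encode("utf-8"), expected.encode("utf-8"))


url = sign_export_url("https://exports.example.com/files", "job_123", "org_a",
                      ttl_s=60, now=1000)
```

Because the tenant id is part of the signed payload, a URL issued to one tenant cannot be rewritten to point at another tenant's export.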
Security & Privacy
CognitiveCoefficient — Security & Privacy Architecture
Threat posture: a multi-tenant data platform serving Free→Government tiers, holding (a) valuable proprietary indices/forecasts, (b) customer billing data, and (c) sensitive personal career-exposure assessments (PII). The crown jewels are tenant isolation and PII protection; the published indices are public-facing but their unpublished forecasts and bulk extracts are gated.
---
1. Authentication
- Users: OIDC/OAuth2 (platform IdP + optional SSO/SAML for Enterprise/Government/University tenants). Passwords (where used) are hashed with argon2id; MFA (TOTP/WebAuthn) available, mandatory for admin/owner and all Government tenants.
- Sessions: short-lived JWT access tokens (≤15 min) + rotating refresh tokens in HttpOnly, Secure, SameSite=Strict cookies. Token revocation list checked at the gateway.
- API keys: prefixed (cc_live_ / cc_test_), high-entropy, shown once, stored only as a hash (key_hash). Keys carry scopes, tier, expires_at; rotation and revocation are first-class (revoked_at). Per-key last_used_at enables stale-key detection.
2. Authorization
- RBAC within a tenant (owner/admin/member) + scope-based authz on API keys (read:index, run:simulate, read:assess…).
- Tier enforcement at the gateway: horizon limits, export formats, endpoint access (e.g. subnational deep-dives gated to Government, read:assess to the owning tenant only).
- Deny-by-default; every privileged action authorized server-side (never trust the client).
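The deny-by-default rule combined with scope and tenant checks might look like the sketch below. Two things here are assumptions rather than documented behavior: that a wildcard such as read:* never implies the PII-bearing read:assess scope, and the function and parameter names.

```python
from typing import Optional, Set

# Assumption: PII-bearing scopes must be granted explicitly, never via a wildcard.
EXPLICIT_ONLY = {"read:assess"}


def scope_allows(granted: Set[str], required: str) -> bool:
    """True only for an exact scope or a matching '<verb>:*' wildcard."""
    if required in granted:
        return True
    if required in EXPLICIT_ONLY:
        return False
    verb, sep, _resource = required.partition(":")
    return bool(sep) and f"{verb}:*" in granted


def authorize(granted: Set[str], key_org: str, required: str,
              resource_org: Optional[str] = None) -> bool:
    """Scope check plus tenant check; anything not explicitly allowed is denied."""
    if not scope_allows(granted, required):
        return False
    # Tenant-owned resources (e.g. assessments) must belong to the caller's org.
    return resource_org is None or resource_org == key_org


can_read_index = authorize({"read:*"}, "org_a", "read:index")
can_read_other_tenant = authorize({"read:assess"}, "org_a", "read:assess",
                                  resource_org="org_b")
```

Every branch that does not positively match returns False, so adding a new scope later cannot accidentally widen access for existing keys.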
3. Tenant isolation
- Shared-schema, tenant-scoped: every OLTP row carries org_id; Postgres Row-Level Security policies enforce org_id = current_setting('app.org_id') on all tenant tables, as defense-in-depth so an app-layer bug can't cross tenants. The setting is applied per request from the authenticated principal.
- Analytical data is public/global (indices/indicators), so the warehouse isn't tenant-partitioned, but tenant-specific artifacts (saved scenarios, assessments, exports) are org_id-scoped and RLS-protected.
- Enterprise/Government can opt into stronger isolation: dedicated schema or dedicated DB, and data-residency pinning (orgs.data_residency) for region-locked deployments.
- Export jobs are row/column-scoped to the requesting tenant's entitlements; signed URLs are short-lived and single-tenant.
4. PII handling (personal assessments)
Career assessments contain self-reported role/skills/region — treated as sensitive PII:
- Data minimization: collect only what the assessment needs; anonymous assessments allowed (no user_id).
- Encryption at rest of assessments.inputs (column-level / envelope encryption via KMS), separate from the public analytical store.
- Logical separation: PII lives in the OLTP/assessment-svc boundary, never copied into the analytical warehouse or training data without aggregation/anonymization.
- Retention & right-to-erasure: retention_expires_at TTL auto-purges; user-initiated delete cascades and is logged. GDPR/CCPA DSAR (access/export/delete) supported.
- No PII in logs/telemetry: scrubbing middleware; assessment inputs never logged in plaintext. Aggregate/anonymized stats only for analytics.
- Consent: explicit consent gate before an assessment is stored; clear processing notice.
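The "no PII in logs" control could be approximated with a logging filter like the one sketched here. The set of sensitive field names is taken from the /assess request example plus the inputs column; the `payload` attribute convention and the class name are assumptions.

```python
import logging

# Field names from the /assess request example, plus the stored inputs column.
SENSITIVE_KEYS = {"skills", "occupation_soc", "region", "tenure_years", "inputs"}


def scrub(value):
    """Recursively replace sensitive fields in dicts/lists with a redaction marker."""
    if isinstance(value, dict):
        return {k: "[REDACTED]" if k in SENSITIVE_KEYS else scrub(v)
                for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [scrub(v) for v in value]
    return value


class PiiScrubFilter(logging.Filter):
    """Scrubs a structured payload attached to the record as `record.payload`."""

    def filter(self, record: logging.LogRecord) -> bool:
        if hasattr(record, "payload"):
            record.payload = scrub(record.payload)
        return True  # keep the record, minus the sensitive values


logger = logging.getLogger("assessment")
logger.addFilter(PiiScrubFilter())
example = scrub({"assessment_id": "asm_1", "inputs": {"skills": ["sql"]}})
```

An allow-list of loggable fields would be stricter than this deny-list; the deny-list is shown only because it is shorter to illustrate.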
5. Secrets management
- All secrets (DB creds, source API tokens, KMS keys, Stripe keys, signing keys) in a secrets manager (Vault / cloud secrets), injected at runtime; never in images, env files in git, or code. (Homelab tier: secrets in gitignored secrets/, never committed, consistent with repo policy.)
- Short-lived, rotated DB credentials; per-service least-privilege DB roles. Key/secret rotation runbooks.
6. Transport & network security
- TLS 1.2+ everywhere, HSTS, modern cipher suites; TLS terminated at the edge (Cloudflare/Caddy). v1: Caddy on netops01 auto-manages certs for *.cognitivecoefficient.com.
- WAF + rate limiting + bot/DDoS protection at the edge; the API gateway adds per-key/tier quotas (Redis token bucket) as abuse and scraping protection for the valuable data.
- Network segmentation: services in private subnets; only the gateway/edge is public. mTLS within the service mesh (target). DB/warehouse never internet-exposed.
- CORS locked to first-party origins for cookie-auth routes; API-key routes are origin-agnostic but key-bound.
7. Application security
- Input validation via Pydantic v2 on every endpoint; parameterized queries only (no string-built SQL), which closes off the classic SQL-injection path.
- Standard headers: CSP, X-Content-Type-Options, X-Frame-Options/frame-ancestors, Referrer-Policy.
- Dependency scanning (Dependabot/Trivy), image signing (cosign), SAST in CI, periodic pen-tests. Pinned, scanned base images (distroless).
- GraphQL query-cost/depth limits per tier to prevent expensive-query DoS.
8. Audit & compliance
- Append-only audit_log for security-relevant events: logins, key create/revoke, exports, assessment access/delete, admin actions, tier changes, each with actor, org_id, IP, UA, timestamp.
- Centralized, tamper-evident log shipping; retention per compliance needs.
- Compliance targets: SOC 2 Type II (Enterprise/Government readiness), GDPR/CCPA (PII), and government-tenant requirements (e.g. FedRAMP-style controls, data residency) as the customer base warrants.
- Incident-response runbook, breach-notification process, and regular access reviews (stale keys, dormant accounts).
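One common way to make an append-only log tamper-evident is a hash chain, where each entry's hash covers the previous entry's hash. The sketch below illustrates that technique only; the document does not say how audit_log achieves tamper evidence, and the field names are assumptions.

```python
import hashlib
import json
from typing import List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash: str, event: dict) -> str:
    body = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_event(log: List[dict], event: dict) -> dict:
    """Append an event whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"event": event, "prev_hash": prev_hash,
             "hash": _entry_hash(prev_hash, event)}
    log.append(entry)
    return entry


def verify_chain(log: List[dict]) -> bool:
    """Recompute every hash; any edited, removed, or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash or entry["hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log: List[dict] = []
append_event(audit_log, {"action": "key.create", "actor": "user_1", "org_id": "org_a"})
append_event(audit_log, {"action": "export.create", "actor": "user_1", "org_id": "org_a"})
chain_ok = verify_chain(audit_log)
```

Shipping the latest hash to a separate system periodically means an attacker who can rewrite the log still cannot rewrite the anchor it must match.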
Roadmap
CognitiveCoefficient — Roadmap (v1 → global standard)
Each phase is gated by a concrete, demonstrable milestone. The architecture (consolidated FastAPI → cloud-native mesh) is decomposed in step with these phases, never ahead of need.
---
Phase 0 — v1 Live (shipped) ✅
- Consolidated FastAPI app serving the terminal (server-rendered pages + static charts) and the JSON API at www.cognitivecoefficient.com (Caddy/TLS on netops01).
- DuckDB + curated Parquet/CSV (app/data/) as the analytical engine; offline collector scripts for the seed sources; precomputed indices + forecast artifacts in app/ml.
- Core endpoints live (/country /state /city /company /industry /occupation /index /forecast /scenario /assess /career /simulate), API-key auth, Free/Pro tier gating.
- Exit: real data for a credible seed set of countries/industries/occupations, all routes 200, public indices browseable.
Phase 1 — Data depth & credibility (0–3 mo)
- Full ETL coverage: all 13 sources ingesting; entity-resolution crosswalks; validation gates (Great Expectations) and anomaly/confidence scoring.
- Methodology v1 published (transparent weights, versioned) — the credibility cornerstone.
- Postgres for OLTP; user accounts, API keys, billing (Stripe) live; Pro tier self-serve.
- Exit: defensible, sourced, reproducible indices across ≥150 countries + subnationals + key industries/occupations; methodology whitepaper out.
Phase 2 — Product depth & the terminal (3–6 mo)
- Next.js frontend replacing server pages: geo explorer (deck.gl maps), forecast fans, scenario sliders, cmd-k terminal, watchlists/alerts.
- Forecasting matured: full model championship (Bass/logistic/Gompertz/Prophet/XGBoost/PyMC), backtest cards, uncertainty bands everywhere.
- Scenario engine GA (parametric + Monte-Carlo); personal assessment + career recommendations polished.
- Exit: a genuinely "Bloomberg-terminal-feeling" product; Pro conversions; first API customers.
Phase 3 — Scale & platform hardening (6–12 mo)
- Migrate analytical serving to ClickHouse; lakehouse on object store (Parquet/Iceberg); Airflow orchestration replaces cron; MLflow model registry.
- Carve out compute/PII-heavy services (forecast-, scenario-, assessment-svc); API gateway + rate-limit/billing externalized; GraphQL gateway.
- K8s + Terraform + ArgoCD; observability (OTel/Prometheus/Grafana/Sentry); SOC 2 Type II readiness; PII controls (encryption, RLS, erasure) audited.
- Exit: sub-100ms reads at scale, multi-tenant isolation proven, first Enterprise contracts and an API business line.
Phase 4 — Authority & the standard (12–24 mo)
- University + Government tiers: citable methodology (DOIs), policy-scenario modeling, subnational/city depth, on-prem/region-locked options.
- Annual flagship "State of the Intelligence Economy" report + live index (the OECD-AI-Index / Stanford-AI-Index analog) — press, citations, authority.
- Bayesian scenario tier GA; cross-source rigor; data-quality SLAs.
- Exit: CC cited by researchers, journalists, and at least one government/IGO using it for policy; recognized methodology.
Phase 5 — Global reference / Bloomberg-of-AI (24 mo+)
- Network-effect flywheel: free public index drives authority → Enterprise/Gov pay to align to "the CC standard"; API embedded across fintech/HR-tech/edtech.
- Real-time/continuous indicators (HF/GitHub/arXiv live feeds), expanded coverage (every country + major subnationals + thousands of companies/occupations), partner data marketplace.
- Multi-region deployment, full compliance (SOC 2, GDPR/CCPA, gov frameworks), white-label/embedded offerings.
- North star: when a strategist, minister, or investor wants the authoritative read on the intelligence economy, CognitiveCoefficient is the default — the Bloomberg Terminal / OECD AI Index / World Bank AI Observatory of machine intelligence, in one platform.