System Architecture

CognitiveCoefficient — System Architecture

> The Bloomberg Terminal / OECD.AI / World Bank Observatory for the intelligence economy.
> CognitiveCoefficient (CC) measures, indexes, and forecasts the diffusion of machine intelligence across countries, subnational regions, cities, companies, industries, and occupations, and lets users assess their own exposure and simulate scenarios.

This document describes (A) the full-scale target architecture, (B) the pragmatic v1 consolidated architecture actually shipped, and (C) the migration path between them.

---

A. Target Architecture (full scale, cloud-native)

A.0 Architectural principles

Read-mostly, analytics-heavy. The product is a data observatory; 95%+ of traffic is reads over precomputed indices, time series, and forecasts. Architecture optimizes for cheap fan-out reads and batch recompute, not OLTP write throughput.
Warehouse-of-record, serving-cache-of-convenience. A columnar analytical warehouse is the source of analytical truth; Postgres holds only relational/operational state (users, keys, billing, saved scenarios, metadata catalog).
Reproducible indices. Every published index value is a pure function of (raw indicator values × methodology version × weights). Recompute is deterministic and versioned; a value can always be re-derived and audited.
Methodology as code + as data. Index definitions, weights, normalization rules, and forecast model configs are versioned artifacts, not ad-hoc SQL.
Tiered access from day one. Free/Pro/Enterprise/API/Government/University tiers are enforced at the gateway, not bolted on later.
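
The "reproducible indices" principle becomes checkable if a digest of every input is stored next to each published value. A minimal sketch, assuming JSON-serializable inputs; `input_fingerprint`, the `rd_spend` indicator id, and the sample values are hypothetical and not taken from the codebase:

```python
import hashlib
import json


def input_fingerprint(raw_values: dict, weights: dict, methodology_version: str) -> str:
    """Return a stable SHA-256 digest of everything that determines an index value."""
    payload = json.dumps(
        {
            "raw_values": raw_values,
            "weights": weights,
            "methodology_version": methodology_version,
        },
        sort_keys=True,         # key order must not change the digest
        separators=(",", ":"),  # no whitespace variance
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Same inputs in a different key order give the same digest ...
a = input_fingerprint({"USA": 3.4, "DEU": 3.1}, {"rd_spend": 0.6, "ai_patents_per_capita": 0.4}, "v1")
b = input_fingerprint({"DEU": 3.1, "USA": 3.4}, {"ai_patents_per_capita": 0.4, "rd_spend": 0.6}, "v1")
# ... while a new methodology version gives a different one.
c = input_fingerprint({"USA": 3.4, "DEU": 3.1}, {"rd_spend": 0.6, "ai_patents_per_capita": 0.4}, "v2")
```

An auditor who re-derives a value can compare digests first: a mismatch means the inputs differ, so the scores are not expected to agree.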

A.1 Logical layers

                         ┌──────────────────────────────────────────────┐
   Browser / Terminal ──▶│  Edge: CDN + WAF + TLS (Cloudflare/Fastly)    │
                         └───────────────┬──────────────────────────────┘
                                         ▼
            ┌────────────────────────────────────────────────────────┐
 Frontend   │  Next.js / React / TypeScript (App Router, RSC, SSR/ISR)│
 (Vercel or │  - Terminal UI, map/geo explorer, charting (visx/ECharts)│
  K8s pods) │  - Career/assessment wizard, scenario simulator UI       │
            └───────────────┬───────────────────────┬─────────────────┘
                            ▼                        ▼
            ┌───────────────────────────┐  ┌────────────────────────────┐
 API Gateway│ Kong / APISIX / Envoy     │  │ GraphQL Gateway (Apollo/    │
            │ - authN (JWT, API keys)   │  │ Strawberry) federated        │
            │ - rate limit, quotas,     │  └────────────────────────────┘
            │   tier routing, billing   │
            └───────────────┬───────────┘
                            ▼
   ┌────────────────────────────────────────────────────────────────────┐
   │ Service mesh (Python / FastAPI microservices, async, Pydantic v2)  │
   │                                                                     │
   │  geo-svc        company-svc     industry-svc    occupation-svc      │
   │  index-svc      forecast-svc    scenario-svc    assessment-svc      │
   │  career-svc     export-svc      catalog-svc     auth/billing-svc    │
   └───────┬───────────────────┬────────────────────────┬───────────────┘
           ▼                   ▼                        ▼
 ┌──────────────────┐ ┌────────────────────┐ ┌───────────────────────────┐
 │ Postgres (OLTP)  │ │ ClickHouse (OLAP   │ │ DuckDB (embedded analytics│
 │ users, api_keys, │ │ serving): indicator│ │ + per-request slices,     │
 │ billing, scenarios│ │ _values, indices,  │ │ notebook/export jobs,     │
 │ metadata catalog │ │ forecasts, panels  │ │ small-table joins)        │
 └──────────────────┘ └────────────────────┘ └───────────────────────────┘
           ▲                   ▲                        ▲
           │                   │   Lakehouse (Parquet on S3/R2, Iceberg) │
           │                   └─────────────┬──────────┘                │
           │                                 ▼                           │
   ┌────────────────────────────────────────────────────────────────────┐
   │ ETL / orchestration: Apache Airflow (KubernetesExecutor)            │
   │  - 40+ source DAGs (World Bank, IMF, OECD, BLS, OECD.AI, AI Index,  │
   │    HuggingFace, GitHub, arXiv, O*NET, WIPO, Top500…)                │
   │  - dbt (warehouse transforms), Great Expectations (validation)     │
   │  - model training/scoring DAGs (Prophet/XGBoost/PyMC)              │
   └────────────────────────────────────────────────────────────────────┘

A.2 Frontend

Next.js 14+ (App Router) in TypeScript. React Server Components for data-dense pages (country/company/industry profiles render server-side against the API), ISR for index/leaderboard pages (revalidate on each ETL publish), client components for interactive widgets.
Visualization: ECharts + visx for time series/forecast fans; deck.gl/MapLibre for the geo explorer (countries → subnationals → cities choropleths); a "terminal" command palette (cmd-k) for power users à la Bloomberg.
State/data fetching: TanStack Query against the REST API; SWR for the GraphQL gateway. Auth via NextAuth/Auth.js with the platform's own OIDC.
Design system: shared component library, dark "terminal" theme primary.

A.3 Backend services (FastAPI/Python)

Each domain is an independently deployable FastAPI service (Python 3.12, async, Pydantic v2 models, Uvicorn/Gunicorn workers, served behind the gateway):

Service	Responsibility
`geo-svc`	country / state-province / city entities, profiles, rankings
`company-svc`	company profiles, AI-adoption indicators, peer sets
`industry-svc`	industry (NAICS/ISIC) panels, exposure scores
`occupation-svc`	occupation (O*NET/SOC) automation/augmentation exposure
`index-svc`	index definitions, weights, on-the-fly + materialized index values
`forecast-svc`	serves model forecasts + uncertainty bands; triggers retrains
`scenario-svc`	parametric + Monte-Carlo scenario simulation engine
`assessment-svc`	personal CC-exposure assessments (PII-bearing)
`career-svc`	career path / reskilling recommendations
`export-svc`	bulk CSV/Parquet/JSON exports, async export jobs
`catalog-svc`	data dictionary, source provenance, methodology registry
`auth/billing-svc`	tenants, API keys, tiers, quotas, Stripe metering

Cross-cutting: shared cc-core Python package (Pydantic schemas, warehouse client, auth middleware, telemetry).

A.4 Data stores

Postgres 16 (managed, e.g. RDS/Cloud SQL/Neon) — OLTP: users, api_keys, subscriptions, scenarios (saved), assessments metadata, audit_log, and the metadata catalog (indicator definitions, sources, methodology versions). Logical replication to the warehouse for catalog joins.
ClickHouse — primary OLAP serving store: indicator_values, indices, index_values, forecasts, wide country/industry/occupation panels. MergeTree tables partitioned by (entity_level, year); materialized views precompute rankings and YoY deltas. Powers sub-100ms reads for the terminal.
DuckDB — embedded engine inside export-svc and analyst/notebook workloads; reads Parquet directly from the lakehouse for ad-hoc joins and per-request scenario slices without hitting ClickHouse.
Lakehouse — Parquet on S3/R2 with Apache Iceberg table format; raw → staged → curated zones. The immutable raw zone is the audit substrate for reproducibility.
Redis — gateway rate-limit counters, hot-response cache, Celery/RQ broker for async jobs.

A.5 Warehouse abstraction

A single WarehouseClient interface (in cc-core) abstracts the analytical backend so services are storage-agnostic:

from typing import Protocol
import pyarrow

class WarehouseClient(Protocol):
    def query(self, sql: str, params: dict) -> pyarrow.Table: ...
    def panel(self, entity, indicators, years) -> pyarrow.Table: ...

Implementations: DuckDBClient (v1, embedded), ClickHouseClient (scale), LakehouseClient (Parquet/Iceberg). Services restrict themselves to a SQL surface that both DuckDB and ClickHouse accept (an ANSI subset plus standard window functions), so the migration is intended to be a config flip, not a rewrite.
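
The storage-agnostic idea can be sketched with two interchangeable backends. This is an illustration under stated assumptions, not the cc-core implementation: rows are plain dicts instead of pyarrow.Table to stay dependency-free, SQLite stands in for DuckDB, and `SQLiteClient`, `CannedClient`, and `latest_score` are hypothetical names.

```python
import sqlite3
from typing import Any, Protocol

Row = dict[str, Any]


class WarehouseClient(Protocol):
    def query(self, sql: str, params: dict) -> list[Row]: ...


class SQLiteClient:
    """Embedded stand-in backend (assumption: SQLite in place of DuckDB)."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.row_factory = sqlite3.Row

    def query(self, sql: str, params: dict) -> list[Row]:
        return [dict(row) for row in self.conn.execute(sql, params)]


class CannedClient:
    """Returns fixed rows regardless of the SQL; useful in unit tests."""

    def __init__(self, rows: list[Row]) -> None:
        self.rows = rows

    def query(self, sql: str, params: dict) -> list[Row]:
        return list(self.rows)


def latest_score(client: WarehouseClient, entity_id: str):
    """Service code depends only on the protocol, never on a concrete backend."""
    rows = client.query(
        "SELECT score FROM index_values WHERE entity_id = :entity_id "
        "ORDER BY year DESC LIMIT 1",
        {"entity_id": entity_id},
    )
    return rows[0]["score"] if rows else None


embedded = SQLiteClient()
embedded.conn.execute("CREATE TABLE index_values (entity_id TEXT, year INTEGER, score REAL)")
embedded.conn.executemany(
    "INSERT INTO index_values VALUES (?, ?, ?)",
    [("USA", 2022, 70.0), ("USA", 2023, 71.5)],
)
```

Because `latest_score` only sees the protocol, swapping the backend is a matter of which client object the service is constructed with.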

A.6 ETL / orchestration

Apache Airflow (KubernetesExecutor) runs one DAG per source plus transform/index/forecast DAGs (see etl_architecture_md).
dbt for warehouse transforms (raw → staging → marts), Great Expectations for validation gates, Soda/anomaly jobs for drift.

A.7 Infrastructure & platform

Containers: every service is a Docker image (distroless/slim Python base), built in CI, scanned (Trivy), signed (cosign).
Orchestration: Kubernetes (EKS/GKE or k3s on the homelab tier). HPA on the read services; KEDA scales Airflow workers on queue depth.
IaC: Terraform for cloud (VPC, managed PG, object storage, ClickHouse Cloud/Altinity, DNS, secrets) + Helm charts per service. GitOps via ArgoCD.
CI/CD: GitHub Actions → build/test/scan → push → ArgoCD sync. Blue/green for the gateway, rolling for services.
Observability: OpenTelemetry traces → Tempo/Jaeger; Prometheus + Grafana metrics; Loki logs; Sentry for app errors; index-svc/forecast-svc emit data-freshness and model-drift metrics.
Environments: dev (homelab/k3s), staging, prod.

---

B. v1 Consolidated Architecture (what actually shipped)

The deployed v1 is a single consolidated FastAPI application — deliberately one process, one repo, one deploy. It is the pragmatic compression of the target diagram into a monolith that preserves the seams needed to split later.

cognitivecoefficient/
└── app/
    ├── main.py              # FastAPI app: mounts routers, static, pages
    ├── pages/               # server-rendered Jinja/HTML pages (the "terminal")
    ├── static/
    │   ├── css/  js/  vendor/   # frontend assets, charting vendored client-side
    ├── data/                # DuckDB file(s) + curated Parquet/CSV snapshots
    ├── routers/             # /country /state /city /company /industry
    │                        # /occupation /index /forecast /scenario
    │                        # /assess /career /simulate
    ├── services/            # geo, index, forecast, scenario, assessment logic
    │                        #  = the future microservice boundaries, as modules
    ├── models/              # Pydantic schemas (shared with future cc-core)
    └── ml/                  # fitted forecast models + scenario engine

Key v1 properties:

One FastAPI process serves both the HTML pages (app/pages, app/static) and the JSON API (app/routers/*). No separate Next.js frontend yet — the "terminal" is server-rendered HTML + vendored JS charts. This is the app/{pages,static} layout actually on disk.
DuckDB as the only analytical engine, embedded, reading the curated Parquet/CSV in app/data/. No ClickHouse, no separate warehouse cluster. Indices and forecasts are precomputed offline and shipped as data files; the API serves slices of them.
SQLite/Postgres-lite for operational state (users, api_keys, saved scenarios) — a single relational DB; in the smallest deploy this can be a DuckDB/SQLite file, with a clean swap to Postgres.
ETL is offline batch scripts, not Airflow — Python collectors run on a schedule (cron/ctask) producing the Parquet snapshots committed/synced into app/data/. The DAG logic exists as ordered functions, ready to lift into Airflow tasks.
ML is "train offline, serve artifacts." Prophet/XGBoost/PyMC models are fit in notebooks/batch, serialized, and loaded by app/ml; the /forecast and /simulate endpoints serve precomputed + light on-request computation (Monte-Carlo runs bounded for latency).
Deploy: single container (or static-export + FastAPI) behind Caddy on netops01 at www.cognitivecoefficient.com (apex → www redirect, TLS terminated by Caddy). One image, docker compose up.
Auth/tiers: API-key middleware in-process; tier limits enforced with Redis or an in-memory/SQLite token bucket.
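
The in-memory token bucket mentioned in the last item could look roughly like this. It is a sketch, not the shipped middleware: the per-tier limits in `TIER_RPM` are hypothetical (the document does not state them), and the `now` parameter exists only to make the behavior deterministic in tests.

```python
import time
from typing import Optional

# Hypothetical requests-per-minute limits per tier.
TIER_RPM = {"free": 60, "pro": 600, "enterprise": 6000}


class TokenBucket:
    def __init__(self, rpm: int, now: Optional[float] = None) -> None:
        self.capacity = float(rpm)
        self.rate = rpm / 60.0        # tokens refilled per second
        self.tokens = self.capacity   # start with a full bucket
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        """Refill for the elapsed time, then spend one token if available."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


_buckets: dict[str, TokenBucket] = {}


def check_rate_limit(api_key_id: str, tier: str, now: Optional[float] = None) -> bool:
    """One bucket per API key, sized by the key's tier."""
    bucket = _buckets.get(api_key_id)
    if bucket is None:
        bucket = _buckets[api_key_id] = TokenBucket(TIER_RPM[tier], now)
    return bucket.allow(now)
```

State lives in the process, so this only holds for a single-instance deploy; with several instances the counters would need to move to Redis, as the item above notes.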

This is enough to be live, correct, and credibly "Bloomberg-for-AI" on day one, at a fraction of the operational cost.

---

C. Migration Path (v1 → target)

The consolidation was designed to decompose along pre-cut seams, in dependency order:

Step	Move	Trigger	How (because seams already exist)
1	Postgres for OLTP	first paying tenants / multi-instance	Swap the relational DB URL; `models/` Pydantic + repo layer already DB-agnostic.
2	Split the frontend to Next.js	need rich interactive terminal/maps	Stand up Next.js consuming the same JSON routers; retire `app/pages`. API contract unchanged.
3	Airflow for ETL	sources/schedules outgrow cron	Wrap existing collector functions as Airflow tasks; output still Parquet → lakehouse.
4	Lakehouse (Parquet/Iceberg on object store)	data volume / reproducibility/audit needs	Point the warehouse client's raw/curated paths at S3/R2 instead of `app/data/`.
5	ClickHouse serving	read latency/concurrency at scale	Implement `ClickHouseClient` behind the existing `WarehouseClient` interface; flip config per env. DuckDB stays for exports.
6	Carve out services (forecast-svc, scenario-svc, assessment-svc first — the compute/PII-heavy ones)	independent scaling / isolation (esp. PII)	Each `services/<x>` module + its router becomes a FastAPI service importing `cc-core`. Gateway routes by path.
7	API Gateway + GraphQL	external API customers, federation	Move key/tier/rate-limit out of the monolith into Kong/APISIX; add Apollo/Strawberry GraphQL over the REST services.
8	K8s + Terraform + ArgoCD	reliability/SLA commitments	Containers already exist; add Helm/Terraform/GitOps.

Invariant across the whole path: the public API contract (/country, /index, /forecast, …), the Pydantic schemas, and the WarehouseClient interface never change. v1 monolith and target mesh are the same API over a swappable substrate — which is what makes "consolidated FastAPI now, cloud-native later" a refactor rather than a rebuild.

Data Model

CognitiveCoefficient — Data Model

The data model has three concerns: (1) the analytical core (geographies, organizations, work, indicators, indices, forecasts, scenarios) — read-mostly, columnar-friendly; (2) operational state (users, api_keys, billing, saved scenarios) — relational/OLTP; (3) the metadata catalog (sources, methodology, provenance) — the glue that makes every value auditable.

Naming: snake_case; surrogate id keys (ULID/UUIDv7) for OLTP, stable natural codes (ISO/SOC/NAICS) for analytical dimensions. All fact tables carry source_id, methodology_version, and ingested_at for provenance.

---

1. Geographic dimension

`countries`

field	type	notes
`country_id`	text PK	ISO 3166-1 alpha-3 (`USA`, `DEU`)
`iso2`	text	ISO 3166-1 alpha-2
`name`	text	canonical English name
`region`	text	UN/World Bank region
`subregion`	text
`income_group`	text	World Bank: Low/Lower-mid/Upper-mid/High
`oecd_member`	bool
`population`	bigint	latest
`gdp_usd`	numeric	latest, nominal
`gdp_per_capita_ppp`	numeric
`centroid_geom`	geometry	for map rendering
`cc_rank`	int	current CognitiveCoefficient global rank

`subnationals` (states/provinces)

field	type	notes
`subnational_id`	text PK	ISO 3166-2 (`US-CA`, `DE-BY`)
`country_id`	text FK → countries
`name`, `type`	text	"state","province","region"
`population`, `gdp_usd`	numeric
`cc_rank_national`	int	rank within country

`cities`

field	type	notes
`city_id`	text PK	GeoNames id or `country-slug`
`subnational_id`	text FK	nullable
`country_id`	text FK
`name`, `metro_name`	text
`population`, `lat`, `lon`	numeric
`is_tech_hub`	bool
`cc_rank_global`	int	city-level CC rank

---

2. Organization & work dimensions

`companies`

field	type	notes
`company_id`	text PK	ULID; `ticker`/`lei` as natural keys
`name`, `legal_name`	text
`ticker`, `lei`, `domain`	text	identifiers
`hq_country_id`, `hq_city_id`	FK
`industry_id`	FK → industries	primary
`employee_count`, `market_cap_usd`	numeric
`founded_year`	int
`is_ai_native`	bool
`ai_adoption_score`	numeric	0–100 composite
`cc_rank_industry`	int

`industries`

field	type	notes
`industry_id`	text PK	NAICS or ISIC code
`code_system`	text	`NAICS` | `ISIC`
`name`, `sector`	text
`parent_industry_id`	text FK	self-ref hierarchy
`level`	int	2/3/4-digit depth
`automation_exposure`	numeric	0–1
`augmentation_potential`	numeric	0–1
`cc_industry_index`	numeric

`occupations`

field	type	notes
`occupation_id`	text PK	O*NET-SOC code (`15-1252.00`)
`soc_code`	text	6-digit SOC
`title`, `description`	text
`job_zone`	int	O*NET 1–5
`median_wage_usd`	numeric	BLS OEWS
`employment`	bigint	BLS
`automation_exposure`	numeric	0–1 (task-level rollup)
`augmentation_exposure`	numeric	0–1
`ai_complementarity`	numeric	-1..1 (substitute↔complement)
`cc_exposure_index`	numeric	headline per-occupation CC
`growth_outlook`	text	BLS projection band

occupation_tasks (bridge): occupation_id, task_id, task_desc, importance, ai_exposure — the task-level substrate that rolls up into occupation exposure.
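
The rollup from tasks to an occupation score is not spelled out here. One plausible reading, shown purely as an assumption, is an importance-weighted mean of the task-level `ai_exposure` values; `occupation_exposure` is a hypothetical helper name.

```python
def occupation_exposure(tasks: list[dict]) -> float:
    """Importance-weighted mean of task ai_exposure (assumed rollup rule)."""
    total_importance = sum(task["importance"] for task in tasks)
    if total_importance <= 0:
        raise ValueError("tasks must carry positive total importance")
    weighted = sum(task["importance"] * task["ai_exposure"] for task in tasks)
    return weighted / total_importance


sample_tasks = [
    {"task_id": "t1", "importance": 4.0, "ai_exposure": 0.5},
    {"task_id": "t2", "importance": 1.0, "ai_exposure": 1.0},
]
exposure = occupation_exposure(sample_tasks)
```

With exposures in 0–1 and non-negative importances, the result stays in 0–1, matching the range of `automation_exposure` in the table above.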

---

3. Indicators (the raw measurement layer)

`indicators` (definition / dictionary)

field	type	notes
`indicator_id`	text PK	e.g. `ai_patents_per_capita`
`name`, `description`	text
`unit`	text	`%`, `USD`, `count`, `index`
`entity_level`	text	`country` | `subnational` | `city` | `company` | `industry` | `occupation`
`source_id`	FK → sources	primary source
`direction`	text	`higher_better` | `lower_better`
`frequency`	text	`annual` | `quarterly` | `monthly`
`min_year`, `max_year`	int	coverage
`methodology_url`	text

`indicator_values` (the fact table — largest)

field	type	notes
`indicator_id`	FK
`entity_level`	text	partition key
`entity_id`	text	resolves to the right dimension by level
`year`	int	(or `period` for sub-annual)
`value`	numeric	raw measured value
`value_normalized`	numeric	min-max scaled to 0–100, or z-score (per methodology)
`source_id`	FK
`methodology_version`	text
`confidence`	numeric	0–1 data-quality score
`is_imputed`	bool	flag for filled gaps
`ingested_at`	timestamp

PK: (indicator_id, entity_level, entity_id, year, methodology_version). In ClickHouse: MergeTree PARTITION BY (entity_level, year) ORDER BY (indicator_id, entity_id).
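
Including `methodology_version` in the key means a re-ingest under the same version overwrites in place, while a new version adds a row next to the old one. A small sketch of that behavior, using SQLite as a stand-in engine and only a subset of the columns (both are simplifications for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE indicator_values (
        indicator_id        TEXT    NOT NULL,
        entity_level        TEXT    NOT NULL,
        entity_id           TEXT    NOT NULL,
        year                INTEGER NOT NULL,
        value               REAL,
        methodology_version TEXT    NOT NULL,
        PRIMARY KEY (indicator_id, entity_level, entity_id, year, methodology_version)
    )
    """
)


def upsert(row: dict) -> None:
    """Idempotent write: the composite key decides between insert and overwrite."""
    conn.execute(
        """
        INSERT OR REPLACE INTO indicator_values
            (indicator_id, entity_level, entity_id, year, value, methodology_version)
        VALUES
            (:indicator_id, :entity_level, :entity_id, :year, :value, :methodology_version)
        """,
        row,
    )


base = {
    "indicator_id": "ai_patents_per_capita",
    "entity_level": "country",
    "entity_id": "USA",
    "year": 2023,
    "value": 1.0,
    "methodology_version": "v1",
}
upsert(base)
upsert({**base, "value": 1.5})                                  # same key: overwrites
upsert({**base, "value": 2.0, "methodology_version": "v2"})     # new version: new row

row_count = conn.execute("SELECT COUNT(*) FROM indicator_values").fetchone()[0]
v1_value = conn.execute(
    "SELECT value FROM indicator_values WHERE methodology_version = 'v1'"
).fetchone()[0]
```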

---

4. Indices (the published products)

`indices` (index definitions)

field	type	notes
`index_id`	text PK	`cc_global`, `cc_industry`, `cc_occupation`
`name`, `description`	text
`entity_level`	text	what it ranks
`methodology_version`	text	current published version
`pillars`	jsonb	named sub-pillars
`weights`	jsonb	`{indicator_id: weight}`
`normalization`	text	`minmax` | `zscore` | `rank`
`aggregation`	text	`weighted_mean` | `geometric`
`published_at`	timestamp

`index_values`

field	type	notes
`index_id`	FK
`entity_level`, `entity_id`
`year`	int
`score`	numeric	0–100 headline
`rank_global`, `rank_peer`	int
`percentile`	numeric
`pillar_scores`	jsonb	breakdown
`methodology_version`	text	reproducibility
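
To make the `aggregation`, `rank_global`, and `percentile` fields concrete, here is a sketch of how a score and ranking might be derived from normalized indicator values. The function names are hypothetical, the percentile convention (top entity = 100, bottom = 0) is one choice among several, and ties are not handled specially.

```python
import math


def aggregate(normalized: dict, weights: dict, aggregation: str = "weighted_mean") -> float:
    """Combine normalized indicator values using the index's weights."""
    total_weight = sum(weights.values())
    if aggregation == "weighted_mean":
        return sum(normalized[k] * w for k, w in weights.items()) / total_weight
    if aggregation == "geometric":
        # Weighted geometric mean; requires strictly positive inputs.
        log_sum = sum(w * math.log(normalized[k]) for k, w in weights.items())
        return math.exp(log_sum / total_weight)
    raise ValueError(f"unknown aggregation: {aggregation}")


def rank_and_percentile(scores: dict) -> dict:
    """Map entity_id -> (rank, percentile), rank 1 being the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    n = len(ordered)
    result = {}
    for position, entity_id in enumerate(ordered, start=1):
        percentile = 100.0 if n == 1 else 100.0 * (n - position) / (n - 1)
        result[entity_id] = (position, percentile)
    return result


values = {"a": 100.0, "b": 25.0}
weights = {"a": 0.5, "b": 0.5}
arithmetic = aggregate(values, weights, "weighted_mean")
geometric = aggregate(values, weights, "geometric")
ranking = rank_and_percentile({"USA": 80.0, "DEU": 70.0, "FRA": 60.0})
```

The geometric variant penalizes imbalance: for the same inputs it yields a lower score than the weighted mean whenever the pillars differ.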

---

5. Forecasts & scenarios

`forecasts`

field	type	notes
`forecast_id`	text PK
`target_type`	text	`indicator` | `index`
`target_id`	text	which indicator/index
`entity_level`, `entity_id`
`model`	text	`prophet` | `xgboost` | `bass` | `logistic` | `gompertz` | `pymc`
`model_version`	text
`horizon_years`	int
`trained_at`	timestamp
`backtest_mape`	numeric	accuracy metadata

`forecast_values`

field	type	notes
`forecast_id`	FK
`year`	int	future period
`point`	numeric	median/expected
`p05`,`p25`,`p50`,`p75`,`p95`	numeric	uncertainty bands
`is_historical`	bool	fitted vs projected

`scenarios`

field	type	notes
`scenario_id`	text PK	ULID
`owner_user_id`	FK → users	null for system presets
`name`, `description`	text
`base_target_id`	text	indicator/index forecast it perturbs
`assumptions`	jsonb	levers: `adoption_rate`, `policy_shock`, `compute_growth`, `shock_year`…
`method`	text	`parametric` | `monte_carlo` | `bayesian`
`n_draws`	int	MC sample size
`is_public`	bool
`created_at`	timestamp

scenario_results (cached runs): scenario_id, year, p05..p95, point, run_id, computed_at.

---

6. Operational state (OLTP / Postgres)

`users`

field	type	notes
`user_id`	text PK	UUIDv7
`email`	citext UNIQUE
`password_hash`	text	argon2id (null if SSO-only)
`display_name`	text
`org_id`	FK → orgs	tenant
`tier`	text	`free` | `pro` | `enterprise` | `gov` | `university`
`role`	text	`member` | `admin` | `owner`
`email_verified_at`, `created_at`, `last_login_at`	timestamp
`mfa_enabled`	bool

`orgs` (tenants)

org_id PK, name, tier, seat_limit, data_residency, stripe_customer_id, created_at.

`api_keys`

field	type	notes
`api_key_id`	text PK
`org_id`, `created_by_user_id`	FK
`key_prefix`	text	shown in UI (`cc_live_ab12…`)
`key_hash`	text	sha256/argon2 of full key — secret never stored
`scopes`	text[]	`read:index`, `read:forecast`, `run:simulate`, `read:assess`…
`tier`	text	for rate-limit class
`rate_limit_rpm`	int	override
`last_used_at`, `expires_at`, `revoked_at`	timestamp
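
The `key_prefix` / `key_hash` split can be sketched as follows. This assumes the SHA-256 option from the table (reasonable for high-entropy random keys); `issue_key` and `verify_key` are hypothetical helper names, and the four visible characters after the prefix are an illustrative choice.

```python
import hashlib
import hmac
import secrets

PREFIX = "cc_live_"  # same prefix style as the UI example in the table


def issue_key() -> tuple[str, str, str]:
    """Return (full_key, key_prefix, key_hash); only the last two are persisted."""
    full_key = PREFIX + secrets.token_urlsafe(32)
    key_prefix = full_key[: len(PREFIX) + 4]
    key_hash = hashlib.sha256(full_key.encode("utf-8")).hexdigest()
    return full_key, key_prefix, key_hash


def verify_key(presented: str, stored_hash: str) -> bool:
    """Hash the presented key and compare in constant time."""
    candidate = hashlib.sha256(presented.encode("utf-8")).hexdigest()
    return hmac.compare_digest(candidate, stored_hash)


full_key, key_prefix, key_hash = issue_key()
```

The full key is shown to the user once at creation; afterwards only the prefix (for display) and the hash (for lookup) exist on the server.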

`subscriptions`

subscription_id, org_id, plan, status, stripe_subscription_id, current_period_end, metered_usage jsonb.

`assessments` (personal — PII-bearing, isolated)

field	type	notes
`assessment_id`	text PK
`user_id`	FK (nullable for anon)
`occupation_id`	FK	self-reported role
`inputs`	jsonb (encrypted at rest)	skills, tasks, region — PII
`cc_personal_score`	numeric	their exposure result
`recommendations`	jsonb	career-svc output
`created_at`	timestamp
`retention_expires_at`	timestamp	TTL for auto-purge

`audit_log`

audit_id, actor_user_id, org_id, action, target, ip, user_agent, at, metadata jsonb — append-only.

---

7. Metadata catalog (provenance)

`sources`

field	type	notes
`source_id`	text PK	`world_bank`, `oecd_ai`, `bls`, `onet`, `wipo`, `top500`, `hf`, `github`, `arxiv`
`name`, `org`, `url`	text
`license`	text	redistribution terms
`update_frequency`	text
`last_ingested_at`	timestamp	freshness
`reliability_tier`	int	1 (official stats) → 3 (scraped)

`methodology_versions`

methodology_version PK, index_id, changelog, effective_date, weights snapshot, author — every published number references the exact methodology that produced it, enabling deterministic re-derivation and audit.

---

Entity-relationship summary

countries 1─* subnationals 1─* cities
countries 1─* companies   *─1 industries (self-hierarchy)
occupations *─* occupation_tasks
indicators 1─* indicator_values ──┐
indices    1─* index_values       ├─ all FK source_id → sources
forecasts  1─* forecast_values    │              + methodology_version
scenarios  1─* scenario_results ──┘
orgs 1─* users 1─* api_keys ; orgs 1─* subscriptions
users 1─* scenarios ; users 1─* assessments(PII)
everything analytical → sources, methodology_versions  (provenance)

Data-Collection (ETL)

CognitiveCoefficient — ETL / Data-Collection Engine

The ETL engine is the moat: a continuously-updated, validated, normalized, fully-provenanced panel of the intelligence economy assembled from ~13 authoritative sources. The mantra: never trust a number without a source, a method, and a freshness stamp.

---

1. Sources

Source	What it provides	Entity level	Cadence	Access
World Bank (WDI/Indicators API)	GDP, population, R&D %GDP, education, internet/infra	country/subnational	annual	REST/JSON
IMF (SDMX / IFS, WEO)	macro, productivity, GDP forecasts	country	quarterly/annual	SDMX
OECD (SDMX)	productivity, ICT, skills, R&D, business demography	country	annual	SDMX
OECD.AI Policy Observatory	AI policies, AI investment, AI jobs, compute	country	irregular	API/CSV
Stanford AI Index	model counts, training compute, papers, $ investment, benchmarks	country/global	annual	CSV/report
BLS (OEWS, Employment Projections)	wages, employment, occupation projections	occupation	annual	API/flat files
US Census / ACS	industry employment, business formation, demographics	subnational/city/industry	annual	API
O*NET	task/skill/ability content per occupation	occupation/task	semi-annual	DB download
WIPO (PATENTSCOPE / IP statistics)	AI patent filings/grants	country/company	annual	bulk/API
HuggingFace Hub	model/dataset counts, downloads, org activity	company/country (by org)	continuous	API
GitHub	AI-repo activity, stars, contributors, language trends	company/country	continuous	GraphQL API
arXiv	AI/ML paper volume, authorship, affiliations	country/company	continuous	OAI-PMH/API
Top500 / Green500	supercomputer/compute capacity by country/org	country/company	semi-annual	scrape/CSV

(Extensible: Crunchbase/PitchBook for funding, LinkedIn-style talent feeds, Epoch AI compute, etc. as licensing allows.)

---

2. Orchestration — Airflow (target)

One source DAG per provider, plus transform/index/forecast DAGs, on a global schedule:

world_bank_ingest        @yearly + monthly freshness check
imf_ingest               @quarterly
oecd_sdmx_ingest         @yearly
oecd_ai_ingest           @weekly
stanford_ai_index_ingest @yearly (+ manual on release)
bls_ingest               @yearly
census_acs_ingest        @yearly
onet_ingest              @monthly
wipo_ingest              @yearly
huggingface_ingest       @daily
github_ingest            @daily
arxiv_ingest             @daily
top500_ingest            @monthly
   │
   ▼ (all land raw → lakehouse raw zone)
transform_normalize_dag  @daily   (dbt: raw→staging→marts, normalization)
build_indices_dag        triggered on new marts (index-svc methodology)
train_forecasts_dag      triggered on new indices (ML DAGs)
publish_dag              loads ClickHouse serving tables + invalidates ISR cache
data_quality_dag         Great Expectations gates + anomaly scan (every run)

Each source DAG: extract → land_raw → validate → normalize → upsert_staging. KubernetesExecutor scales workers; KEDA on queue depth. Backfills are date-partitioned and idempotent. Secrets (API tokens) from the secrets manager, never in code.

v1 shipped: these are ordered Python collector scripts run by cron/ctask, each emitting curated Parquet/CSV into app/data/; the DAG topology above is the lift-and-shift target. Same extract/validate/normalize functions, wrapped as Airflow tasks later.
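
The "ordered functions" shape of the v1 collectors might look like the following. All stage bodies are placeholders and the batch layout is hypothetical; the point is only that each stage takes and returns a batch, so the same functions can later be wrapped one-to-one as Airflow tasks.

```python
def extract(batch: dict) -> dict:
    # A real collector would call the source API; rows are hard-coded for the sketch.
    batch["rows"] = [
        {"entity_id": "USA", "year": 2023, "value": 3.4},
        {"entity_id": "DEU", "year": 2023, "value": 3.1},
    ]
    return batch


def land_raw(batch: dict) -> dict:
    batch["raw_row_count"] = len(batch["rows"])
    return batch


def validate(batch: dict) -> dict:
    bad = [row for row in batch["rows"] if row["value"] is None or row["value"] < 0]
    if bad:
        raise ValueError(f"{len(bad)} invalid rows; batch not published")
    return batch


def normalize(batch: dict) -> dict:
    for row in batch["rows"]:
        row["indicator_id"] = batch["indicator_id"]
    return batch


def upsert_staging(batch: dict) -> dict:
    batch["staged"] = {
        (row["indicator_id"], row["entity_id"], row["year"]): row["value"]
        for row in batch["rows"]
    }
    return batch


STAGES = [extract, land_raw, validate, normalize, upsert_staging]


def run_pipeline(batch: dict, stages=STAGES) -> dict:
    """Run stages in order; an exception in any stage stops the run."""
    for stage in stages:
        batch = stage(batch)
        batch.setdefault("completed", []).append(stage.__name__)
    return batch


result = run_pipeline({"source_id": "world_bank", "indicator_id": "rd_pct_gdp"})
```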

---

3. Pipeline stages

3.1 Extract & land (raw zone)

Pull via each source's native protocol (REST/SDMX/OAI-PMH/GraphQL/bulk). Land raw, immutable, partitioned by source/ingest_date in object storage (Parquet) — this raw zone is the audit substrate and enables full re-derivation. Capture HTTP metadata, response hash, and row counts.
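
A sketch of the landing step, under simplifying assumptions: the response body is a JSON array, it is stored as received (Parquet conversion is omitted because it needs pyarrow), and a local directory stands in for object storage. `land_raw` and the manifest layout are hypothetical.

```python
import hashlib
import json
import tempfile
from datetime import date
from pathlib import Path


def land_raw(root: Path, source_id: str, ingest_date: date, body: bytes, http_meta: dict) -> Path:
    """Write the raw payload plus a manifest under source=/ingest_date= partitions."""
    partition = root / f"source={source_id}" / f"ingest_date={ingest_date.isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)

    digest = hashlib.sha256(body).hexdigest()
    payload_path = partition / f"{digest[:16]}.json"
    if not payload_path.exists():  # immutable zone: never overwrite a landed file
        payload_path.write_bytes(body)

    manifest = {
        "source_id": source_id,
        "sha256": digest,
        "row_count": len(json.loads(body)),  # assumes a JSON array of rows
        "http": http_meta,
    }
    manifest_path = partition / f"{digest[:16]}.manifest.json"
    manifest_path.write_text(json.dumps(manifest, sort_keys=True))
    return payload_path


body = b'[{"entity_id": "USA", "value": 3.4}, {"entity_id": "DEU", "value": 3.1}]'
with tempfile.TemporaryDirectory() as tmp:
    landed = land_raw(Path(tmp), "world_bank", date(2024, 1, 15), body, {"status": 200})
    relanded = land_raw(Path(tmp), "world_bank", date(2024, 1, 15), body, {"status": 200})
    landed_bytes = landed.read_bytes()
    partition_name = landed.parent.name
    manifest = json.loads(landed.with_name(landed.stem + ".manifest.json").read_text())
```

Naming the file by content hash makes a repeated landing of the same response a no-op, which keeps backfills idempotent.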

3.2 Validate (gate)

Great Expectations suites per source: schema/type conformance, value ranges (e.g. shares ∈ [0,1], wages > 0), referential integrity (entity codes resolve to dimensions), expected row-count/coverage. A failed suite quarantines the batch (does not publish) and alerts — bad data never reaches the warehouse.
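
The gate logic can be illustrated in plain Python. This is a stand-in for the idea, not Great Expectations code; `check_batch`, `gate`, and the row fields (`entity_id`, `unit`, `value`) are hypothetical.

```python
def check_batch(rows: list, known_entities: set, expected_min_rows: int) -> list:
    """Return a list of human-readable failures; an empty list means the batch passes."""
    failures = []
    if len(rows) < expected_min_rows:
        failures.append(f"coverage: {len(rows)} rows, expected at least {expected_min_rows}")
    for i, row in enumerate(rows):
        value = row.get("value")
        if not isinstance(value, (int, float)):
            failures.append(f"row {i}: value is not numeric")
            continue
        if row.get("unit") == "share" and not 0.0 <= value <= 1.0:
            failures.append(f"row {i}: share {value} outside [0, 1]")
        if row.get("entity_id") not in known_entities:
            failures.append(f"row {i}: unknown entity {row.get('entity_id')!r}")
    return failures


def gate(rows: list, known_entities: set, expected_min_rows: int = 1) -> tuple:
    """Quarantine the whole batch on any failure; never publish a partial batch."""
    failures = check_batch(rows, known_entities, expected_min_rows)
    return ("quarantined" if failures else "accepted", failures)


entities = {"USA", "DEU"}
good = [{"entity_id": "USA", "unit": "share", "value": 0.42}]
bad = [
    {"entity_id": "USA", "unit": "share", "value": 1.4},
    {"entity_id": "XXX", "unit": "share", "value": 0.2},
]
good_status, good_failures = gate(good, entities)
bad_status, bad_failures = gate(bad, entities)
```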

3.3 Normalize & conform

Entity resolution: map every source's geo/company/industry/occupation identifiers to canonical keys (ISO-3166, NAICS/ISIC, O*NET-SOC, internal company_id). Crosswalk tables are maintained in the catalog (e.g. SOC↔O*NET, NAICS↔ISIC, GitHub-org↔company).
Unit & currency harmonization: convert to common units, deflate/PPP-adjust where needed.
Indicator normalization: min-max → 0–100 and/or z-score per methodology, with direction (higher/lower-better) applied. Output indicator_values.value + value_normalized.
Temporal alignment: annualize/interpolate sub-annual series to the common panel grid; flag interpolated points.
Imputation: model-based gap-fill (XGBoost) for sparse-but-needed indicators, always written with is_imputed = true and a confidence score.
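
The indicator-normalization step above can be sketched as two small functions. The handling of degenerate inputs (all values equal) is an assumption, since the methodology for that case is not stated here.

```python
from statistics import mean, pstdev


def minmax_0_100(values: dict, direction: str = "higher_better") -> dict:
    """Min-max scale to 0-100; flip the scale for lower_better indicators."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {k: 50.0 for k in values}  # assumption: no spread maps to the midpoint
    scaled = {k: 100.0 * (v - lo) / (hi - lo) for k, v in values.items()}
    if direction == "lower_better":
        scaled = {k: 100.0 - s for k, s in scaled.items()}
    return scaled


def zscore(values: dict, direction: str = "higher_better") -> dict:
    """Population z-score; sign is flipped for lower_better indicators."""
    xs = list(values.values())
    mu, sigma = mean(xs), pstdev(xs)
    if sigma == 0:
        return {k: 0.0 for k in values}
    sign = 1.0 if direction == "higher_better" else -1.0
    return {k: sign * (v - mu) / sigma for k, v in values.items()}


raw = {"USA": 10.0, "DEU": 20.0, "FRA": 30.0}
higher = minmax_0_100(raw)
lower = minmax_0_100(raw, direction="lower_better")
z = zscore(raw)
z_lower = zscore(raw, direction="lower_better")
```

After the direction flip, a larger normalized value always means "better", so downstream index weights never need to know an indicator's direction.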

3.4 Load & publish

dbt builds staging→marts (the conformed panel); build_indices_dag computes index_values from current methodology_version; serving tables loaded to ClickHouse (v1: DuckDB files in app/data/); CDN/ISR cache invalidated so the terminal shows fresh numbers.

---

4. Data quality & anomaly detection

Validation gates (above) block structurally-bad batches.
Statistical anomaly detection on every refresh: year-over-year jumps beyond entity-specific thresholds, outliers vs. peer distribution (robust z / IQR), break-point detection on long series, and cross-source disagreement checks (e.g. World Bank vs. IMF GDP must agree within tolerance). Anomalies are flagged, not silently dropped — surfaced to a review queue and lower the confidence score on affected values.
Confidence scoring: every indicator_value carries confidence (0–1) derived from source reliability_tier, recency, imputation flag, and anomaly status — this propagates into index uncertainty.
Freshness SLAs: sources.last_ingested_at monitored; stale sources alert and the catalog/UI badge the data age. Lineage (raw → staging → mart → index → forecast) is tracked end-to-end so any published number traces to its raw rows, source, and methodology version.
Observability: each DAG emits row counts, null rates, validation pass/fail, and runtime to Prometheus/Grafana; failures page via the alerting channel.
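
Two of the anomaly checks listed above, sketched in plain Python. The robust z uses the standard median/MAD form with the 0.6745 scaling constant; the cut-offs (3.5 for robust z, 25% for year-over-year) are illustrative, not the entity-specific thresholds the pipeline would use.

```python
from statistics import median


def robust_z(values: dict) -> dict:
    """Modified z-score: 0.6745 * (x - median) / MAD, resistant to the outliers it detects."""
    med = median(values.values())
    mad = median(abs(v - med) for v in values.values())
    if mad == 0:
        return {k: 0.0 for k in values}
    return {k: 0.6745 * (v - med) / mad for k, v in values.items()}


def yoy_jumps(series: dict, threshold: float) -> list:
    """Return the years whose relative change vs. the previous year exceeds threshold."""
    years = sorted(series)
    flagged = []
    for prev, cur in zip(years, years[1:]):
        if series[prev] != 0 and abs(series[cur] / series[prev] - 1.0) > threshold:
            flagged.append(cur)
    return flagged


peers = {"A": 10.0, "B": 11.0, "C": 12.0, "D": 13.0, "E": 100.0}
scores = robust_z(peers)
outliers = [k for k, s in scores.items() if abs(s) > 3.5]
jumps = yoy_jumps({2020: 100.0, 2021: 104.0, 2022: 160.0, 2023: 165.0}, threshold=0.25)
```

Flagged points would go to the review queue and have their confidence lowered rather than being dropped, in line with the policy above.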

Forecasting & ML

CognitiveCoefficient — Forecasting & ML Architecture

The ML layer answers two questions the indices alone cannot: "where is each entity headed?" (forecasting) and "what if?" (scenario simulation). Both are first-class, uncertainty-quantified products, not decorations on a chart.

Design stance: technology diffusion is the right prior. AI adoption across countries/industries/occupations follows S-curves, so the model library is anchored on diffusion theory (Bass, logistic, Gompertz) and augmented with statistical/ML models (Prophet, XGBoost) and a Bayesian core (PyMC) for principled uncertainty.

---

1. Model library & roles

1.1 Diffusion / adoption curves (structural backbone)

Bass diffusion — the workhorse for adoption of a technology in a population (share of firms/workers using AI, share of tasks automated). Parameters: p (innovation/external influence), q (imitation/word-of-mouth), m (market potential / ceiling). Fit by NLS or Bayesian; gives interpretable, bounded, theory-grounded trajectories.
Logistic (Verhulst) — symmetric S-curve for saturating indicators (compute density, model-capability proxies) where inflection is near the midpoint.
Gompertz — asymmetric S-curve (slow start, fast middle, long tail) — better for capability/penetration metrics that decelerate late. We fit all three and select by backtest + information criteria per series.
Role: primary for bounded, saturating indicators and indices (adoption %, exposure share). Their parameters are also the levers the scenario engine perturbs.
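
For reference, the three curves in their usual closed forms. The parameterizations of the logistic (`k`, `t0`) and Gompertz (`b`, `c`) curves are common textbook choices and an assumption about what the model library uses; the numeric values are illustrative only.

```python
import math


def bass(t: float, p: float, q: float, m: float) -> float:
    """Cumulative Bass adoption at time t, with t = 0 at launch."""
    decay = math.exp(-(p + q) * t)
    return m * (1.0 - decay) / (1.0 + (q / p) * decay)


def logistic(t: float, m: float, k: float, t0: float) -> float:
    """Symmetric S-curve with ceiling m, growth rate k, and inflection at t0."""
    return m / (1.0 + math.exp(-k * (t - t0)))


def gompertz(t: float, m: float, b: float, c: float) -> float:
    """Asymmetric S-curve with ceiling m; inflection sits at m / e, below the midpoint."""
    return m * math.exp(-b * math.exp(-c * t))


bass_curve = [bass(t, p=0.03, q=0.38, m=1.0) for t in range(0, 21)]
logistic_mid = logistic(5.0, m=100.0, k=0.8, t0=5.0)
gompertz_inflection = gompertz(math.log(5.0) / 0.5, m=100.0, b=5.0, c=0.5)
```

The asymmetry is visible in the inflection points: the logistic turns at half the ceiling, the Gompertz at roughly 37% of it.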

1.2 Prophet

Role: fast, robust forecasting for indicators with trend + seasonality + holidays/regime shifts and irregular history (many official-stat series are annual/quarterly with gaps). Used as a strong baseline and for series without a clear saturation ceiling. Native uncertainty intervals; decomposable (trend/seasonal components shown in the terminal).

1.3 XGBoost (gradient-boosted trees)

Role: cross-sectional + panel forecasting where an entity's trajectory depends on covariates — e.g. predict a country's next-year CC pillar from its GDP, R&D spend, patent flow (WIPO), GitHub/HF activity, talent metrics. Captures nonlinear interactions across the rich indicator feature space. Trained on the panel (entity × year × indicators); quantile objective (reg:quantileerror) yields p05/p50/p95 directly. Also powers gap-filling/imputation of sparse indicators (with is_imputed flagged).

1.4 PyMC (Bayesian core)

Role: the principled-uncertainty engine. Three uses:

Bayesian Bass/logistic — priors on p,q,m (informed by peer entities / hierarchical pooling), full posterior over curve parameters → posterior predictive trajectories with honest bands, even with short history.
Hierarchical models — partial pooling across entities (countries within a region, occupations within a family) so data-poor entities borrow strength from data-rich peers.
Scenario priors — analyst assumptions enter as priors/interventions, propagated to the posterior.

MCMC (NUTS) for inference; convergence reported (r_hat, ESS) and surfaced in /simulate responses.

1.5 Model selection

Per (target, entity) we maintain a small ensemble/championship: fit Bass/logistic/Gompertz/Prophet/XGBoost, backtest (rolling-origin), and pick by out-of-sample error (MAPE/MASE) with a structural-model tie-break (prefer diffusion models when fit is comparable, for interpretability and lever-ability). The chosen model + model_version is stored on every forecast row.
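
A sketch of the championship logic with rolling-origin backtesting. The two candidate forecasters are deliberately trivial stand-ins (not Bass, Prophet, or XGBoost), and the `tolerance` of 0.5 MAPE points for "comparable fit" is a hypothetical value.

```python
def mape(actual: list, predicted: list) -> float:
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, predicted)) / len(actual)


def naive_last(history: list, horizon: int) -> list:
    return [history[-1]] * horizon


def linear_drift(history: list, horizon: int) -> list:
    slope = (history[-1] - history[0]) / (len(history) - 1)
    return [history[-1] + slope * (h + 1) for h in range(horizon)]


def rolling_origin_mape(series: list, model, min_train: int = 3, horizon: int = 1) -> float:
    """Refit at each origin on data up to that point, score on the next `horizon` points."""
    errors = []
    for origin in range(min_train, len(series) - horizon + 1):
        train, test = series[:origin], series[origin:origin + horizon]
        errors.append(mape(test, model(train, horizon)))
    return sum(errors) / len(errors)


def select_champion(series: list, candidates: dict, structural=frozenset(), tolerance: float = 0.5):
    """Pick the lowest-error model, preferring structural ones when fit is comparable."""
    scores = {name: rolling_origin_mape(series, fn) for name, fn in candidates.items()}
    best = min(scores.values())
    close = [name for name, score in scores.items() if score - best <= tolerance]
    preferred = [name for name in close if name in structural] or close
    return min(preferred, key=scores.get), scores


series = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
champion, scores = select_champion(series, {"naive_last": naive_last, "linear_drift": linear_drift})
tie_winner, _ = select_champion(
    series,
    {"flexible": naive_last, "diffusion": naive_last},  # identical error on purpose
    structural={"diffusion"},
)
```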

---

2. Scenario simulation

Three modes, all exposed via /scenario and /simulate:

2.1 Parametric / deterministic

Analyst sets levers (adoption rate p,q, ceiling m, a policy_shock at year t, a compute_growth multiplier). Engine perturbs the fitted diffusion parameters and re-projects a single trajectory + analytically-propagated band. Fast, used for the interactive sliders in the terminal.
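A minimal sketch of the parametric mode, assuming a discrete-time Bass recursion. How a policy_shock enters the model is not specified above; here it is assumed to rescale the imitation coefficient q from the shock year onward. The function name and the parameter values are illustrative only.

```python
def bass_trajectory(p, q, m, start_year, horizon, f0=0.0,
                    shock_year=None, shock_delta=0.0):
    """Discrete-time Bass diffusion; returns [(year, adoption_level)]."""
    f, out = f0, []
    for step in range(horizon):
        year = start_year + step
        # Assumption: a policy shock rescales q from shock_year onward.
        shocked = shock_year is not None and year >= shock_year
        q_eff = q * (1.0 + shock_delta) if shocked else q
        # New adopters this period: (innovation + imitation) x remaining share.
        f = min(1.0, f + (p + q_eff * f) * (1.0 - f))
        out.append((year, m * f))
    return out

baseline = bass_trajectory(p=0.03, q=0.45, m=100.0, start_year=2026, horizon=8)
shocked = bass_trajectory(p=0.03, q=0.45, m=100.0, start_year=2026, horizon=8,
                          shock_year=2027, shock_delta=-0.1)

for (year, base), (_, alt) in zip(baseline, shocked):
    print(year, round(base, 1), round(alt, 1))
```

Because a slider change only reruns this short loop, the single-trajectory mode stays cheap enough for interactive use.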

2.2 Monte-Carlo

Sample parameters/shocks from their distributions (n_draws 1k–50k by tier), propagate each through the curve, and aggregate to a fan (p05…p95). Captures interaction of multiple uncertain levers (e.g. uncertain q × stochastic policy shock). Vectorized (NumPy) for latency; large runs go async via the job queue and cache to scenario_results.
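The sampling-and-aggregation loop might look like the stdlib sketch below. The lever distributions (Gaussian p and q, a 30% chance of a dampening shock) are invented for illustration, and the shock is simplified to apply over the whole path; a production version would be vectorized as the text describes.

```python
import random
import statistics

def bass_path(p, q, m, horizon):
    """Discrete-time Bass curve, returned as adoption levels per step."""
    f, path = 0.0, []
    for _ in range(horizon):
        f = min(1.0, f + (p + q * f) * (1.0 - f))
        path.append(m * f)
    return path

def monte_carlo_fan(n_draws, horizon, seed=0):
    rng = random.Random(seed)
    paths = []
    for _ in range(n_draws):
        # Assumed lever distributions, for illustration only.
        p = max(1e-4, rng.gauss(0.03, 0.005))
        q = max(0.0, rng.gauss(0.45, 0.05))
        if rng.random() < 0.3:      # stochastic policy shock dampens imitation
            q *= 0.9
        paths.append(bass_path(p, q, 100.0, horizon))
    fan = []
    for step in range(horizon):
        values = [path[step] for path in paths]
        # n=20 yields 19 cut points at 5%, 10%, ..., 95%.
        cuts = statistics.quantiles(values, n=20)
        fan.append({"step": step + 1, "p05": cuts[0], "p50": cuts[9], "p95": cuts[18]})
    return fan

fan = monte_carlo_fan(n_draws=2000, horizon=8)
print(fan[-1])
```

Seeding the generator makes a run reproducible, which matters if results are cached to scenario_results and later audited.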

2.3 Bayesian scenarios

The PyMC posterior is the scenario distribution: interventions are applied as do-operations (for example with pm.do), and the posterior predictive gives the counterfactual fan with full parameter uncertainty. This is the rigorous tier (Enterprise/Government) for policy analysis.

---

3. Uncertainty quantification (non-negotiable)

Every forecast/scenario ships bands, not just points:

Bayesian models → posterior predictive quantiles.
XGBoost → quantile regression (p05/p25/p50/p75/p95).
Prophet → built-in intervals (with mcmc_samples for full Bayesian intervals on key series).
Diffusion NLS → bootstrap/delta-method bands.

Bands are stored (forecast_values.p05..p95), rendered as fans in the UI, and returned in the API. We also publish backtest accuracy (backtest_mape) alongside each forecast so users can weight it — honesty about uncertainty is a product differentiator vs. point-estimate competitors.

---

4. Training, serving, and the v1 path

Target: training runs as Airflow DAGs (one per model-family/target) on a schedule after ETL publishes new data; artifacts (fitted params, serialized models, posterior summaries) versioned in a model registry (MLflow). forecast-svc loads artifacts and serves; scenario-svc runs Monte-Carlo/Bayesian on demand. Heavy training on GPU/CPU workers (the homelab 6×3090 tier is ideal for PyMC/XGBoost sweeps).
v1 shipped: train offline, serve artifacts. Models are fit in batch/notebooks, serialized into app/ml/, and the consolidated FastAPI app loads them at startup. /forecast serves precomputed forecast_values; /simulate runs bounded Monte-Carlo in-process (draw caps + async fallback) so latency stays sane without a separate service. Same model code, same artifacts — only the orchestration and serving topology change on the way to target.

---

5. Model monitoring & governance

Drift detection: PSI and KS statistics computed on input indicator distributions for each ETL run; when inputs drift beyond a threshold, the affected models are flagged for retraining.
Forecast accuracy tracking: as actuals arrive, compute realized error vs. prior forecasts → rolling MAPE/MASE dashboards per model/entity; auto-demote a champion if a challenger beats it.
Anomaly guardrails: reject/quarantine forecasts that violate structural constraints (e.g. adoption > ceiling, negative counts).
Reproducibility: model_version + input methodology_version + raw-data snapshot pin every forecast; any published number can be re-derived.
Reporting: convergence diagnostics (r_hat, ESS) surfaced to users for Bayesian runs; backtest cards published in the catalog. Metrics flow to Prometheus/Grafana; retraining and demotions are logged to audit_log.
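For the drift check, a Population Stability Index can be computed in a few lines. The sketch below is a simplified illustration: equal-width bins derived from the reference sample, a small eps floor for empty bins, and synthetic data are all assumptions, and any retrain threshold would be a project choice rather than something this document fixes.

```python
import math

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a reference and a new sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            # Clamp out-of-range values into the first or last bin.
            idx = min(n_bins - 1, max(0, int((v - lo) / width)))
            counts[idx] += 1
        # Floor empty bins at eps so the log ratio stays finite.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

reference = [i / 100 for i in range(1000)]    # previous run's indicator values
same = list(reference)                         # unchanged inputs
shifted = [v + 4.0 for v in reference]         # inputs drifted upward

print(psi(reference, same), round(psi(reference, shifted), 2))
```

An unchanged distribution scores zero, while the shifted sample scores far above the rule-of-thumb levels often used to trigger a retrain.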

API Specification

CognitiveCoefficient API — Specification

Base URL: https://api.cognitivecoefficient.com/v1
Content type: application/json. All timestamps are ISO-8601 UTC. All money is in USD unless noted. The same API backs the web terminal (v1: served by the consolidated FastAPI app) and external customers.

---

Authentication

Two mechanisms:

API key (machine): header Authorization: Bearer cc_live_<key> or X-API-Key: cc_live_<key>. Keys are tier-scoped and scope-limited (read:index, read:forecast, run:simulate, read:assess…).
OIDC/JWT (browser/session): short-lived access token from the platform's auth service; refresh via cookie. Used by the web terminal.

Keys are shown once on creation; only a hash is stored. Test keys use the cc_test_ prefix and read frozen fixture data.
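The "shown once, stored as a hash" flow can be sketched as below. The hash function (SHA-256) and the token length are assumptions; the document only states that a hash is stored. The helper names are hypothetical.

```python
import hashlib
import hmac
import secrets

def create_api_key(live=True):
    """Return (plaintext_key, key_hash); only the hash should be persisted."""
    prefix = "cc_live_" if live else "cc_test_"
    key = prefix + secrets.token_urlsafe(32)     # high-entropy random part
    return key, hashlib.sha256(key.encode()).hexdigest()

def verify_api_key(presented, stored_hash):
    candidate = hashlib.sha256(presented.encode()).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(candidate, stored_hash)

key, key_hash = create_api_key()
print(key[:8], verify_api_key(key, key_hash))
```

A fast hash is usually considered adequate here because the key itself is long and random, unlike a user-chosen password.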

401 unauthenticated · 403 scope/tier denied · 429 rate-limited

---

Tiers & rate limits

Tier	Requests/min	Daily cap	Concurrency	Forecast horizon	Bulk export	Scopes
Free	30	1,000	2	≤3 yrs	—	read:* (current year + 5y history)
Professional	120	50,000	10	≤10 yrs	CSV ≤1M rows	+ run:simulate, full history
API (metered)	600	per-plan	50	≤15 yrs	Parquet/JSON	all read + simulate
Enterprise	2,000	custom	200	≤25 yrs	full lakehouse export	all + read:assess (own tenant)
Government	custom	custom	custom	full	full	all + subnational deep-dives
University	300	100,000	20	≤15 yrs	CSV/Parquet (research)	all read + simulate

Rate-limit headers on every response:

X-RateLimit-Limit: 120
X-RateLimit-Remaining: 117
X-RateLimit-Reset: 1750700000
Retry-After: 12        # only on 429

Limits enforced at the gateway with a Redis token bucket; metered tiers also emit usage events to billing.
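To make the token-bucket behaviour concrete, here is an in-memory sketch. It is not the Redis implementation: state lives in a Python object, the clock is injectable for testing, and the class and parameter names are assumptions. The wait value it returns corresponds to what a Retry-After header would carry.

```python
import math
import time

class TokenBucket:
    def __init__(self, rate_per_min, burst=None, clock=time.monotonic):
        self.capacity = float(burst if burst is not None else rate_per_min)
        self.refill_per_sec = rate_per_min / 60.0
        self.clock = clock
        self.tokens = self.capacity
        self.updated = clock()

    def allow(self, cost=1.0):
        """Return (allowed, retry_after_seconds)."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True, 0
        # Seconds until enough tokens accumulate for this request.
        return False, math.ceil((cost - self.tokens) / self.refill_per_sec)

bucket = TokenBucket(rate_per_min=30)      # Free tier: 30 requests/min
print(bucket.allow())
```

In a shared deployment the refill and decrement would have to happen atomically in the store, which is the usual reason for running this logic inside Redis.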

Pagination: ?limit= (default 50, max 1000) + ?cursor=.
Filtering: ?year=, ?from=&to=, ?fields=.

Standard error envelope:

{ "error": { "code": "tier_forbidden", "message": "Forecast horizon >10y is not available on the Professional tier.", "request_id": "req_01J..." } }
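A tier gate that produces this envelope could be sketched as follows. The horizon limits are taken from the tier table above; the function name, the lowercase tier keys, and the exact message wording are assumptions.

```python
# Maximum forecast horizon per tier, in years (None = no cap).
MAX_HORIZON_YEARS = {
    "free": 3, "professional": 10, "api": 15,
    "university": 15, "enterprise": 25, "government": None,
}

def check_horizon(tier, horizon_years, request_id):
    """Return None if the request is allowed, else an error envelope."""
    limit = MAX_HORIZON_YEARS[tier]
    if limit is None or horizon_years <= limit:
        return None
    return {"error": {
        "code": "tier_forbidden",
        "message": f"Forecast horizon >{limit}y is not available on the {tier} tier.",
        "request_id": request_id,
    }}

# "req_demo" is a placeholder request id for illustration.
print(check_horizon("professional", 12, "req_demo"))
```

Returning the envelope as data rather than raising lets the gateway attach the 403 status and rate-limit headers in one place.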

---

Endpoints

Geographies

GET /country                       list + rank, filter by region/income_group
GET /country/{iso3}                profile + headline CC + pillars
GET /country/{iso3}/indicators     all indicator_values (year filters)
GET /country/{iso3}/forecast       forecast for cc_global (horizon by tier)
GET /state/{iso3166_2}             subnational profile (e.g. US-CA)
GET /state?country={iso3}          subnationals of a country, ranked
GET /city/{city_id}                city profile + rank
GET /city?country=&tech_hub=true   city search/leaderboard

Organizations & work

GET /company/{company_id}          company profile + ai_adoption_score
GET /company?industry=&country=    search / peer leaderboard
GET /industry/{code}               industry panel + exposure scores
GET /industry/{code}/occupations   occupations weighted into the industry
GET /occupation/{soc}              occupation exposure + tasks + wage/employment
GET /occupation?exposure_gt=0.6    screen occupations by exposure

Indices, forecasts, scenarios

GET /index                         list published indices + methodology_version
GET /index/{index_id}              definition, weights, pillars
GET /index/{index_id}/values       ranked index_values (entity_level, year)
GET /forecast/{target_id}          point + p05..p95 bands, model metadata
POST /scenario                     create a scenario (assumptions → run)
GET  /scenario/{scenario_id}       fetch a saved/preset scenario + results
POST /simulate                     run an ad-hoc scenario (no persistence)

Personal (PII-bearing, scope read:assess)

POST /assess                       personal CC-exposure assessment
GET  /assess/{assessment_id}       retrieve (owner only)
GET  /career?occupation=&region=   career/reskilling recommendations

Example responses

GET /country/USA

{
  "country_id": "USA", "name": "United States", "region": "North America",
  "income_group": "High", "cc_global": {
    "score": 87.4, "rank_global": 1, "percentile": 99.6, "year": 2025,
    "methodology_version": "cc_global@2025.2",
    "pillars": { "research": 91.2, "talent": 84.0, "infrastructure": 88.7,
                 "adoption": 85.1, "policy": 79.3 }
  },
  "trend": { "2021": 79.1, "2022": 81.6, "2023": 83.9, "2024": 85.8, "2025": 87.4 },
  "sources": ["oecd_ai","stanford_ai_index","world_bank","wipo","github"]
}

GET /occupation/15-1252.00 (Software Developers)

{
  "occupation_id": "15-1252.00", "title": "Software Developers",
  "median_wage_usd": 132270, "employment": 1692100, "job_zone": 4,
  "cc_exposure": { "automation_exposure": 0.31, "augmentation_exposure": 0.78,
                   "ai_complementarity": 0.62, "cc_exposure_index": 58.3 },
  "interpretation": "High augmentation, low displacement — AI complements core tasks.",
  "growth_outlook": "Much faster than average",
  "top_exposed_tasks": [
    { "task": "Write/test/debug code", "ai_exposure": 0.84 },
    { "task": "Design software architecture", "ai_exposure": 0.41 }
  ]
}

POST /simulate

// request
{ "target_id": "cc_industry:5415", "method": "monte_carlo", "n_draws": 5000,
  "horizon_years": 8,
  "assumptions": { "adoption_curve": "bass", "p": 0.03, "q": 0.45,
                   "policy_shock": { "year": 2027, "delta": -0.1 } } }
// response
{ "target_id": "cc_industry:5415", "method": "monte_carlo", "n_draws": 5000,
  "trajectory": [
    { "year": 2026, "p50": 61.2, "p05": 58.1, "p95": 64.0 },
    { "year": 2030, "p50": 78.9, "p05": 70.4, "p95": 86.1 },
    { "year": 2033, "p50": 88.3, "p05": 79.0, "p95": 94.7 }
  ],
  "convergence": { "rhat_max": 1.01, "ess_min": 1820 },
  "model_version": "scenario_engine@1.4" }

POST /assess (PII)

// request
{ "occupation_soc": "43-3031.00", "region": "US-TX",
  "skills": ["bookkeeping","quickbooks","reconciliation"], "tenure_years": 8 }
// response
{ "assessment_id": "asm_01J...", "cc_personal_score": 71.5,
  "exposure_band": "high-automation",
  "drivers": [ {"factor":"routine data entry","weight":0.4},
               {"factor":"rules-based reconciliation","weight":0.3} ],
  "recommendations": [
    { "path": "FP&A / financial analyst", "skill_gap": ["modeling","sql"],
      "transition_difficulty": "moderate", "wage_delta_usd": 24000 }
  ],
  "retention_expires_at": "2026-12-23T00:00:00Z" }

---

GraphQL

A federated GraphQL gateway (https://api.cognitivecoefficient.com/graphql) sits over the same services for clients that need precise field selection and cross-entity joins in one round-trip — e.g. "countries in the OECD with their top-3 most AI-exposed industries and 2030 forecast bands." Same auth/tiers/rate-limits as REST; query cost analysis caps complexity per tier (Free 1k points, Enterprise 50k). Persisted queries required for Free tier.

query { country(iso3:"DEU") {
  name ccGlobal { score rank }
  industries(first:3, orderBy: EXPOSURE_DESC) {
    name automationExposure forecast(year:2030){ p50 p05 p95 } } } }

Bulk export

GET /export creates an async job (large extracts never block a request). Tiers gate format and volume: Free none; Professional CSV ≤1M rows; API Parquet/JSON; University CSV/Parquet; Enterprise/Government signed access to curated lakehouse partitions (Parquet/Iceberg) with row/column scoping by tenant. Clients poll the job until it completes and returns a time-limited signed URL; every export is recorded in audit_log with the requesting key and filter set.
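One common way to build a time-limited, tenant-bound download link is an HMAC over the job, tenant, and expiry. The sketch below assumes that approach; the URL layout, the parameter names, and the hard-coded demo key are hypothetical, and object stores typically provide their own presigned-URL mechanism that could be used instead.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

SIGNING_KEY = b"demo-only-secret"   # hypothetical; a real key comes from the secrets manager

def _signature(job_id, org_id, expires):
    payload = f"{job_id}:{org_id}:{expires}".encode()
    return hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()

def sign_export_url(base_url, job_id, org_id, ttl_seconds=900, now=None):
    expires = int((now if now is not None else time.time()) + ttl_seconds)
    query = urlencode({"org": org_id, "expires": expires,
                       "sig": _signature(job_id, org_id, expires)})
    return f"{base_url}/{job_id}?{query}"

def verify_export_url(url, now=None):
    parsed = urlparse(url)
    params = {k: v[0] for k, v in parse_qs(parsed.query).items()}
    job_id = parsed.path.rsplit("/", 1)[-1]
    expected = _signature(job_id, params["org"], params["expires"])
    fresh = (now if now is not None else time.time()) < int(params["expires"])
    # Both conditions must hold: not expired and signature matches.
    return fresh and hmac.compare_digest(expected, params["sig"])

url = sign_export_url("https://exports.example.com/jobs", "job_123", "org_a", now=1000)
print(url)
```

Because the tenant id is part of the signed payload, a link issued to one tenant cannot be rewritten to point at another tenant's scope.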

Security & Privacy

CognitiveCoefficient — Security & Privacy Architecture

Threat posture: a multi-tenant data platform serving Free→Government tiers, holding (a) valuable proprietary indices/forecasts, (b) customer billing data, and (c) sensitive personal career-exposure assessments (PII). The top priorities are tenant isolation and PII protection; the published indices are public-facing, but unpublished forecasts and bulk extracts are gated.

---

1. Authentication

Users: OIDC/OAuth2 (platform IdP + optional SSO/SAML for Enterprise/Government/University tenants). Passwords (where used) argon2id; MFA (TOTP/WebAuthn) available, mandatory for admin/owner and all Government tenants.
Sessions: short-lived JWT access tokens (≤15 min) + rotating refresh tokens in HttpOnly, Secure, SameSite=Strict cookies. Token revocation list checked at the gateway.
API keys: prefixed (cc_live_/cc_test_), high-entropy, shown once, stored only as a hash (key_hash). Keys carry scopes, tier, expires_at; rotation and revocation are first-class (revoked_at). Per-key last_used_at enables stale-key detection.

2. Authorization

RBAC within a tenant (owner/admin/member) + scope-based authz on API keys (read:index, run:simulate, read:assess…).
Tier enforcement at the gateway: horizon limits, export formats, endpoint access (e.g. subnational deep-dives gated to Government, read:assess to the owning tenant only).
Deny-by-default; every privileged action authorized server-side (never trust the client).

3. Tenant isolation

Shared-schema, tenant-scoped: every OLTP row carries org_id; Postgres Row-Level Security policies enforce org_id = current_setting('app.org_id') on all tenant tables, as defense-in-depth so that an app-layer bug can't cross tenants. The app.org_id setting is set per request from the authenticated principal.
Analytical data is public/global (indices/indicators), so the warehouse isn't tenant-partitioned — but tenant-specific artifacts (saved scenarios, assessments, exports) are org_id-scoped and RLS-protected.
Enterprise/Government can opt into stronger isolation: dedicated schema or dedicated DB, and data-residency pinning (orgs.data_residency) for region-locked deployments.
Export jobs are row/column-scoped to the requesting tenant's entitlements; signed URLs are short-lived and single-tenant.

4. PII handling (personal assessments)

Career assessments contain self-reported role/skills/region — treated as sensitive PII:

Data minimization: collect only what the assessment needs; anonymous assessments allowed (no user_id).
Encryption at rest of assessments.inputs (column-level / envelope encryption via KMS) — separate from the public analytical store.
Logical separation: PII lives in the OLTP/assessment-svc boundary, never copied into the analytical warehouse or training data without aggregation/anonymization.
Retention & right-to-erasure: retention_expires_at TTL auto-purges; user-initiated delete cascades and is logged. GDPR/CCPA DSAR (access/export/delete) supported.
No PII in logs/telemetry: scrubbing middleware; assessment inputs never logged in plaintext. Aggregate/anonymized stats only for analytics.
Consent: explicit consent gate before an assessment is stored; clear processing notice.
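The log-scrubbing control above could be approximated with a standard logging filter. This is a sketch under assumptions: the sensitive field names are taken from the /assess request example plus assessments.inputs, the placeholder text is arbitrary, and it only covers structured log arguments, not free-text messages.

```python
import logging

# Field names assumed from the /assess request shape.
SENSITIVE_KEYS = {"skills", "occupation_soc", "region", "tenure_years", "inputs"}

def scrub(value):
    """Recursively replace sensitive fields with a placeholder."""
    if isinstance(value, dict):
        return {k: "[REDACTED]" if k in SENSITIVE_KEYS else scrub(v)
                for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return type(value)(scrub(v) for v in value)
    return value

class PiiScrubFilter(logging.Filter):
    def filter(self, record):
        # Scrub structured args before the message is formatted.
        if isinstance(record.args, dict):
            record.args = scrub(record.args)
        elif isinstance(record.args, tuple):
            record.args = tuple(scrub(a) for a in record.args)
        return True     # never drop the record, only redact it

logger = logging.getLogger("assessment")
logger.addFilter(PiiScrubFilter())
logger.warning("assessment payload: %s", {"skills": ["bookkeeping"], "status": "ok"})
```

Attaching the filter at the logger (or handler) level means individual call sites do not have to remember to redact.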

5. Secrets management

All secrets (DB creds, source API tokens, KMS keys, Stripe keys, signing keys) in a secrets manager (Vault / cloud secrets), injected at runtime — never in images, env files in git, or code. (Homelab tier: secrets in gitignored secrets/, never committed — consistent with repo policy.)
Short-lived, rotated DB credentials; per-service least-privilege DB roles. Key/secret rotation runbooks.

6. Transport & network security

TLS 1.2+ everywhere, HSTS, modern cipher suites; TLS terminated at the edge (Cloudflare/Caddy). v1: Caddy on netops01 auto-manages certs for *.cognitivecoefficient.com.
WAF + rate limiting + bot/DDoS at the edge; API gateway adds per-key/tier quotas (Redis token bucket) — abuse and scraping protection for the valuable data.
Network segmentation: services in private subnets; only the gateway/edge is public. mTLS within the service mesh (target). DB/warehouse never internet-exposed.
CORS locked to first-party origins for cookie-auth routes; API-key routes are origin-agnostic but key-bound.

7. Application security

Input validation via Pydantic v2 on every endpoint; parameterized queries only (no string-built SQL), which closes off the primary SQL-injection vector.
Standard headers: CSP, X-Content-Type-Options, X-Frame-Options/frame-ancestors, Referrer-Policy.
Dependency scanning (Dependabot/Trivy), image signing (cosign), SAST in CI, periodic pen-tests. Pinned, scanned base images (distroless).
GraphQL query-cost/depth limits per tier to prevent expensive-query DoS.

8. Audit & compliance

Append-only audit_log for security-relevant events: logins, key create/revoke, exports, assessment access/delete, admin actions, tier changes — with actor, org_id, IP, UA, timestamp.
Centralized, tamper-evident log shipping; retention per compliance needs.
Compliance targets: SOC 2 Type II (Enterprise/Government readiness), GDPR/CCPA (PII), and government-tenant requirements (e.g. FedRAMP-style controls, data residency) as the customer base warrants.
Incident-response runbook, breach-notification process, and regular access reviews (stale keys, dormant accounts).
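One way to make an append-only log tamper-evident is to chain entry hashes, so that editing any past event invalidates everything after it. The sketch below assumes that technique; the document does not say how audit_log achieves tamper evidence, and the entry layout and function names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64     # placeholder hash for the first entry's predecessor

def _entry_hash(prev_hash, event):
    # Canonical JSON so the same event always hashes identically.
    body = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode()).hexdigest()

def append_event(log, event):
    """Append an event whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev_hash": prev_hash,
                "hash": _entry_hash(prev_hash, event)})

def verify_chain(log):
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True

audit_log = []
append_event(audit_log, {"action": "key.create", "actor": "user_1", "org_id": "org_a"})
append_event(audit_log, {"action": "export.create", "actor": "user_1", "org_id": "org_a"})
print(verify_chain(audit_log))
```

Periodically anchoring the latest hash in a separate system would make wholesale rewrites of the chain detectable as well.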

Roadmap

CognitiveCoefficient — Roadmap (v1 → global standard)

Each phase is gated by a concrete, demonstrable milestone. The architecture (consolidated FastAPI → cloud-native mesh) is decomposed in step with these phases, never ahead of need.

---

Phase 0 — v1 Live (shipped) ✅

Consolidated FastAPI app serving the terminal (server-rendered pages + static charts) and the JSON API at www.cognitivecoefficient.com (Caddy/TLS on netops01).
DuckDB + curated Parquet/CSV (app/data/) as the analytical engine; offline collector scripts for the seed sources; precomputed indices + forecast artifacts in app/ml.
Core endpoints live (/country /state /city /company /industry /occupation /index /forecast /scenario /assess /career /simulate), API-key auth, Free/Pro tier gating.
Exit: real data for a credible seed set of countries/industries/occupations, all routes 200, public indices browseable.

Phase 1 — Data depth & credibility (0–3 mo)

Full ETL coverage: all 13 sources ingesting; entity-resolution crosswalks; validation gates (Great Expectations) and anomaly/confidence scoring.
Methodology v1 published (transparent weights, versioned) — the credibility cornerstone.
Postgres for OLTP; user accounts, API keys, billing (Stripe) live; Pro tier self-serve.
Exit: defensible, sourced, reproducible indices across ≥150 countries + subnationals + key industries/occupations; methodology whitepaper out.

Phase 2 — Product depth & the terminal (3–6 mo)

Next.js frontend replacing server pages: geo explorer (deck.gl maps), forecast fans, scenario sliders, cmd-k terminal, watchlists/alerts.
Forecasting matured: full model championship (Bass/logistic/Gompertz/Prophet/XGBoost/PyMC), backtest cards, uncertainty bands everywhere.
Scenario engine GA (parametric + Monte-Carlo); personal assessment + career recommendations polished.
Exit: a genuinely "Bloomberg-terminal-feeling" product; Pro conversions; first API customers.

Phase 3 — Scale & platform hardening (6–12 mo)

Migrate analytical serving to ClickHouse; lakehouse on object store (Parquet/Iceberg); Airflow orchestration replaces cron; MLflow model registry.
Carve out compute/PII-heavy services (forecast-, scenario-, assessment-svc); API gateway + rate-limit/billing externalized; GraphQL gateway.
K8s + Terraform + ArgoCD; observability (OTel/Prometheus/Grafana/Sentry); SOC 2 Type II readiness; PII controls (encryption, RLS, erasure) audited.
Exit: sub-100ms reads at scale, multi-tenant isolation proven, first Enterprise contracts and an API business line.

Phase 4 — Authority & the standard (12–24 mo)

University + Government tiers: citable methodology (DOIs), policy-scenario modeling, subnational/city depth, on-prem/region-locked options.
Annual flagship "State of the Intelligence Economy" report + live index (the OECD-AI-Index / Stanford-AI-Index analog) — press, citations, authority.
Bayesian scenario tier GA; cross-source rigor; data-quality SLAs.
Exit: CC cited by researchers, journalists, and at least one government/IGO using it for policy; recognized methodology.

Phase 5 — Global reference / Bloomberg-of-AI (24 mo+)

Network-effect flywheel: free public index drives authority → Enterprise/Gov pay to align to "the CC standard"; API embedded across fintech/HR-tech/edtech.
Real-time/continuous indicators (HF/GitHub/arXiv live feeds), expanded coverage (every country + major subnationals + thousands of companies/occupations), partner data marketplace.
Multi-region deployment, full compliance (SOC 2, GDPR/CCPA, gov frameworks), white-label/embedded offerings.
North star: when a strategist, minister, or investor wants the authoritative read on the intelligence economy, CognitiveCoefficient is the default — the Bloomberg Terminal / OECD AI Index / World Bank AI Observatory of machine intelligence, in one platform.