CognitiveCoefficient

Architecture & Engineering Deliverables

The target cloud-native platform and the consolidated v1 actually shipped, with the migration path — plus the data model, ETL, ML/forecasting, API, security, and roadmap.

v1 ships as a single, self-contained FastAPI service computing every index and forecast in transparent Python. The distributed stack below (Airflow, ClickHouse, K8s, Bayesian ML) is the documented scaling path.

System Architecture

CognitiveCoefficient — System Architecture

> The Bloomberg Terminal / OECD.AI / World Bank Observatory for the intelligence economy.
>
> CognitiveCoefficient (CC) measures, indexes, and forecasts the diffusion of machine intelligence across countries, subnational regions, cities, companies, industries, and occupations, and lets users assess their own exposure and simulate scenarios.

This document describes (A) the full-scale target architecture, (B) the pragmatic v1 consolidated architecture actually shipped, and (C) the migration path between them.

---

A. Target Architecture (full scale, cloud-native)

A.0 Architectural principles

  1. Read-mostly, analytics-heavy. The product is a data observatory; 95%+ of traffic is reads over precomputed indices, time series, and forecasts. Architecture optimizes for cheap fan-out reads and batch recompute, not OLTP write throughput.
  2. Warehouse-of-record, serving-cache-of-convenience. A columnar analytical warehouse is the source of analytical truth; Postgres holds only relational/operational state (users, keys, billing, saved scenarios, metadata catalog).
  3. Reproducible indices. Every published index value is a pure function of (raw indicator values × methodology version × weights). Recompute is deterministic and versioned; a value can always be re-derived and audited.
  4. Methodology as code + as data. Index definitions, weights, normalization rules, and forecast model configs are versioned artifacts, not ad-hoc SQL.
  5. Tiered access from day one. Free/Pro/Enterprise/API/Government/University tiers are enforced at the gateway, not bolted on later.

A.1 Logical layers

                         ┌──────────────────────────────────────────────┐
   Browser / Terminal ──▶│  Edge: CDN + WAF + TLS (Cloudflare/Fastly)    │
                         └───────────────┬──────────────────────────────┘
                                         ▼
            ┌────────────────────────────────────────────────────────┐
 Frontend   │  Next.js / React / TypeScript (App Router, RSC, SSR/ISR)│
 (Vercel or │  - Terminal UI, map/geo explorer, charting (visx/ECharts)│
  K8s pods) │  - Career/assessment wizard, scenario simulator UI       │
            └───────────────┬───────────────────────┬─────────────────┘
                            ▼                        ▼
            ┌───────────────────────────┐  ┌────────────────────────────┐
 API Gateway│ Kong / APISIX / Envoy     │  │ GraphQL Gateway (Apollo/    │
            │ - authN (JWT, API keys)   │  │ Strawberry) federated        │
            │ - rate limit, quotas,     │  └────────────────────────────┘
            │   tier routing, billing   │
            └───────────────┬───────────┘
                            ▼
   ┌────────────────────────────────────────────────────────────────────┐
   │ Service mesh (Python / FastAPI microservices, async, Pydantic v2)  │
   │                                                                     │
   │  geo-svc        company-svc     industry-svc    occupation-svc      │
   │  index-svc      forecast-svc    scenario-svc    assessment-svc      │
   │  career-svc     export-svc      catalog-svc     auth/billing-svc    │
   └───────┬───────────────────┬────────────────────────┬───────────────┘
           ▼                   ▼                        ▼
 ┌──────────────────┐ ┌────────────────────┐ ┌───────────────────────────┐
 │ Postgres (OLTP)  │ │ ClickHouse (OLAP   │ │ DuckDB (embedded analytics│
 │ users, api_keys, │ │ serving): indicator│ │ + per-request slices,     │
 │ billing, scenarios│ │ _values, indices,  │ │ notebook/export jobs,     │
 │ metadata catalog │ │ forecasts, panels  │ │ small-table joins)        │
 └──────────────────┘ └────────────────────┘ └───────────────────────────┘
           ▲                   ▲                        ▲
           │                   │   Lakehouse (Parquet on S3/R2, Iceberg) │
           │                   └─────────────┬──────────┘                │
           │                                 ▼                           │
   ┌────────────────────────────────────────────────────────────────────┐
   │ ETL / orchestration: Apache Airflow (KubernetesExecutor)            │
   │  - 40+ source DAGs (World Bank, IMF, OECD, BLS, OECD.AI, AI Index,  │
   │    HuggingFace, GitHub, arXiv, O*NET, WIPO, Top500…)                │
   │  - dbt (warehouse transforms), Great Expectations (validation)     │
   │  - model training/scoring DAGs (Prophet/XGBoost/PyMC)              │
   └────────────────────────────────────────────────────────────────────┘

A.2 Frontend

  • Next.js 14+ (App Router) in TypeScript. React Server Components for data-dense pages (country/company/industry profiles render server-side against the API), ISR for index/leaderboard pages (revalidate on each ETL publish), client components for interactive widgets.
  • Visualization: ECharts + visx for time series/forecast fans; deck.gl/MapLibre for the geo explorer (countries → subnationals → cities choropleths); a "terminal" command palette (cmd-k) for power users à la Bloomberg.
  • State/data fetching: TanStack Query against the REST API; SWR for the GraphQL gateway. Auth via NextAuth/Auth.js with the platform's own OIDC.
  • Design system: shared component library, dark "terminal" theme primary.

A.3 Backend services (FastAPI/Python)

Each domain is an independently deployable FastAPI service (Python 3.12, async, Pydantic v2 models, Uvicorn/Gunicorn workers, served behind the gateway):

| Service | Responsibility |
|---|---|
| geo-svc | country / state-province / city entities, profiles, rankings |
| company-svc | company profiles, AI-adoption indicators, peer sets |
| industry-svc | industry (NAICS/ISIC) panels, exposure scores |
| occupation-svc | occupation (O*NET/SOC) automation/augmentation exposure |
| index-svc | index definitions, weights, on-the-fly + materialized index values |
| forecast-svc | serves model forecasts + uncertainty bands; triggers retrains |
| scenario-svc | parametric + Monte-Carlo scenario simulation engine |
| assessment-svc | personal CC-exposure assessments (PII-bearing) |
| career-svc | career path / reskilling recommendations |
| export-svc | bulk CSV/Parquet/JSON exports, async export jobs |
| catalog-svc | data dictionary, source provenance, methodology registry |
| auth/billing-svc | tenants, API keys, tiers, quotas, Stripe metering |

Cross-cutting: shared cc-core Python package (Pydantic schemas, warehouse client, auth middleware, telemetry).

A.4 Data stores

  • Postgres 16 (managed, e.g. RDS/Cloud SQL/Neon) — OLTP: users, api_keys, subscriptions, scenarios (saved), assessments metadata, audit_log, and the metadata catalog (indicator definitions, sources, methodology versions). Logical replication to the warehouse for catalog joins.
  • ClickHouse (primary OLAP serving store): indicator_values, indices, index_values, forecasts, wide country/industry/occupation panels. MergeTree tables partitioned by (entity_level, year); materialized views precompute rankings and YoY deltas. Powers sub-100ms reads for the terminal.
  • DuckDB — embedded engine inside export-svc and analyst/notebook workloads; reads Parquet directly from the lakehouse for ad-hoc joins and per-request scenario slices without hitting ClickHouse.
  • Lakehouse — Parquet on S3/R2 with Apache Iceberg table format; raw → staged → curated zones. The immutable raw zone is the audit substrate for reproducibility.
  • Redis — gateway rate-limit counters, hot-response cache, Celery/RQ broker for async jobs.

A.5 Warehouse abstraction

A single WarehouseClient interface (in cc-core) abstracts the analytical backend so services are storage-agnostic:

from typing import Protocol
import pyarrow

class WarehouseClient(Protocol):
    def query(self, sql: str, params: dict) -> pyarrow.Table: ...
    def panel(self, entity, indicators, years) -> pyarrow.Table: ...

Implementations: DuckDBClient (v1, embedded), ClickHouseClient (scale), LakehouseClient (Parquet/Iceberg). The same SQL dialect surface (ANSI subset + standard window funcs) runs across DuckDB and ClickHouse, so the migration is a config flip, not a rewrite.
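As a rough illustration of the storage-agnostic idea, the sketch below defines a trimmed-down protocol and one implementation behind it. It is not the cc-core code: `RowClient`, `SqliteClient`, `top_entities`, and the sample rows are hypothetical, stdlib `sqlite3` stands in for DuckDB, and rows come back as plain dicts instead of `pyarrow.Table` so the example runs without third-party packages.

```python
import sqlite3
from typing import Protocol


class RowClient(Protocol):
    """Hypothetical, trimmed-down analogue of WarehouseClient (dict rows, not pyarrow)."""

    def query(self, sql: str, params: dict) -> list: ...


class SqliteClient:
    """Stand-in backend; a DuckDB or ClickHouse client would expose the same method."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.row_factory = sqlite3.Row

    def query(self, sql: str, params: dict) -> list:
        # Named parameters (:name) keep values out of the SQL string.
        cur = self._conn.execute(sql, params)
        return [dict(row) for row in cur.fetchall()]


def top_entities(client: RowClient, indicator_id: str, year: int) -> list:
    """Service code depends only on the protocol, never on a concrete engine."""
    return client.query(
        "SELECT entity_id, value FROM indicator_values "
        "WHERE indicator_id = :ind AND year = :yr ORDER BY value DESC",
        {"ind": indicator_id, "yr": year},
    )


# Made-up sample rows, only to exercise the interface.
client = SqliteClient()
client.query("CREATE TABLE indicator_values (indicator_id, entity_id, year, value)", {})
for row in [("ai_patents_per_capita", "USA", 2023, 4.0),
            ("ai_patents_per_capita", "DEU", 2023, 2.5)]:
    client.query("INSERT INTO indicator_values VALUES (:i, :e, :y, :v)", dict(zip("ieyv", row)))

ranked = top_entities(client, "ai_patents_per_capita", 2023)
```

Swapping engines then means constructing a different client; `top_entities` itself does not change, provided the SQL stays inside the shared dialect subset.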

A.6 ETL / orchestration

  • Apache Airflow (KubernetesExecutor) runs one DAG per source plus transform/index/forecast DAGs (see the ETL / Data-Collection Engine section).
  • dbt for warehouse transforms (raw → staging → marts), Great Expectations for validation gates, Soda/anomaly jobs for drift.

A.7 Infrastructure & platform

  • Containers: every service is a Docker image (distroless/slim Python base), built in CI, scanned (Trivy), signed (cosign).
  • Orchestration: Kubernetes (EKS/GKE or k3s on the homelab tier). HPA on the read services; KEDA scales Airflow workers on queue depth.
  • IaC: Terraform for cloud (VPC, managed PG, object storage, ClickHouse Cloud/Altinity, DNS, secrets) + Helm charts per service. GitOps via ArgoCD.
  • CI/CD: GitHub Actions → build/test/scan → push → ArgoCD sync. Blue/green for the gateway, rolling for services.
  • Observability: OpenTelemetry traces → Tempo/Jaeger; Prometheus + Grafana metrics; Loki logs; Sentry for app errors; index-svc/forecast-svc emit data-freshness and model-drift metrics.
  • Environments: dev (homelab/k3s), staging, prod.

---

B. v1 Consolidated Architecture (what actually shipped)

The deployed v1 is a single consolidated FastAPI application — deliberately one process, one repo, one deploy. It is the pragmatic compression of the target diagram into a monolith that preserves the seams needed to split later.

cognitivecoefficient/
└── app/
    ├── main.py              # FastAPI app: mounts routers, static, pages
    ├── pages/               # server-rendered Jinja/HTML pages (the "terminal")
    ├── static/
    │   ├── css/  js/  vendor/   # frontend assets, charting vendored client-side
    ├── data/                # DuckDB file(s) + curated Parquet/CSV snapshots
    ├── routers/             # /country /state /city /company /industry
    │                        # /occupation /index /forecast /scenario
    │                        # /assess /career /simulate
    ├── services/            # geo, index, forecast, scenario, assessment logic
    │                        #  = the future microservice boundaries, as modules
    ├── models/              # Pydantic schemas (shared with future cc-core)
    └── ml/                  # fitted forecast models + scenario engine

Key v1 properties:

  • One FastAPI process serves both the HTML pages (app/pages, app/static) and the JSON API (app/routers/*). No separate Next.js frontend yet — the "terminal" is server-rendered HTML + vendored JS charts. This is the app/{pages,static} layout actually on disk.
  • DuckDB as the only analytical engine, embedded, reading the curated Parquet/CSV in app/data/. No ClickHouse, no separate warehouse cluster. Indices and forecasts are precomputed offline and shipped as data files; the API serves slices of them.
  • SQLite or a small Postgres instance for operational state (users, api_keys, saved scenarios): a single relational DB. In the smallest deploy this can be a DuckDB/SQLite file, with a clean swap to Postgres.
  • ETL is offline batch scripts, not Airflow — Python collectors run on a schedule (cron/ctask) producing the Parquet snapshots committed/synced into app/data/. The DAG logic exists as ordered functions, ready to lift into Airflow tasks.
  • ML is "train offline, serve artifacts." Prophet/XGBoost/PyMC models are fit in notebooks/batch, serialized, and loaded by app/ml; the /forecast and /simulate endpoints serve precomputed results plus light on-request computation (Monte-Carlo runs are capped to bound latency).
  • Deploy: single container (or static-export + FastAPI) behind Caddy on netops01 at www.cognitivecoefficient.com (apex → www redirect, TLS terminated by Caddy). One image, docker compose up.
  • Auth/tiers: API-key middleware in-process; tier limits enforced with Redis or an in-memory/SQLite token bucket.
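The in-memory token bucket mentioned in the last bullet could look roughly like the sketch below. It is an illustration under assumptions, not the shipped middleware: the `TIER_RPM` limits are invented, and `check_key` is a hypothetical helper that a FastAPI dependency might call before routing a request.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical per-tier limits (requests per minute); real values are not given in the text.
TIER_RPM = {"free": 60, "pro": 600, "enterprise": 6000}


@dataclass
class TokenBucket:
    capacity: float        # maximum burst size
    refill_per_sec: float  # steady-state refill rate
    tokens: float = field(init=False)
    updated: float = field(init=False)

    def __post_init__(self) -> None:
        self.tokens = self.capacity
        self.updated = time.monotonic()

    def allow(self, now: Optional[float] = None) -> bool:
        """Refill for the elapsed time, then spend one token if one is available."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


_buckets: dict = {}


def check_key(api_key_id: str, tier: str, now: Optional[float] = None) -> bool:
    """Return True if this request fits within the key's tier limit."""
    bucket = _buckets.get(api_key_id)
    if bucket is None:
        rpm = TIER_RPM[tier]
        bucket = _buckets[api_key_id] = TokenBucket(rpm, rpm / 60.0)
    return bucket.allow(now)
```

Per-process state like `_buckets` only works while there is a single instance; once several replicas run, the counters would need to move to Redis, as the bullet above suggests.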

This is enough to be live, correct, and credibly "Bloomberg-for-AI" on day one, at a fraction of the operational cost.

---

C. Migration Path (v1 → target)

The consolidation was designed to decompose along pre-cut seams, in dependency order:

| Step | Move | Trigger | How (because seams already exist) |
|---|---|---|---|
| 1 | Postgres for OLTP | first paying tenants / multi-instance | Swap the relational DB URL; models/ Pydantic + repo layer already DB-agnostic. |
| 2 | Split the frontend to Next.js | need rich interactive terminal/maps | Stand up Next.js consuming the same JSON routers; retire app/pages. API contract unchanged. |
| 3 | Airflow for ETL | sources/schedules outgrow cron | Wrap existing collector functions as Airflow tasks; output still Parquet → lakehouse. |
| 4 | Lakehouse (Parquet/Iceberg on object store) | data volume / reproducibility/audit needs | Point the warehouse client's raw/curated paths at S3/R2 instead of app/data/. |
| 5 | ClickHouse serving | read latency/concurrency at scale | Implement ClickHouseClient behind the existing WarehouseClient interface; flip config per env. DuckDB stays for exports. |
| 6 | Carve out services (forecast-svc, scenario-svc, assessment-svc first: the compute/PII-heavy ones) | independent scaling / isolation (esp. PII) | Each services/<x> module + its router becomes a FastAPI service importing cc-core. Gateway routes by path. |
| 7 | API Gateway + GraphQL | external API customers, federation | Move key/tier/rate-limit out of the monolith into Kong/APISIX; add Apollo/Strawberry GraphQL over the REST services. |
| 8 | K8s + Terraform + ArgoCD | reliability/SLA commitments | Containers already exist; add Helm/Terraform/GitOps. |

Invariant across the whole path: the public API contract (/country, /index, /forecast, …), the Pydantic schemas, and the WarehouseClient interface never change. v1 monolith and target mesh are the same API over a swappable substrate — which is what makes "consolidated FastAPI now, cloud-native later" a refactor rather than a rebuild.

Data Model

CognitiveCoefficient — Data Model

The data model has three concerns: (1) the analytical core (geographies, organizations, work, indicators, indices, forecasts, scenarios) — read-mostly, columnar-friendly; (2) operational state (users, api_keys, billing, saved scenarios) — relational/OLTP; (3) the metadata catalog (sources, methodology, provenance) — the glue that makes every value auditable.

Naming: snake_case; surrogate id keys (ULID/UUIDv7) for OLTP, stable natural codes (ISO/SOC/NAICS) for analytical dimensions. All fact tables carry source_id, methodology_version, and ingested_at for provenance.

---

1. Geographic dimension

countries

| field | type | notes |
|---|---|---|
| country_id | text PK | ISO 3166-1 alpha-3 (USA, DEU) |
| iso2 | text | ISO 3166-1 alpha-2 |
| name | text | canonical English name |
| region | text | UN/World Bank region |
| subregion | text | |
| income_group | text | World Bank: Low/Lower-mid/Upper-mid/High |
| oecd_member | bool | |
| population | bigint | latest |
| gdp_usd | numeric | latest, nominal |
| gdp_per_capita_ppp | numeric | |
| centroid_geom | geometry | for map rendering |
| cc_rank | int | current CognitiveCoefficient global rank |

subnationals (states/provinces)

| field | type | notes |
|---|---|---|
| subnational_id | text PK | ISO 3166-2 (US-CA, DE-BY) |
| country_id | text FK → countries | |
| name, type | text | "state", "province", "region" |
| population, gdp_usd | numeric | |
| cc_rank_national | int | rank within country |

cities

| field | type | notes |
|---|---|---|
| city_id | text PK | GeoNames id or country-slug |
| subnational_id | text FK | nullable |
| country_id | text FK | |
| name, metro_name | text | |
| population, lat, lon | numeric | |
| is_tech_hub | bool | |
| cc_rank_global | int | city-level CC rank |

---

2. Organization & work dimensions

companies

| field | type | notes |
|---|---|---|
| company_id | text PK | ULID; ticker/lei as natural keys |
| name, legal_name | text | |
| ticker, lei, domain | text | identifiers |
| hq_country_id, hq_city_id | FK | |
| industry_id | FK → industries | primary |
| employee_count, market_cap_usd | numeric | |
| founded_year | int | |
| is_ai_native | bool | |
| ai_adoption_score | numeric | 0–100 composite |
| cc_rank_industry | int | |

industries

| field | type | notes |
|---|---|---|
| industry_id | text PK | NAICS or ISIC code |
| code_system | text | NAICS / ISIC |
| name, sector | text | |
| parent_industry_id | text FK | self-ref hierarchy |
| level | int | 2/3/4-digit depth |
| automation_exposure | numeric | 0–1 |
| augmentation_potential | numeric | 0–1 |
| cc_industry_index | numeric | |

occupations

| field | type | notes |
|---|---|---|
| occupation_id | text PK | O*NET-SOC code (15-1252.00) |
| soc_code | text | 6-digit SOC |
| title, description | text | |
| job_zone | int | O*NET 1–5 |
| median_wage_usd | numeric | BLS OEWS |
| employment | bigint | BLS |
| automation_exposure | numeric | 0–1 (task-level rollup) |
| augmentation_exposure | numeric | 0–1 |
| ai_complementarity | numeric | -1..1 (substitute↔complement) |
| cc_exposure_index | numeric | headline per-occupation CC |
| growth_outlook | text | BLS projection band |

occupation_tasks (bridge): occupation_id, task_id, task_desc, importance, ai_exposure — the task-level substrate that rolls up into occupation exposure.

---

3. Indicators (the raw measurement layer)

indicators (definition / dictionary)

| field | type | notes |
|---|---|---|
| indicator_id | text PK | e.g. ai_patents_per_capita |
| name, description | text | |
| unit | text | %, USD, count, index |
| entity_level | text | country / subnational / city / company / industry / occupation |
| source_id | FK → sources | primary source |
| direction | text | higher_better / lower_better |
| frequency | text | annual / quarterly / monthly |
| min_year, max_year | int | coverage |
| methodology_url | text | |

indicator_values (the fact table — largest)

| field | type | notes |
|---|---|---|
| indicator_id | FK | |
| entity_level | text | partition key |
| entity_id | text | resolves to the right dimension by level |
| year | int | (or period for sub-annual) |
| value | numeric | raw measured value |
| value_normalized | numeric | 0–100 min-max/z-score (per methodology) |
| source_id | FK | |
| methodology_version | text | |
| confidence | numeric | 0–1 data-quality score |
| is_imputed | bool | flag for filled gaps |
| ingested_at | timestamp | |

PK: (indicator_id, entity_level, entity_id, year, methodology_version). In ClickHouse: MergeTree PARTITION BY (entity_level, year) ORDER BY (indicator_id, entity_id).

---

4. Indices (the published products)

indices (index definitions)

| field | type | notes |
|---|---|---|
| index_id | text PK | cc_global, cc_industry, cc_occupation |
| name, description | text | |
| entity_level | text | what it ranks |
| methodology_version | text | current published version |
| pillars | jsonb | named sub-pillars |
| weights | jsonb | {indicator_id: weight} |
| normalization | text | minmax / zscore / rank |
| aggregation | text | weighted_mean / geometric |
| published_at | timestamp | |

index_values

| field | type | notes |
|---|---|---|
| index_id | FK | |
| entity_level, entity_id | | |
| year | int | |
| score | numeric | 0–100 headline |
| rank_global, rank_peer | int | |
| percentile | numeric | |
| pillar_scores | jsonb | breakdown |
| methodology_version | text | reproducibility |
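To make the link between `indices.weights`, `indices.aggregation`, and `index_values.score` concrete, here is a minimal sketch of the two aggregation modes plus ranking. The function names are hypothetical, and renormalizing weights over the indicators that are present is an assumption of this sketch; the text does not say how missing indicators are handled.

```python
import math


def index_score(normalized: dict, weights: dict, aggregation: str = "weighted_mean") -> float:
    """Aggregate 0-100 normalized indicator values into one 0-100 score."""
    # Assumption: weights are renormalized over the indicators actually present.
    present = {k: w for k, w in weights.items() if k in normalized}
    total = sum(present.values())
    if total <= 0:
        raise ValueError("no weighted indicators available")
    if aggregation == "weighted_mean":
        return sum(normalized[k] * w for k, w in present.items()) / total
    if aggregation == "geometric":
        # Weighted geometric mean; a zero in any input drives the score to zero.
        if any(normalized[k] <= 0 for k in present):
            return 0.0
        return math.exp(sum(w * math.log(normalized[k]) for k, w in present.items()) / total)
    raise ValueError(f"unknown aggregation: {aggregation}")


def rank_scores(scores: dict) -> dict:
    """Competition ranking (1, 2, 2, 4): highest score is rank 1, ties share a rank."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks, prev_score, prev_rank = {}, None, 0
    for position, (entity_id, score) in enumerate(ordered, start=1):
        rank = prev_rank if score == prev_score else position
        ranks[entity_id] = rank
        prev_score, prev_rank = score, rank
    return ranks
```

Because the score depends only on the normalized inputs, the weights, and the aggregation rule, storing `methodology_version` next to each row is enough to re-derive it later. The geometric option penalizes unbalanced profiles more than the weighted mean does.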

---

5. Forecasts & scenarios

forecasts

| field | type | notes |
|---|---|---|
| forecast_id | text PK | |
| target_type | text | indicator / index |
| target_id | text | which indicator/index |
| entity_level, entity_id | | |
| model | text | prophet / xgboost / bass / logistic / gompertz / pymc |
| model_version | text | |
| horizon_years | int | |
| trained_at | timestamp | |
| backtest_mape | numeric | accuracy metadata |

forecast_values

| field | type | notes |
|---|---|---|
| forecast_id | FK | |
| year | int | future period |
| point | numeric | median/expected |
| p05, p25, p50, p75, p95 | numeric | uncertainty bands |
| is_historical | bool | fitted vs projected |

scenarios

| field | type | notes |
|---|---|---|
| scenario_id | text PK | ULID |
| owner_user_id | FK → users | null for system presets |
| name, description | text | |
| base_target_id | text | indicator/index forecast it perturbs |
| assumptions | jsonb | levers: adoption_rate, policy_shock, compute_growth, shock_year |
| method | text | parametric / monte_carlo / bayesian |
| n_draws | int | MC sample size |
| is_public | bool | |
| created_at | timestamp | |

scenario_results (cached runs): scenario_id, year, p05..p95, point, run_id, computed_at.

---

6. Operational state (OLTP / Postgres)

users

| field | type | notes |
|---|---|---|
| user_id | text PK | UUIDv7 |
| email | citext UNIQUE | |
| password_hash | text | argon2id (null if SSO-only) |
| display_name | text | |
| org_id | FK → orgs | tenant |
| tier | text | free / pro / enterprise / gov / university |
| role | text | member / admin / owner |
| email_verified_at, created_at, last_login_at | timestamp | |
| mfa_enabled | bool | |

orgs (tenants)

org_id PK, name, tier, seat_limit, data_residency, stripe_customer_id, created_at.

api_keys

| field | type | notes |
|---|---|---|
| api_key_id | text PK | |
| org_id, created_by_user_id | FK | |
| key_prefix | text | shown in UI (cc_live_ab12…) |
| key_hash | text | sha256/argon2 of full key; secret never stored |
| scopes | text[] | read:index, read:forecast, run:simulate, read:assess |
| tier | text | for rate-limit class |
| rate_limit_rpm | int | override |
| last_used_at, expires_at, revoked_at | timestamp | |
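The `key_prefix` / `key_hash` split can be sketched as follows. This is one plausible reading of the table, not the platform's actual key format: the 12-character prefix length, the `cc_<env>_` layout, and both function names are assumptions, and SHA-256 is used here because the key is a long random token (the table also allows argon2).

```python
import hashlib
import hmac
import secrets


def create_api_key(env: str = "live") -> tuple:
    """Return (full_key, key_prefix, key_hash); only the last two would be stored."""
    token = secrets.token_urlsafe(32)  # high-entropy random secret
    full_key = f"cc_{env}_{token}"
    key_prefix = full_key[:12]         # short, non-secret label shown in the UI
    key_hash = hashlib.sha256(full_key.encode()).hexdigest()
    return full_key, key_prefix, key_hash


def verify_api_key(presented: str, stored_hash: str) -> bool:
    """Hash the presented key and compare against the stored hash in constant time."""
    candidate = hashlib.sha256(presented.encode()).hexdigest()
    return hmac.compare_digest(candidate, stored_hash)
```

The full key is shown to the user once at creation; after that only the prefix (for display) and the hash (for verification) exist server-side.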

subscriptions

subscription_id, org_id, plan, status, stripe_subscription_id, current_period_end, metered_usage jsonb.

assessments (personal — PII-bearing, isolated)

| field | type | notes |
|---|---|---|
| assessment_id | text PK | |
| user_id | FK (nullable for anon) | |
| occupation_id | FK | self-reported role |
| inputs | jsonb (encrypted at rest) | skills, tasks, region (PII) |
| cc_personal_score | numeric | their exposure result |
| recommendations | jsonb | career-svc output |
| created_at | timestamp | |
| retention_expires_at | timestamp | TTL for auto-purge |

audit_log

audit_id, actor_user_id, org_id, action, target, ip, user_agent, at, metadata jsonb — append-only.

---

7. Metadata catalog (provenance)

sources

| field | type | notes |
|---|---|---|
| source_id | text PK | world_bank, oecd_ai, bls, onet, wipo, top500, hf, github, arxiv |
| name, org, url | text | |
| license | text | redistribution terms |
| update_frequency | text | |
| last_ingested_at | timestamp | freshness |
| reliability_tier | int | 1 (official stats) → 3 (scraped) |

methodology_versions

methodology_version PK, index_id, changelog, effective_date, weights snapshot, author — every published number references the exact methodology that produced it, enabling deterministic re-derivation and audit.

---

Entity-relationship summary

countries 1─* subnationals 1─* cities
countries 1─* companies   *─1 industries (self-hierarchy)
occupations 1─* occupation_tasks
indicators 1─* indicator_values ──┐
indices    1─* index_values       ├─ all FK source_id → sources
forecasts  1─* forecast_values    │              + methodology_version
scenarios  1─* scenario_results ──┘
orgs 1─* users 1─* api_keys ; orgs 1─* subscriptions
users 1─* scenarios ; users 1─* assessments(PII)
everything analytical → sources, methodology_versions  (provenance)

Data-Collection (ETL)

CognitiveCoefficient — ETL / Data-Collection Engine

The ETL engine is the moat: a continuously-updated, validated, normalized, fully-provenanced panel of the intelligence economy assembled from ~13 authoritative sources. The mantra: never trust a number without a source, a method, and a freshness stamp.

---

1. Sources

| Source | What it provides | Entity level | Cadence | Access |
|---|---|---|---|---|
| World Bank (WDI/Indicators API) | GDP, population, R&D %GDP, education, internet/infra | country/subnational | annual | REST/JSON |
| IMF (SDMX / IFS, WEO) | macro, productivity, GDP forecasts | country | quarterly/annual | SDMX |
| OECD (SDMX) | productivity, ICT, skills, R&D, business demography | country | annual | SDMX |
| OECD.AI Policy Observatory | AI policies, AI investment, AI jobs, compute | country | irregular | API/CSV |
| Stanford AI Index | model counts, training compute, papers, $ investment, benchmarks | country/global | annual | CSV/report |
| BLS (OEWS, Employment Projections) | wages, employment, occupation projections | occupation | annual | API/flat files |
| US Census / ACS | industry employment, business formation, demographics | subnational/city/industry | annual | API |
| O*NET | task/skill/ability content per occupation | occupation/task | semi-annual | DB download |
| WIPO (PATENTSTAT / IP stats) | AI patent filings/grants | country/company | annual | bulk/API |
| HuggingFace Hub | model/dataset counts, downloads, org activity | company/country (by org) | continuous | API |
| GitHub | AI-repo activity, stars, contributors, language trends | company/country | continuous | GraphQL API |
| arXiv | AI/ML paper volume, authorship, affiliations | country/company | continuous | OAI-PMH/API |
| Top500 / Green500 | supercomputer/compute capacity by country/org | country/company | semi-annual | scrape/CSV |

(Extensible: Crunchbase/PitchBook for funding, LinkedIn-style talent feeds, Epoch AI compute, etc. as licensing allows.)

---

2. Orchestration — Airflow (target)

One source DAG per provider, plus transform/index/forecast DAGs, on a global schedule:

world_bank_ingest        @yearly + monthly freshness check
imf_ingest               @quarterly
oecd_sdmx_ingest         @yearly
oecd_ai_ingest           @weekly
stanford_ai_index_ingest @yearly (+ manual on release)
bls_ingest               @yearly
census_acs_ingest        @yearly
onet_ingest              @monthly
wipo_ingest              @yearly
huggingface_ingest       @daily
github_ingest            @daily
arxiv_ingest             @daily
top500_ingest            @monthly
   │
   ▼ (all land raw → lakehouse raw zone)
transform_normalize_dag  @daily   (dbt: raw→staging→marts, normalization)
build_indices_dag        triggered on new marts (index-svc methodology)
train_forecasts_dag      triggered on new indices (ML DAGs)
publish_dag              loads ClickHouse serving tables + invalidates ISR cache
data_quality_dag         Great Expectations gates + anomaly scan (every run)

Each source DAG: extract → land_raw → validate → normalize → upsert_staging. KubernetesExecutor scales workers; KEDA on queue depth. Backfills are date-partitioned and idempotent. Secrets (API tokens) from the secrets manager, never in code.

v1 shipped: these are ordered Python collector scripts run by cron/ctask, each emitting curated Parquet/CSV into app/data/; the DAG topology above is the lift-and-shift target. Same extract/validate/normalize functions, wrapped as Airflow tasks later.
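The "ordered functions" shape of a v1 collector might look like the sketch below. Everything in it is illustrative: the payload is made up, the stage bodies are placeholders, and the validate stage simply drops bad rows to keep the example short, whereas the gate described in this document quarantines the whole batch.

```python
Rows = list  # a batch is a list of row dicts


def extract(_: Rows) -> Rows:
    # Hypothetical payload; a real collector would call the source API here.
    return [
        {"entity_id": "USA", "year": "2023", "value": "4.0"},
        {"entity_id": "DEU", "year": "2023", "value": "2.5"},
        {"entity_id": "", "year": "2023", "value": "1.0"},  # unresolved entity code
    ]


def validate(rows: Rows) -> Rows:
    # Simplification: filter out rows with a missing entity code.
    return [r for r in rows if r["entity_id"]]


def normalize(rows: Rows) -> Rows:
    # Cast strings to typed values so downstream code sees a consistent schema.
    return [{**r, "year": int(r["year"]), "value": float(r["value"])} for r in rows]


STAGES = [extract, validate, normalize]


def run_collector(stages=STAGES) -> Rows:
    """Run the stages in order, feeding each stage's output to the next."""
    rows: Rows = []
    for stage in stages:
        rows = stage(rows)
    return rows


curated = run_collector()
```

Since each stage is a plain function of its input batch, wrapping it as a scheduler task later should not require changing its body, only how it is invoked.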

---

3. Pipeline stages

3.1 Extract & land (raw zone)

Pull via each source's native protocol (REST/SDMX/OAI-PMH/GraphQL/bulk). Land raw, immutable, partitioned by source/ingest_date in object storage (Parquet) — this raw zone is the audit substrate and enables full re-derivation. Capture HTTP metadata, response hash, and row counts.
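A minimal sketch of the landing step, assuming a local directory in place of object storage and raw bytes in place of Parquet (both simplifications so that the example needs only the standard library). The `land_raw` name, the `source=…/ingest_date=…` path layout, and the manifest fields are hypothetical.

```python
import hashlib
import json
from datetime import date
from pathlib import Path


def land_raw(root: Path, source_id: str, ingest_date: date, payload: bytes, row_count: int) -> Path:
    """Write one raw batch plus a sidecar manifest; never overwrite an existing batch."""
    digest = hashlib.sha256(payload).hexdigest()
    part = root / f"source={source_id}" / f"ingest_date={ingest_date.isoformat()}"
    part.mkdir(parents=True, exist_ok=True)
    data_path = part / f"{digest[:16]}.bin"
    if data_path.exists():
        return data_path  # identical bytes already landed: re-runs are idempotent
    data_path.write_bytes(payload)
    manifest = {
        "source_id": source_id,
        "sha256": digest,       # response hash for later audit
        "row_count": row_count,
        "bytes": len(payload),
    }
    (part / f"{digest[:16]}.manifest.json").write_text(json.dumps(manifest, indent=2))
    return data_path
```

Naming the file after the content hash means a re-run with the same response is a no-op, while a changed response lands beside the old one instead of replacing it.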

3.2 Validate (gate)

Great Expectations suites per source: schema/type conformance, value ranges (e.g. shares ∈ [0,1], wages > 0), referential integrity (entity codes resolve to dimensions), expected row-count/coverage. A failed suite quarantines the batch (does not publish) and alerts — bad data never reaches the warehouse.
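The gate logic can be illustrated without Great Expectations. The sketch below is a plain-Python stand-in covering the check types listed above (type conformance, value range, referential integrity, row count); the function names and the `unit == "share"` convention are assumptions.

```python
def validate_batch(rows: list, known_entities: set, min_rows: int) -> list:
    """Return failure messages; an empty list means the batch may publish."""
    failures = []
    if len(rows) < min_rows:
        failures.append(f"row count {len(rows)} is below the expected minimum {min_rows}")
    for i, row in enumerate(rows):
        value = row.get("value")
        if not isinstance(value, (int, float)):
            failures.append(f"row {i}: value {value!r} is not numeric")
        elif row.get("unit") == "share" and not 0.0 <= value <= 1.0:
            failures.append(f"row {i}: share {value} is outside [0, 1]")
        if row.get("entity_id") not in known_entities:
            failures.append(f"row {i}: unknown entity {row.get('entity_id')!r}")
    return failures


def gate(rows: list, known_entities: set, min_rows: int) -> str:
    """All-or-nothing: any failure quarantines the whole batch."""
    return "quarantine" if validate_batch(rows, known_entities, min_rows) else "publish"
```

Returning the full list of failures, not just the first one, gives the alert enough detail to triage the batch in one pass.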

3.3 Normalize & conform

  • Entity resolution: map every source's geo/company/industry/occupation identifiers to canonical keys (ISO-3166, NAICS/ISIC, O*NET-SOC, internal company_id). Crosswalk tables maintained in the catalog (e.g. SOC↔O*NET, NAICS↔ISIC, GitHub-org↔company).
  • Unit & currency harmonization: convert to common units, deflate/PPP-adjust where needed.
  • Indicator normalization: min-max → 0–100 and/or z-score per methodology, with direction (higher/lower-better) applied. Output indicator_values.value + value_normalized.
  • Temporal alignment: annualize/interpolate sub-annual series to the common panel grid; flag interpolated points.
  • Imputation: model-based gap-fill (XGBoost) for sparse-but-needed indicators, always written with is_imputed = true and a confidence score.
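The indicator-normalization step above can be sketched as two small functions. The handling of a zero-spread series (everyone at 50, or a z-score of 0) and the use of the population standard deviation are assumptions of this sketch, not stated methodology.

```python
import statistics


def minmax_0_100(values: dict, direction: str = "higher_better") -> dict:
    """Scale raw values to 0-100 so that 100 is always the 'best' end."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {k: 50.0 for k in values}  # assumption: no spread maps to mid-scale
    scaled = {k: 100.0 * (v - lo) / (hi - lo) for k, v in values.items()}
    if direction == "lower_better":
        scaled = {k: 100.0 - s for k, s in scaled.items()}
    return scaled


def zscore(values: dict, direction: str = "higher_better") -> dict:
    """Standardize values; the sign is flipped for lower_better indicators."""
    mean = statistics.fmean(values.values())
    sd = statistics.pstdev(values.values())  # assumption: population std deviation
    if sd == 0:
        return {k: 0.0 for k in values}
    sign = -1.0 if direction == "lower_better" else 1.0
    return {k: sign * (v - mean) / sd for k, v in values.items()}
```

Applying `direction` during normalization means every `value_normalized` points the same way, so index aggregation never has to know which indicators are "lower is better".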

3.4 Load & publish

dbt builds staging→marts (the conformed panel); build_indices_dag computes index_values from current methodology_version; serving tables loaded to ClickHouse (v1: DuckDB files in app/data/); CDN/ISR cache invalidated so the terminal shows fresh numbers.

---

4. Data quality & anomaly detection

  • Validation gates (above) block structurally-bad batches.
  • Statistical anomaly detection on every refresh: year-over-year jumps beyond entity-specific thresholds, outliers vs. peer distribution (robust z / IQR), break-point detection on long series, and cross-source disagreement checks (e.g. World Bank vs. IMF GDP must agree within tolerance). Anomalies are flagged, not silently dropped — surfaced to a review queue and lower the confidence score on affected values.
  • Confidence scoring: every indicator_value carries confidence (0–1) derived from source reliability_tier, recency, imputation flag, and anomaly status — this propagates into index uncertainty.
  • Freshness SLAs: sources.last_ingested_at monitored; stale sources alert and the catalog/UI badge the data age. Lineage (raw → staging → mart → index → forecast) is tracked end-to-end so any published number traces to its raw rows, source, and methodology version.
  • Observability: each DAG emits row counts, null rates, validation pass/fail, and runtime to Prometheus/Grafana; failures page via the alerting channel.
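One way to implement the "outliers vs. peer distribution (robust z)" check is the median/MAD modified z-score, sketched below. The 0.6745 scaling and the 3.5 cutoff are the conventional choices for that statistic; the confidence penalty factor is purely hypothetical, since the document does not specify how much a flag lowers confidence.

```python
import statistics


def robust_z(values: dict) -> dict:
    """Modified z-score based on the median and the median absolute deviation (MAD)."""
    med = statistics.median(values.values())
    mad = statistics.median([abs(v - med) for v in values.values()])
    if mad == 0:
        return {k: 0.0 for k in values}  # no spread: nothing can be flagged
    return {k: 0.6745 * (v - med) / mad for k, v in values.items()}


def flag_outliers(values: dict, threshold: float = 3.5) -> set:
    """Entities whose robust z-score exceeds the threshold in absolute value."""
    return {k for k, z in robust_z(values).items() if abs(z) > threshold}


def apply_penalty(confidence: dict, flagged: set, factor: float = 0.5) -> dict:
    """Lower confidence on flagged values instead of dropping them (factor is hypothetical)."""
    return {k: c * factor if k in flagged else c for k, c in confidence.items()}
```

Median and MAD are used instead of mean and standard deviation so that the outlier being tested does not inflate the yardstick it is measured against.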

Forecasting & ML

CognitiveCoefficient — Forecasting & ML Architecture

The ML layer answers two questions the indices alone cannot: "where is each entity headed?" (forecasting) and "what if?" (scenario simulation). Both are first-class, uncertainty-quantified products, not decorations on a chart.

Design stance: technology diffusion is the right prior. AI adoption across countries/industries/occupations follows S-curves, so the model library is anchored on diffusion theory (Bass, logistic, Gompertz) and augmented with statistical/ML models (Prophet, XGBoost) and a Bayesian core (PyMC) for principled uncertainty.

---

1. Model library & roles

1.1 Diffusion / adoption curves (structural backbone)

  • Bass diffusion — the workhorse for adoption of a technology in a population (share of firms/workers using AI, share of tasks automated). Parameters: p (innovation/external influence), q (imitation/word-of-mouth), m (market potential / ceiling). Fit by NLS or Bayesian; gives interpretable, bounded, theory-grounded trajectories.
  • Logistic (Verhulst) — symmetric S-curve for saturating indicators (compute density, model-capability proxies) where inflection is near the midpoint.
  • Gompertz: asymmetric S-curve whose inflection falls at about 37% of the ceiling (m/e), so growth peaks early and the approach to saturation is a long tail. Better for capability/penetration metrics that decelerate late. We fit all three and select by backtest + information criteria per series.
  • Role: primary for bounded, saturating indicators and indices (adoption %, exposure share). Their parameters are also the levers the scenario engine perturbs.
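For reference, the three curve families can be written as closed-form functions of time. This is a sketch using the standard textbook parameterizations; the names `k`, `t0`, `b`, and `c` for the logistic and Gompertz parameters are this example's choice, and only `p`, `q`, `m` come from the text.

```python
import math


def bass(t: float, p: float, q: float, m: float) -> float:
    """Cumulative adoption at time t under the Bass model (t = 0 is launch)."""
    e = math.exp(-(p + q) * t)
    return m * (1.0 - e) / (1.0 + (q / p) * e)


def logistic(t: float, m: float, k: float, t0: float) -> float:
    """Symmetric S-curve with ceiling m, growth rate k, and midpoint t0."""
    return m / (1.0 + math.exp(-k * (t - t0)))


def gompertz(t: float, m: float, b: float, c: float) -> float:
    """Asymmetric S-curve with ceiling m, displacement b, and growth rate c."""
    return m * math.exp(-b * math.exp(-c * t))
```

Two properties are worth checking when fitting: the logistic passes through exactly m/2 at `t0`, while the Gompertz inflection sits at m/e (reached at t = ln(b)/c), which is what makes it the better choice for series that decelerate for a long time.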

1.2 Prophet

  • Role: fast, robust forecasting for indicators with trend + seasonality + holidays/regime shifts and irregular history (many official-stat series are annual/quarterly with gaps). Used as a strong baseline and for series without a clear saturation ceiling. Native uncertainty intervals; decomposable (trend/seasonal components shown in the terminal).

1.3 XGBoost (gradient-boosted trees)

  • Role: cross-sectional + panel forecasting where an entity's trajectory depends on covariates — e.g. predict a country's next-year CC pillar from its GDP, R&D spend, patent flow (WIPO), GitHub/HF activity, talent metrics. Captures nonlinear interactions across the rich indicator feature space. Trained on the panel (entity × year × indicators); quantile objective (reg:quantileerror) yields p05/p50/p95 directly. Also powers gap-filling/imputation of sparse indicators (with is_imputed flagged).
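To show what a quantile objective optimizes, the sketch below implements the pinball (quantile) loss in plain Python and finds the constant prediction that minimizes it. It does not call XGBoost and is not its implementation; it only illustrates why training one model per quantile level produces p05/p50/p95 bands.

```python
def pinball_loss(y_true: list, y_pred: list, alpha: float) -> float:
    """Mean pinball loss: under-prediction costs alpha, over-prediction costs 1 - alpha."""
    total = 0.0
    for y, yhat in zip(y_true, y_pred):
        diff = y - yhat
        total += alpha * diff if diff >= 0 else (alpha - 1.0) * diff
    return total / len(y_true)


def best_constant(y_true: list, alpha: float) -> float:
    """Brute-force the constant forecast that minimizes the pinball loss."""
    candidates = sorted(set(y_true))
    return min(candidates, key=lambda c: pinball_loss(y_true, [c] * len(y_true), alpha))
```

For the values 1 to 100, `best_constant` lands at the corresponding empirical quantile (around 5, 50, and 95 for alpha 0.05, 0.5, and 0.95), which is the behavior a tree ensemble trained on this loss approximates conditionally on its features.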

1.4 PyMC (Bayesian core)

  • Role: the principled-uncertainty engine. Three uses:
  1. Bayesian Bass/logistic — priors on p,q,m (informed by peer entities / hierarchical pooling), full posterior over curve parameters → posterior predictive trajectories with honest bands, even with short history.
  2. Hierarchical models — partial pooling across entities (countries within a region, occupations within a family) so data-poor entities borrow strength from data-rich peers.
  3. Scenario priors — analyst assumptions enter as priors/interventions, propagated to the posterior.
  • MCMC (NUTS) for inference; convergence reported (r_hat, ESS) and surfaced in /simulate responses.

1.5 Model selection

Per (target, entity) we maintain a small ensemble/championship: fit Bass/logistic/Gompertz/Prophet/XGBoost, backtest (rolling-origin), and pick by out-of-sample error (MAPE/MASE) with a structural-model tie-break (prefer diffusion models when fit is comparable, for interpretability and lever-ability). The chosen model + model_version is stored on every forecast row.
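The championship procedure can be sketched as a rolling-origin backtest plus a tie-break. The two toy forecasters (`naive`, `drift`) are stand-ins for the real model families, and the `tolerance` that defines "comparable" fit is a hypothetical parameter; the source does not give a value.

```python
def mape(actual: list, predicted: list) -> float:
    """Mean absolute percentage error (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def rolling_origin_mape(series: list, forecaster, min_train: int = 3, horizon: int = 1) -> float:
    """Refit at each origin on the history so far and score the next `horizon` points."""
    actual, predicted = [], []
    for origin in range(min_train, len(series) - horizon + 1):
        predicted.extend(forecaster(series[:origin], horizon))
        actual.extend(series[origin:origin + horizon])
    return mape(actual, predicted)


def naive(train: list, horizon: int) -> list:
    return [train[-1]] * horizon  # repeat the last observation


def drift(train: list, horizon: int) -> list:
    slope = (train[-1] - train[0]) / (len(train) - 1)
    return [train[-1] + slope * (h + 1) for h in range(horizon)]


def pick_champion(series: list, candidates: dict, structural=(), tolerance: float = 0.5):
    """Lowest backtest MAPE wins; structural models win ties within `tolerance` points."""
    scores = {name: rolling_origin_mape(series, f) for name, f in candidates.items()}
    best = min(scores.values())
    close = [n for n in scores if scores[n] - best <= tolerance]
    preferred = [n for n in close if n in structural] or close
    return min(preferred, key=scores.get), scores
```

Rolling-origin evaluation only ever scores a model on points that come after its training window, which is what keeps `backtest_mape` an out-of-sample figure.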

---

2. Scenario simulation

Three modes, exposed via /scenario and /simulate:

2.1 Parametric / deterministic

Analyst sets levers (adoption rate p,q, ceiling m, a policy_shock at year t, a compute_growth multiplier). Engine perturbs the fitted diffusion parameters and re-projects a single trajectory + analytically-propagated band. Fast, used for the interactive sliders in the terminal.

2.2 Monte-Carlo

Sample parameters/shocks from their distributions (n_draws 1k–50k by tier), propagate each through the curve, and aggregate to a fan (p05…p95). Captures interaction of multiple uncertain levers (e.g. uncertain q × stochastic policy shock). Vectorized (NumPy) for latency; large runs go async via the job queue and cache to scenario_results.
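
The Monte-Carlo mode can be sketched as follows. This is an illustration, not the shipped engine: it uses the standard library instead of vectorized NumPy, and the parameter distributions, the 50% shock probability, the ceiling, and the starting level are all assumptions chosen for the example.

```python
import random
from statistics import quantiles


def simulate_fan(n_draws, horizon, start_level=55.0, m=100.0, seed=0):
    """Draw uncertain Bass parameters, propagate each draw, aggregate to a fan."""
    rng = random.Random(seed)
    paths = []
    for _ in range(n_draws):
        p = max(rng.gauss(0.03, 0.005), 0.001)   # assumed uncertainty on p
        q = max(rng.gauss(0.45, 0.05), 0.0)      # assumed uncertainty on q
        shock = rng.random() < 0.5               # stochastic policy shock
        level, path = start_level, []
        for step in range(1, horizon + 1):
            q_eff = max(q - 0.1, 0.0) if shock and step >= 2 else q
            level += (p + q_eff * level / m) * (m - level)
            level = min(level, m)                # never exceed the ceiling
            path.append(level)
        paths.append(path)

    fan = []
    for step in range(horizon):
        # n=20 gives 19 cut points at 5%, 10%, ..., 95%.
        cuts = quantiles([path[step] for path in paths], n=20)
        fan.append({"step": step + 1, "p05": cuts[0], "p50": cuts[9], "p95": cuts[18]})
    return fan


fan = simulate_fan(n_draws=2000, horizon=8)
for row in fan:
    print(row["step"], round(row["p05"], 1), round(row["p50"], 1), round(row["p95"], 1))
```

Each draw carries its own combination of q and shock outcome, so the fan reflects their interaction and not two independent bands added together.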

2.3 Bayesian scenarios

The PyMC posterior is the scenario distribution: interventions are applied as do-operations (PyMC's pm.do), and the posterior predictive gives the counterfactual fan with full parameter uncertainty. This is the rigorous tier (Enterprise/Government) for policy analysis.

---

3. Uncertainty quantification (non-negotiable)

Every forecast/scenario ships bands, not just points:

  • Bayesian models → posterior predictive quantiles.
  • XGBoost → quantile regression (p05/p25/p50/p75/p95).
  • Prophet → built-in intervals (with mcmc_samples for full Bayesian intervals on key series).
  • Diffusion NLS → bootstrap/delta-method bands.

Bands are stored (forecast_values.p05..p95), rendered as fans in the UI, and returned in the API. We also publish backtest accuracy (backtest_mape) alongside each forecast so users can weight it — honesty about uncertainty is a product differentiator vs. point-estimate competitors.

---

4. Training, serving, and the v1 path

  • Target: training runs as Airflow DAGs (one per model-family/target) on a schedule after ETL publishes new data; artifacts (fitted params, serialized models, posterior summaries) versioned in a model registry (MLflow). forecast-svc loads artifacts and serves; scenario-svc runs Monte-Carlo/Bayesian on demand. Heavy training on GPU/CPU workers (the homelab 6×3090 tier is ideal for PyMC/XGBoost sweeps).
  • v1 shipped: train offline, serve artifacts. Models are fit in batch/notebooks, serialized into app/ml/, and the consolidated FastAPI app loads them at startup. /forecast serves precomputed forecast_values; /simulate runs bounded Monte-Carlo in-process (draw caps + async fallback) so latency stays sane without a separate service. Same model code, same artifacts — only the orchestration and serving topology change on the way to target.

---

5. Model monitoring & governance

  • Drift detection: PSI/KS tests on input indicator distributions per ETL run; when inputs drift beyond threshold, flag for retrain.
  • Forecast accuracy tracking: as actuals arrive, compute realized error vs. prior forecasts → rolling MAPE/MASE dashboards per model/entity; auto-demote a champion if a challenger beats it.
  • Anomaly guardrails: reject/quarantine forecasts that violate structural constraints (e.g. adoption > ceiling, negative counts).
  • Reproducibility: model_version + input methodology_version + raw-data snapshot pin every forecast; any published number can be re-derived.
  • Reporting: convergence diagnostics (r_hat, ESS) surfaced to users for Bayesian runs; backtest cards published in the catalog. Metrics flow to Prometheus/Grafana; retraining and demotions are logged to audit_log.
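
As an illustration of the drift check, the sketch below computes a Population Stability Index between a reference sample and a new ETL batch. The bin count, the epsilon floor, and the 0.25 retrain threshold (a commonly quoted rule of thumb) are assumptions here, since the document does not state the configured values; the data is synthetic.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index with bin edges taken from `expected` quantiles."""
    ordered = sorted(expected)
    edges = [ordered[int(len(ordered) * k / bins)] for k in range(1, bins)]

    def shares(values):
        counts = [0] * bins
        for v in values:
            counts[sum(v > edge for edge in edges)] += 1  # index = edges below v
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


reference = [i / 10 for i in range(1000)]   # synthetic indicator values
same_batch = list(reference)
shifted_batch = [v + 30.0 for v in reference]

RETRAIN_THRESHOLD = 0.25                    # assumed threshold
for name, batch in (("same", same_batch), ("shifted", shifted_batch)):
    score = psi(reference, batch)
    print(name, round(score, 3), "retrain" if score > RETRAIN_THRESHOLD else "ok")
```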

API Specification

CognitiveCoefficient API Specification

Base URL: https://api.cognitivecoefficient.com/v1

Content type: application/json. All timestamps ISO-8601 UTC. All money in USD unless noted.

The same API backs the web terminal (v1: served by the consolidated FastAPI app) and external customers.

---

Authentication

Two mechanisms:

  • API key (machine): header Authorization: Bearer cc_live_<key> or X-API-Key: cc_live_<key>. Keys are tier-scoped and scope-limited (read:index, read:forecast, run:simulate, read:assess…).
  • OIDC/JWT (browser/session): short-lived access token from the platform's auth service; refresh via cookie. Used by the web terminal.

Keys are shown once on creation; only a hash is stored. Test keys use the cc_test_ prefix and read frozen fixture data.
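
A minimal sketch of the show-once, hash-only key scheme. The choice of SHA-256 and the function names are assumptions (the document only says a hash is stored); a fast hash is a reasonable fit only because the keys are high-entropy random strings.

```python
import hashlib
import hmac
import secrets


def create_api_key(live=True):
    """Return (plaintext, key_hash). The plaintext is shown once and never stored."""
    prefix = "cc_live_" if live else "cc_test_"
    plaintext = prefix + secrets.token_urlsafe(32)
    key_hash = hashlib.sha256(plaintext.encode()).hexdigest()
    return plaintext, key_hash


def verify_api_key(presented, stored_hash):
    """Hash the presented key and compare in constant time."""
    candidate = hashlib.sha256(presented.encode()).hexdigest()
    return hmac.compare_digest(candidate, stored_hash)


plaintext, key_hash = create_api_key()
print(plaintext[:8], verify_api_key(plaintext, key_hash))
```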

401 unauthenticated · 403 scope/tier denied · 429 rate-limited

---

Tiers & rate limits

Tier           Requests/min  Daily cap  Concurrency  Forecast horizon  Bulk export             Scopes
Free           30            1,000      2            ≤3 yrs            none                    read:* (current year + 5y history)
Professional   120           50,000     10           ≤10 yrs           CSV ≤1M rows            + run:simulate, full history
API (metered)  600           per-plan   50           ≤15 yrs           Parquet/JSON            all read + simulate
Enterprise     2,000         custom     200          ≤25 yrs           full lakehouse export   all + read:assess (own tenant)
Government     custom        custom     custom       full              full                    all + subnational deep-dives
University     300           100,000    20           ≤15 yrs           CSV/Parquet (research)  all read + simulate

Rate-limit headers on every response:

X-RateLimit-Limit: 120
X-RateLimit-Remaining: 117
X-RateLimit-Reset: 1750700000
Retry-After: 12        # only on 429

Limits enforced at the gateway with a Redis token bucket; metered tiers also emit usage events to billing.
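
The token-bucket idea can be sketched in memory as below. This is a single-process illustration with an explicit clock; the gateway described above keeps the bucket state in Redis, where the refill-and-consume step would need to be atomic. The class name and the mapping of 120 requests/min to a refill of 2 tokens per second are assumptions.

```python
class TokenBucket:
    """In-memory token bucket; `now` is passed in so behaviour is deterministic."""

    def __init__(self, capacity, refill_per_sec, now=0.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.updated = now

    def allow(self, now, cost=1):
        """Refill for elapsed time, then try to spend `cost` tokens."""
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True, int(self.tokens)   # second value feeds X-RateLimit-Remaining
        return False, int(self.tokens)


# Professional tier: 120 requests/min, modelled as capacity 120 refilled at 2/sec.
bucket = TokenBucket(capacity=120, refill_per_sec=2.0)
burst = [bucket.allow(now=0.0)[0] for _ in range(121)]
print(burst.count(True), burst[-1], bucket.allow(now=1.0))
```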

Pagination: ?limit= (default 50, max 1000) + ?cursor=.

Filtering: ?year=, ?from=&to=, ?fields=.

Standard error envelope:

{ "error": { "code": "tier_forbidden", "message": "Forecast horizon >10y exceeds the Professional tier limit.", "request_id": "req_01J..." } }
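
As a sketch of how a horizon check might produce that envelope: the limits below are copied from the tier table, while the function name, the message wording, and the demo request id are hypothetical.

```python
# Maximum forecast horizon in years per tier, taken from the tier table.
TIER_MAX_HORIZON_YEARS = {
    "free": 3,
    "professional": 10,
    "api": 15,
    "enterprise": 25,
    "university": 15,
    "government": None,  # "full": no cap
}


def check_horizon(tier, horizon_years, request_id):
    """Return None when allowed, otherwise the standard error envelope."""
    limit = TIER_MAX_HORIZON_YEARS[tier]
    if limit is None or horizon_years <= limit:
        return None
    return {
        "error": {
            "code": "tier_forbidden",
            "message": (
                f"Forecast horizon {horizon_years}y exceeds the "
                f"{limit}y limit for tier '{tier}'."
            ),
            "request_id": request_id,
        }
    }


print(check_horizon("professional", 12, "req_demo"))
print(check_horizon("government", 40, "req_demo"))
```

A handler would return the envelope with HTTP 403, matching the "scope/tier denied" status listed under Authentication.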

---

Endpoints

Geographies

GET /country                       list + rank, filter by region/income_group
GET /country/{iso3}                profile + headline CC + pillars
GET /country/{iso3}/indicators     all indicator_values (year filters)
GET /country/{iso3}/forecast       forecast for cc_global (horizon by tier)
GET /state/{iso3166_2}             subnational profile (e.g. US-CA)
GET /state?country={iso3}          subnationals of a country, ranked
GET /city/{city_id}                city profile + rank
GET /city?country=&tech_hub=true   city search/leaderboard

Organizations & work

GET /company/{company_id}          company profile + ai_adoption_score
GET /company?industry=&country=    search / peer leaderboard
GET /industry/{code}               industry panel + exposure scores
GET /industry/{code}/occupations   occupations weighted into the industry
GET /occupation/{soc}              occupation exposure + tasks + wage/employment
GET /occupation?exposure_gt=0.6    screen occupations by exposure

Indices, forecasts, scenarios

GET /index                         list published indices + methodology_version
GET /index/{index_id}              definition, weights, pillars
GET /index/{index_id}/values       ranked index_values (entity_level, year)
GET /forecast/{target_id}          point + p05..p95 bands, model metadata
POST /scenario                     create a scenario (assumptions → run)
GET  /scenario/{scenario_id}       fetch a saved/preset scenario + results
POST /simulate                     run an ad-hoc scenario (no persistence)

Personal (PII-bearing, scope read:assess)

POST /assess                       personal CC-exposure assessment
GET  /assess/{assessment_id}       retrieve (owner only)
GET  /career?occupation=&region=   career/reskilling recommendations

Meta

GET /catalog/indicators            data dictionary
GET /catalog/sources               provenance + freshness
GET /export                        create async bulk export job
GET /export/{job_id}               poll export job → signed URL
GET /healthz  /version  /openapi.json

---

Example responses

GET /country/USA

{
  "country_id": "USA", "name": "United States", "region": "North America",
  "income_group": "High", "cc_global": {
    "score": 87.4, "rank_global": 1, "percentile": 99.6, "year": 2025,
    "methodology_version": "cc_global@2025.2",
    "pillars": { "research": 91.2, "talent": 84.0, "infrastructure": 88.7,
                 "adoption": 85.1, "policy": 79.3 }
  },
  "trend": { "2021": 79.1, "2022": 81.6, "2023": 83.9, "2024": 85.8, "2025": 87.4 },
  "sources": ["oecd_ai","stanford_ai_index","world_bank","wipo","github"]
}

GET /occupation/15-1252.00 (Software Developers)

{
  "occupation_id": "15-1252.00", "title": "Software Developers",
  "median_wage_usd": 132270, "employment": 1692100, "job_zone": 4,
  "cc_exposure": { "automation_exposure": 0.31, "augmentation_exposure": 0.78,
                   "ai_complementarity": 0.62, "cc_exposure_index": 58.3 },
  "interpretation": "High augmentation, low displacement — AI complements core tasks.",
  "growth_outlook": "Much faster than average",
  "top_exposed_tasks": [
    { "task": "Write/test/debug code", "ai_exposure": 0.84 },
    { "task": "Design software architecture", "ai_exposure": 0.41 }
  ]
}

POST /simulate

// request
{ "target_id": "cc_industry:5415", "method": "monte_carlo", "n_draws": 5000,
  "horizon_years": 8,
  "assumptions": { "adoption_curve": "bass", "p": 0.03, "q": 0.45,
                   "policy_shock": { "year": 2027, "delta": -0.1 } } }
// response
{ "target_id": "cc_industry:5415", "method": "monte_carlo", "n_draws": 5000,
  "trajectory": [
    { "year": 2026, "p50": 61.2, "p05": 58.1, "p95": 64.0 },
    { "year": 2030, "p50": 78.9, "p05": 70.4, "p95": 86.1 },
    { "year": 2033, "p50": 88.3, "p05": 79.0, "p95": 94.7 }
  ],
  "convergence": { "rhat_max": 1.01, "ess_min": 1820 },
  "model_version": "scenario_engine@1.4" }

POST /assess (PII)

// request
{ "occupation_soc": "43-3031.00", "region": "US-TX",
  "skills": ["bookkeeping","quickbooks","reconciliation"], "tenure_years": 8 }
// response
{ "assessment_id": "asm_01J...", "cc_personal_score": 71.5,
  "exposure_band": "high-automation",
  "drivers": [ {"factor":"routine data entry","weight":0.4},
               {"factor":"rules-based reconciliation","weight":0.3} ],
  "recommendations": [
    { "path": "FP&A / financial analyst", "skill_gap": ["modeling","sql"],
      "transition_difficulty": "moderate", "wage_delta_usd": 24000 }
  ],
  "retention_expires_at": "2026-12-23T00:00:00Z" }

---

GraphQL

A federated GraphQL gateway (https://api.cognitivecoefficient.com/graphql) sits over the same services for clients that need precise field selection and cross-entity joins in one round-trip — e.g. "countries in the OECD with their top-3 most AI-exposed industries and 2030 forecast bands." Same auth/tiers/rate-limits as REST; query cost analysis caps complexity per tier (Free 1k points, Enterprise 50k). Persisted queries required for Free tier.

query { country(iso3:"DEU") {
  name ccGlobal { score rank }
  industries(first:3, orderBy: EXPOSURE_DESC) {
    name automationExposure forecast(year:2030){ p50 p05 p95 } } } }

Bulk export

GET /export creates an async job (large extracts never block a request). Tiers gate format and volume: Free none; Pro CSV ≤1M rows; API Parquet/JSON; University CSV/Parquet (research); Enterprise/Government signed access to curated lakehouse partitions (Parquet/Iceberg) with row/column scoping by tenant. Clients poll GET /export/{job_id} until the job resolves to a time-limited signed URL; every export is recorded in audit_log with the requesting key and filter set.
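
One way to picture a time-limited, tenant-bound signed URL is an HMAC over the job, the tenant, and the expiry. This is a sketch under assumptions: the download path, query parameter names, and signing key are hypothetical, and in practice an object store's own presigned-URL mechanism may be used instead.

```python
import hashlib
import hmac

# Hypothetical key for the demo; a real key would come from the secrets manager.
SIGNING_KEY = b"demo-signing-key"


def _signature(job_id, org_id, expires_at):
    payload = f"{job_id}:{org_id}:{expires_at}".encode()
    return hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()


def sign_export_url(job_id, org_id, expires_at):
    """Build a download URL bound to one tenant and one expiry timestamp."""
    sig = _signature(job_id, org_id, expires_at)
    return f"/export/{job_id}/download?org={org_id}&expires={expires_at}&sig={sig}"


def verify_export_url(job_id, org_id, expires_at, sig, now):
    """Accept only an unexpired link whose signature matches its parameters."""
    expected = _signature(job_id, org_id, expires_at)
    return now < expires_at and hmac.compare_digest(sig, expected)


url = sign_export_url("job_123", "org_42", expires_at=1_750_000_900)
sig = url.rsplit("sig=", 1)[1]
print(url)
print(verify_export_url("job_123", "org_42", 1_750_000_900, sig, now=1_750_000_000))
```

Changing the tenant or the expiry invalidates the signature, so a leaked link cannot be extended or reused by another tenant.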

Security & Privacy

CognitiveCoefficient — Security & Privacy Architecture

Threat posture: a multi-tenant data platform serving Free→Government tiers, holding (a) valuable proprietary indices/forecasts, (b) customer billing data, and (c) sensitive personal career-exposure assessments (PII). The crown jewels are tenant isolation and PII protection; the published indices are public-facing, but unpublished forecasts and bulk extracts are gated.

---

1. Authentication

  • Users: OIDC/OAuth2 (platform IdP + optional SSO/SAML for Enterprise/Government/University tenants). Passwords (where used) are hashed with argon2id; MFA (TOTP/WebAuthn) is available, and mandatory for admin/owner and all Government tenants.
  • Sessions: short-lived JWT access tokens (≤15 min) + rotating refresh tokens in HttpOnly, Secure, SameSite=Strict cookies. Token revocation list checked at the gateway.
  • API keys: prefixed (cc_live_/cc_test_), high-entropy, shown once, stored only as a hash (key_hash). Keys carry scopes, tier, expires_at; rotation and revocation are first-class (revoked_at). Per-key last_used_at enables stale-key detection.

2. Authorization

  • RBAC within a tenant (owner/admin/member) + scope-based authz on API keys (read:index, run:simulate, read:assess…).
  • Tier enforcement at the gateway: horizon limits, export formats, endpoint access (e.g. subnational deep-dives gated to Government, read:assess to the owning tenant only).
  • Deny-by-default; every privileged action authorized server-side (never trust the client).

3. Tenant isolation

  • Shared-schema, tenant-scoped: every OLTP row carries org_id, and Postgres Row-Level Security policies enforce org_id = current_setting('app.org_id') on all tenant tables. This is defense-in-depth, so an app-layer bug can't cross tenants. app.org_id is set per request from the authenticated principal.
  • Analytical data is public/global (indices/indicators), so the warehouse isn't tenant-partitioned — but tenant-specific artifacts (saved scenarios, assessments, exports) are org_id-scoped and RLS-protected.
  • Enterprise/Government can opt into stronger isolation: dedicated schema or dedicated DB, and data-residency pinning (orgs.data_residency) for region-locked deployments.
  • Export jobs are row/column-scoped to the requesting tenant's entitlements; signed URLs are short-lived and single-tenant.

4. PII handling (personal assessments)

Career assessments contain self-reported role/skills/region — treated as sensitive PII:

  • Data minimization: collect only what the assessment needs; anonymous assessments allowed (no user_id).
  • Encryption at rest of assessments.inputs (column-level / envelope encryption via KMS) — separate from the public analytical store.
  • Logical separation: PII lives in the OLTP/assessment-svc boundary, never copied into the analytical warehouse or training data without aggregation/anonymization.
  • Retention & right-to-erasure: retention_expires_at TTL auto-purges; user-initiated delete cascades and is logged. GDPR/CCPA DSAR (access/export/delete) supported.
  • No PII in logs/telemetry: scrubbing middleware; assessment inputs never logged in plaintext. Aggregate/anonymized stats only for analytics.
  • Consent: explicit consent gate before an assessment is stored; clear processing notice.
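
A minimal sketch of the log-scrubbing idea: redact assessment fields before a record reaches logs or telemetry. The sensitive key list is taken from the /assess request example plus the assessments.inputs column; the function name and the redaction marker are hypothetical, and real middleware would hook into the logging pipeline instead of being called by hand.

```python
# Keys treated as PII; drawn from the /assess request example in this document.
SENSITIVE_KEYS = {"occupation_soc", "region", "skills", "tenure_years", "inputs"}


def scrub(value):
    """Recursively replace sensitive fields with a redaction marker."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if key in SENSITIVE_KEYS else scrub(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [scrub(item) for item in value]
    return value


log_record = {
    "route": "/assess",
    "status": 200,
    "body": {
        "occupation_soc": "43-3031.00",
        "region": "US-TX",
        "skills": ["bookkeeping", "quickbooks"],
        "tenure_years": 8,
    },
}
print(scrub(log_record))
```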

5. Secrets management

  • All secrets (DB creds, source API tokens, KMS keys, Stripe keys, signing keys) in a secrets manager (Vault / cloud secrets), injected at runtime — never in images, env files in git, or code. (Homelab tier: secrets in gitignored secrets/, never committed — consistent with repo policy.)
  • Short-lived, rotated DB credentials; per-service least-privilege DB roles. Key/secret rotation runbooks.

6. Transport & network security

  • TLS 1.2+ everywhere, HSTS, modern cipher suites; TLS terminated at the edge (Cloudflare/Caddy). v1: Caddy on netops01 auto-manages certs for *.cognitivecoefficient.com.
  • WAF + rate limiting + bot/DDoS at the edge; API gateway adds per-key/tier quotas (Redis token bucket) — abuse and scraping protection for the valuable data.
  • Network segmentation: services in private subnets; only the gateway/edge is public. mTLS within the service mesh (target). DB/warehouse never internet-exposed.
  • CORS locked to first-party origins for cookie-auth routes; API-key routes are origin-agnostic but key-bound.

7. Application security

  • Input validation via Pydantic v2 on every endpoint; parameterized queries only (no string-built SQL), which closes off the primary SQL-injection vector.
  • Standard headers: CSP, X-Content-Type-Options, X-Frame-Options/frame-ancestors, Referrer-Policy.
  • Dependency scanning (Dependabot/Trivy), image signing (cosign), SAST in CI, periodic pen-tests. Pinned, scanned base images (distroless).
  • GraphQL query-cost/depth limits per tier to prevent expensive-query DoS.

8. Audit & compliance

  • Append-only audit_log for security-relevant events: logins, key create/revoke, exports, assessment access/delete, admin actions, tier changes — with actor, org_id, IP, UA, timestamp.
  • Centralized, tamper-evident log shipping; retention per compliance needs.
  • Compliance targets: SOC 2 Type II (Enterprise/Government readiness), GDPR/CCPA (PII), and government-tenant requirements (e.g. FedRAMP-style controls, data residency) as the customer base warrants.
  • Incident-response runbook, breach-notification process, and regular access reviews (stale keys, dormant accounts).
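
One common way to make an append-only log tamper-evident is to chain each entry to the hash of the previous one. The sketch below illustrates that technique under assumptions: the document does not say which mechanism audit_log uses, and the entry layout and function names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash preceding the first entry


def _digest(prev_hash, event):
    body = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode()).hexdigest()


def append_event(log, event):
    """Append an event whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev_hash": prev_hash,
                "hash": _digest(prev_hash, event)})


def verify_chain(log):
    """Recompute every hash; any edited or removed entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log = []
append_event(audit_log, {"action": "key.create", "org_id": "org_42", "actor": "u_1"})
append_event(audit_log, {"action": "export.create", "org_id": "org_42", "actor": "u_1"})
print(verify_chain(audit_log))
```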

Roadmap

CognitiveCoefficient — Roadmap (v1 → global standard)

Each phase is gated by a concrete, demonstrable milestone. The architecture (consolidated FastAPI → cloud-native mesh) is decomposed in step with these phases, never ahead of need.

---

Phase 0 — v1 Live (shipped) ✅

  • Consolidated FastAPI app serving the terminal (server-rendered pages + static charts) and the JSON API at www.cognitivecoefficient.com (Caddy/TLS on netops01).
  • DuckDB + curated Parquet/CSV (app/data/) as the analytical engine; offline collector scripts for the seed sources; precomputed indices + forecast artifacts in app/ml.
  • Core endpoints live (/country /state /city /company /industry /occupation /index /forecast /scenario /assess /career /simulate), API-key auth, Free/Pro tier gating.
  • Exit: real data for a credible seed set of countries/industries/occupations, all routes 200, public indices browseable.

Phase 1 — Data depth & credibility (0–3 mo)

  • Full ETL coverage: all 13 sources ingesting; entity-resolution crosswalks; validation gates (Great Expectations) and anomaly/confidence scoring.
  • Methodology v1 published (transparent weights, versioned) — the credibility cornerstone.
  • Postgres for OLTP; user accounts, API keys, billing (Stripe) live; Pro tier self-serve.
  • Exit: defensible, sourced, reproducible indices across ≥150 countries + subnationals + key industries/occupations; methodology whitepaper out.

Phase 2 — Product depth & the terminal (3–6 mo)

  • Next.js frontend replacing server pages: geo explorer (deck.gl maps), forecast fans, scenario sliders, cmd-k terminal, watchlists/alerts.
  • Forecasting matured: full model championship (Bass/logistic/Gompertz/Prophet/XGBoost/PyMC), backtest cards, uncertainty bands everywhere.
  • Scenario engine GA (parametric + Monte-Carlo); personal assessment + career recommendations polished.
  • Exit: a genuinely "Bloomberg-terminal-feeling" product; Pro conversions; first API customers.

Phase 3 — Scale & platform hardening (6–12 mo)

  • Migrate analytical serving to ClickHouse; lakehouse on object store (Parquet/Iceberg); Airflow orchestration replaces cron; MLflow model registry.
  • Carve out compute/PII-heavy services (forecast-, scenario-, assessment-svc); API gateway + rate-limit/billing externalized; GraphQL gateway.
  • K8s + Terraform + ArgoCD; observability (OTel/Prometheus/Grafana/Sentry); SOC 2 Type II readiness; PII controls (encryption, RLS, erasure) audited.
  • Exit: sub-100ms reads at scale, multi-tenant isolation proven, first Enterprise contracts and an API business line.

Phase 4 — Authority & the standard (12–24 mo)

  • University + Government tiers: citable methodology (DOIs), policy-scenario modeling, subnational/city depth, on-prem/region-locked options.
  • Annual flagship "State of the Intelligence Economy" report + live index (the OECD-AI-Index / Stanford-AI-Index analog) — press, citations, authority.
  • Bayesian scenario tier GA; cross-source rigor; data-quality SLAs.
  • Exit: CC cited by researchers, journalists, and at least one government/IGO using it for policy; recognized methodology.

Phase 5 — Global reference / Bloomberg-of-AI (24 mo+)

  • Network-effect flywheel: free public index drives authority → Enterprise/Gov pay to align to "the CC standard"; API embedded across fintech/HR-tech/edtech.
  • Real-time/continuous indicators (HF/GitHub/arXiv live feeds), expanded coverage (every country + major subnationals + thousands of companies/occupations), partner data marketplace.
  • Multi-region deployment, full compliance (SOC 2, GDPR/CCPA, gov frameworks), white-label/embedded offerings.
  • North star: when a strategist, minister, or investor wants the authoritative read on the intelligence economy, CognitiveCoefficient is the default — the Bloomberg Terminal / OECD AI Index / World Bank AI Observatory of machine intelligence, in one platform.