281
Entities
36
Editions
9,536
Country-Year Records
1,071,603
Parsed Data Fields
Search, browse, and analyze all 1,071,603 fields across 36 years

ABOUT About This Archive

The CIA World Factbook was published continuously from 1962 until its discontinuation on February 4, 2026. This archive preserves every publicly available edition from 1990 to 2025 in a structured, queryable SQL Server database.

Data was collected from three sources: Project Gutenberg and CIA original text files (1990s plain text editions), the Wayback Machine (2000-2020 HTML archives), and the factbook/cache.factbook.json GitHub repository (2021-2025 JSON snapshots).


| Years | Source | Format |
| --- | --- | --- |
| 1990-1999 | Project Gutenberg + CIA original (1996) | Plain text (4 format variants) |
| 2000-2020 | Wayback Machine | HTML zip archives (5 parser generations) |
| 2021-2025 | GitHub (factbook/cache.factbook.json) | JSON with year-end git snapshots |

SCHEMA Database Schema

| Table | Rows | Description |
| --- | --- | --- |
| MasterCountries | 281 | Canonical entities with EntityType, ISO codes |
| Countries | 9,536 | Per-year country records |
| CountryCategories | 83,682 | Section headings (Geography, Economy, etc.) |
| CountryFields | 1,071,603 | Individual data fields (~238 MB) |
| FieldNameMappings | 1,132 | Maps 1,132 variants to 416 canonical names |


RESTORE How to Restore

The data is provided as SQL INSERT scripts compatible with SQL Server 2017+. The CountryFields table is split into 36 gzipped files by year (~59 MB total compressed).

To restore the archive locally:

1. Create a database: CREATE DATABASE CIA_WorldFactbook
2. Run schema/create_tables.sql
3. Import data/master_countries.sql, then countries.sql, categories.sql, field_name_mappings.sql
4. Decompress and import each data/fields/country_fields_YYYY.sql.gz
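
The restore steps above can be scripted. The following is a minimal sketch, not the project's own tooling: it decompresses the per-year field scripts and builds one `sqlcmd` invocation per SQL file in the order listed. The `data/` prefix on `countries.sql`, `categories.sql`, and `field_name_mappings.sql`, the `localhost` server name, and the function names are assumptions; the database itself must already exist (step 1), and authentication flags are left out.

```python
import gzip
import shutil
from pathlib import Path

# Import order taken from the numbered steps. The data/ prefix on the last
# three scripts is an assumption; adjust to the actual repository layout.
BASE_SCRIPTS = [
    "schema/create_tables.sql",
    "data/master_countries.sql",
    "data/countries.sql",
    "data/categories.sql",
    "data/field_name_mappings.sql",
]


def decompress_fields(repo: Path) -> list[Path]:
    """Gunzip each data/fields/country_fields_YYYY.sql.gz next to itself."""
    extracted = []
    pattern = "country_fields_*.sql.gz"
    for gz_path in sorted((repo / "data" / "fields").glob(pattern)):
        sql_path = gz_path.with_suffix("")  # drops only the trailing .gz
        with gzip.open(gz_path, "rb") as src, open(sql_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        extracted.append(sql_path)
    return extracted


def restore_commands(repo: Path, server: str = "localhost",
                     database: str = "CIA_WorldFactbook") -> list[list[str]]:
    """Return sqlcmd argument lists in restore order (nothing is executed)."""
    scripts = [repo / s for s in BASE_SCRIPTS] + decompress_fields(repo)
    return [["sqlcmd", "-S", server, "-d", database, "-i", str(path)]
            for path in scripts]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the connection details are confirmed; returning the commands instead of running them makes the order easy to inspect first.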

ETL ETL Pipeline & Python Scripts

The raw CIA World Factbook changed format at least 10 times between 1990 and 2025. The ETL pipeline downloads original source data, parses every format variant, and loads structured results into SQL Server via pyodbc.


| Script | Years | What It Does |
| --- | --- | --- |
| build_archive.py | 2000–2020 | Downloads HTML zips from the Wayback Machine, detects which of 5 HTML layouts each year uses, and parses fields |
| load_gutenberg_years.py | 1990–2001 | Parses plain-text Gutenberg editions with 4 distinct format variants (tagged, asterisk, at-sign, colon); 1996 supplemented by CIA original text |
| reload_json_years.py | 2021–2025 | Checks out year-end git commits from factbook/cache.factbook.json and loads structured JSON |
| build_field_mappings.py | All | Maps 1,132 raw field name variants to 416 canonical names using a 7-rule system |
| classify_entities.py | All | Auto-classifies 281 entities into 9 types based on Dependency Status and Government Type fields |
| validate_integrity.py | All | Read-only validation with 9 checks: field count benchmarks, ground truth, year-over-year consistency |
| parse_field_values.py | All | Decomposes 1,071,603 text blobs into 1,775,588 typed sub-values using 55 field-specific parsers + SourceFragment provenance (land/water, male/female, sex ratios, literacy, revenues/expenditures, age brackets, CO2 emissions, etc.) |
| validate_field_values.py | All | Validates parsed FieldValues: coverage (98.9%), numeric extraction rate (61.5%), spot checks against known ground truth |

NEW Structured Field Data (FieldValues)

The 1,071,603 fields in CountryFields store raw text blobs with pipe (|) delimiters separating sub-fields. The structured parsing pipeline decomposes these into 1,775,588 individually queryable, typed sub-values across 2,599 distinct sub-fields using 55 dedicated parsers. Each row includes a SourceFragment showing the exact text slice that produced the value. This enables SQL queries that were previously impossible without per-query regex.
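
To make the decomposition concrete, here is a minimal sketch of splitting one pipe-delimited blob into typed sub-values while keeping the source fragment. It is not one of the 55 production parsers: the `label: number unit` shape of each fragment, the `FieldValue` class, and the `split_blob` function are assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldValue:
    sub_field: str              # label before the colon, e.g. "land"
    numeric: Optional[float]    # parsed number, or None if none was found
    unit: str                   # text following the number, e.g. "sq km"
    source_fragment: str        # exact text slice the value came from


# Leading number with optional thousands separators and decimals,
# followed by free text treated as the unit.
_NUMBER = re.compile(r"(-?\d[\d,]*(?:\.\d+)?)\s*(.*)")


def split_blob(blob: str) -> list[FieldValue]:
    """Split a pipe-delimited field blob into sub-values (assumed format)."""
    values = []
    for fragment in blob.split("|"):
        fragment = fragment.strip()
        if not fragment:
            continue
        label, sep, rest = fragment.partition(":")
        if not sep:  # no label present; treat the whole fragment as the value
            label, rest = "value", fragment
        match = _NUMBER.match(rest.strip())
        if match:
            numeric = float(match.group(1).replace(",", ""))
            unit = match.group(2).strip()
        else:
            numeric, unit = None, ""
        values.append(FieldValue(label.strip(), numeric, unit, fragment))
    return values
```

For a made-up blob such as `"total: 1,000 sq km | land: 900 sq km | water: 100 sq km"`, this yields three records whose `source_fragment` is the untouched slice, so a parsed number can always be traced back to the text that produced it.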

New queries enabled: Land vs water area, male vs female life expectancy, age structure brackets (0-14, 15-64, 65+), budget revenues vs expenditures, dependency ratios (youth/elderly), urbanization rate, elevation extremes, land use composition, multi-year GDP/inflation/unemployment breakdowns, and more.
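
As an illustration of the first query type, the sketch below runs a land-versus-water comparison against a tiny in-memory table. Everything about the table is assumed: the column names (`Country`, `Year`, `FieldName`, `SubField`, `NumericVal`) are hypothetical rather than the archive's actual FieldValues schema, the rows are invented placeholders, and `sqlite3` is used only so the example runs standalone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical column layout; the real FieldValues table may differ.
conn.execute(
    """CREATE TABLE FieldValues (
           Country TEXT, Year INTEGER, FieldName TEXT,
           SubField TEXT, NumericVal REAL)"""
)
# Invented placeholder rows, not Factbook data.
conn.executemany(
    "INSERT INTO FieldValues VALUES (?, ?, ?, ?, ?)",
    [
        ("Exampleland", 2020, "Area", "land", 900.0),
        ("Exampleland", 2020, "Area", "water", 100.0),
        ("Samplestan", 2020, "Area", "land", 450.0),
        ("Samplestan", 2020, "Area", "water", 50.0),
    ],
)

# Pivot the land and water sub-values into one row per country.
LAND_WATER_QUERY = """
SELECT Country,
       SUM(CASE WHEN SubField = 'land'  THEN NumericVal END) AS land,
       SUM(CASE WHEN SubField = 'water' THEN NumericVal END) AS water
FROM FieldValues
WHERE FieldName = 'Area' AND Year = ?
GROUP BY Country
ORDER BY Country
"""


def land_vs_water(connection: sqlite3.Connection, year: int) -> list[tuple]:
    """Return (country, land, water) rows for one edition year."""
    return connection.execute(LAND_WATER_QUERY, (year,)).fetchall()
```

The conditional-aggregation pattern works because each sub-value is its own row; against raw text blobs the same comparison would need a regex per query.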

Download: factbook.db (~662 MB) from Release v3.5 — single self-contained database
Live dashboard: worldfactbookarchive.org/analysis/structured-data — interactive charts with SQL and source data tabs
PARSING Why Parsing Was So Difficult

The CIA never maintained a stable schema. Every few years the HTML layout changed completely, field names were renamed without notice, and entire categories were restructured.

1990–1999 (Plain Text)
Four different formatting conventions across the decade. 1990–1993 used indented fields. 1994 introduced tagged markers. 1996 switched to bare section headers. 1999 changed the delimiter scheme again. Each variant required its own regex-based parser.

2000–2020 (HTML)
The CIA redesigned the Factbook website at least 5 times. The 2000 edition used inline <b> formatting. By 2004 it switched to table layouts. 2008 introduced CollapsiblePanel JavaScript widgets. 2014 changed to expand/collapse sections. 2017 moved to field-anchor div structures. A parser that worked on 2006 data would produce garbage on 2010 data.
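
Layout detection of this kind can be sketched as an ordered marker search. This is an illustrative guess at the approach, not the logic in build_archive.py: `CollapsiblePanel` and `field-anchor` come from the description above, while the `expandcollapse` marker, the layout labels, and the `detect_layout` function are assumptions.

```python
# (layout label, marker substring), checked newest first because later
# layouts may still contain older constructs such as <table> or <b>.
LAYOUT_MARKERS = [
    ("field_anchor_2017", "field-anchor"),
    ("expand_collapse_2014", "expandcollapse"),   # hypothetical marker
    ("collapsible_panel_2008", "collapsiblepanel"),
    ("table_2004", "<table"),
    ("inline_bold_2000", "<b>"),
]


def detect_layout(html: str) -> str:
    """Return the label of the first layout whose marker appears in html."""
    lowered = html.lower()
    for label, marker in LAYOUT_MARKERS:
        if marker in lowered:
            return label
    raise ValueError("unrecognized Factbook HTML layout")
```

Raising on an unknown layout, rather than falling back to a default parser, surfaces a new redesign immediately instead of silently producing bad fields.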

Field Name Drift
The CIA renamed fields silently over the decades. “GDP - real growth rate” became “Real GDP growth rate.” “Telephones” split into “Telephones - fixed lines” and “Telephones - mobile cellular.” The field mapping script tracks all 1,132 variants through 7 rule layers.
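
A simplified sketch of how such a mapping can work is shown below, covering only three of the mapping types (identity, dash format, rename). The single rename entry is the example given above; the function names and the two-step order are assumptions, not the 7-rule system in build_field_mappings.py.

```python
import re

# Old name -> current canonical name. Only the example from the text is
# listed; the real table is far larger.
RENAMES = {
    "GDP - real growth rate": "Real GDP growth rate",
}


def normalize_dashes(name: str) -> str:
    """Collapse ' -- ' and ' - ' separators to a single spaced dash."""
    return re.sub(r"\s+-{1,2}\s+", " - ", name.strip())


def canonical_name(raw: str) -> tuple[str, str]:
    """Return (canonical name, mapping type) for a raw field name."""
    cleaned = normalize_dashes(raw)
    if cleaned in RENAMES:
        return RENAMES[cleaned], "Rename"
    if cleaned != raw.strip():
        return cleaned, "Dash format"
    return cleaned, "Identity"
```

Normalizing dashes before the rename lookup means a double-dash variant of an old name still resolves to the modern canonical name in one pass.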

EDITIONS Year-by-Year Breakdown
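
| Year | Source | Countries | Fields |
| --- | --- | --- | --- |
| 1990 | TEXT | 249 | 15,750 |
| 1991 | TEXT | 247 | 14,903 |
| 1992 | TEXT | 264 | 17,372 |
| 1993 | TEXT | 266 | 18,509 |
| 1994 | TEXT | 266 | 18,761 |
| 1995 | TEXT | 266 | 19,599 |
| 1996 | TEXT | 266 | 20,764 |
| 1997 | TEXT | 266 | 23,405 |
| 1998 | TEXT | 266 | 23,524 |
| 1999 | TEXT | 266 | 25,178 |
| 2000 | HTML | 267 | 25,724 |
| 2001 | TEXT | 265 | 27,281 |
| 2002 | HTML | 268 | 27,430 |
| 2003 | HTML | 268 | 28,676 |
| 2004 | HTML | 271 | 28,958 |
| 2005 | HTML | 271 | 28,728 |
| 2006 | HTML | 262 | 28,950 |
| 2007 | HTML | 259 | 29,096 |
| 2008 | HTML | 261 | 30,753 |
| 2009 | HTML | 260 | 30,818 |
| 2010 | HTML | 262 | 30,805 |
| 2011 | HTML | 262 | 33,634 |
| 2012 | HTML | 262 | 35,183 |
| 2013 | HTML | 267 | 36,729 |
| 2014 | HTML | 267 | 36,679 |
| 2015 | HTML | 266 | 36,868 |
| 2016 | HTML | 268 | 36,804 |
| 2017 | HTML | 268 | 37,046 |
| 2018 | HTML | 268 | 37,285 |
| 2019 | HTML | 268 | 37,394 |
| 2020 | HTML | 268 | 36,687 |
| 2021 | JSON | 260 | 39,714 |
| 2022 | JSON | 260 | 37,344 |
| 2023 | JSON | 260 | 37,558 |
| 2024 | JSON | 260 | 34,838 |
| 2025 | JSON | 260 | 32,594 |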

ENTITIES Entity Types

| Type | Count | Description |
| --- | --- | --- |
| sovereign | 192 | Independent states |
| territory | 65 | Dependencies, overseas territories |
| misc | 7 | Oceans, World, European Union |
| disputed | 6 | Kosovo, Gaza Strip, West Bank, etc. |
| crown dep. | 3 | Jersey, Guernsey, Isle of Man |
| freely assoc. | 3 | Marshall Islands, Micronesia, Palau |
| special admin | 2 | Hong Kong, Macau |
| dissolved | 2 | Netherlands Antilles, Serbia and Montenegro |
| antarctic | 1 | Antarctica |

FIELDS Field Name Standardization

The CIA renamed many fields over the 36-year span. The FieldNameMappings table maps 1,132 raw field name variants to 416 canonical names:

| Mapping Type | Count | Description |
| --- | --- | --- |
| Identity | 184 | Modern field names (unchanged) |
| Rename | 159 | CIA renamed the field (e.g. “GDP - real growth rate” → “Real GDP growth rate”) |
| Dash format | 64 | Formatting differences (single vs double dashes) |
| Consolidation | 48 | Sub-fields merged into parents (e.g. Oil → Petroleum) |
| Country-specific | 354 | Regional sub-entries, government body names |
| Noise | 281 | Parser artifacts, fragments (flagged IsNoise=1) |

ROADMAP Recently Shipped
v3.4 — Structured Field Parsing — 1,775,588 sub-values extracted from raw text blobs into individually typed, queryable records. Each numeric value, unit, and label is now independently chartable and rankable across all 36 years. Interactive dashboard at /structured with Chart, SQL, and Source tabs.
World Leaders Database — Comprehensive leadership data for 200+ countries with governance analysis, power concentration metrics, and security apparatus tracking.
CIA Studies in Intelligence — Full-text searchable archive of declassified CIA journal articles with publication analytics and topic trends.
Geopolitical Atlas — Territorial disputes, infrastructure overlays, and OSINT missile facility data on interactive maps.
Demographics — Population pyramids with country comparison, animation, and historical overlay.
Scatter Plot Analysis — Multi-indicator scatter with regression lines, outlier detection, and region filtering.
In Development
StarDict dictionaries — 36 offline dictionaries (1990-2025) for KOReader, GoldenDict-ng, and other StarDict-compatible apps. ~2.1M entries in per-field format.
Dashboard builder — Custom analytical dashboards with drag-and-drop chart layout
Analytics redesign — Traffic, audience, and security monitoring with the Dark Intelligence theme

Note
All data originates from the CIA World Factbook, a public-domain U.S. Government publication. This project is not affiliated with the CIA or U.S. Government.