etl¶

Run OWID's ETL client.

Create ETL step templates, compare different datasets, generate dependency visualisations, synchronise charts across different servers, import datasets from non-ETL OWID sources, improve your metadata, etc.

Note: For a UI experience, refer to CLI etlwiz.

Running etl with no subcommand opens the unified browser for searching steps and snapshots.

Usage:

[1metl[0m [[1;36mOPTIONS[0m] [1;36mCOMMAND[0m [[1;36mARGS[0m]...

Options:

Name	Type	Description	Default
`--help`	boolean	Show this message and exit.	`False`

Subcommands

anomalist: Detect anomalies.
approve: Automatically approve chart diffs with identical data. This is done by taking their configs and replacing variable IDs with hashes of their data.
archive: Archive one or more steps.
autoupdate: Automatically update data snapshots and optionally create a pull request with the changes.
chart-sync: Sync Grapher charts and revisions from an environment to the main environment.
compare: Compare two dataframes/tables/datasets in terms of their structure, values and metadata.
d: Run development tools.
diff: Compare all datasets from two catalogs and print out a summary of their differences.
graphviz: Generate a Graphviz DOT file to see all dependencies.
harmonize: Generate a dictionary with the mapping of country names to OWID's canonical names.
indicator-upgrade: Indicator upgrader CLI.
inspector: Check explorer, multidim views, chart configs, and posts (including articles, topic pages, and data insights) for typos and semantic issues.
metadata-export: Export dataset, tables & indicator metadata in YAML format.
owidbot: Post result of etl diff to Github PR.
pr: This script creates a new draft pull request in GitHub, which starts a new staging server.
run: Generate datasets by running their corresponding ETL steps.
snapshot: Create snapshot from a snapshot script or .dvc file.
update: Update one or more steps to their new version, if possible.

anomalist¶

Detect anomalies.

Usage:

[1metl anomalist[0m [[1;36mOPTIONS[0m]

Options:

Name	Type	Description	Default
`--anomaly-types`	choice (`time_change` \| `upgrade_change` \| `upgrade_missing` \| `gp_outlier`)	Type (or types) of anomaly detection algorithm to use.	None
`--dataset-ids`	integer	Generate anomalies for the variables of a specific dataset ID (or multiple dataset IDs).	None
`--variable-mapping`	text	Optional JSON dictionary mapping variable IDs from a previous to a new version (where at least some of the new variable IDs must belong to the datasets whose IDs were given).	``
`--variable-ids`	integer	Generate anomalies for a list of variable IDs (in addition to the ones from dataset ID, if any dataset was given).	None
`--dry-run` / `--no-dry-run`	boolean	Do not write to target database.	`False`
`--force`, `-f`	boolean	TBD	`False`
`--reset-db` / `--no-reset-db`	boolean	Drop anomalies table and recreate it. This is useful for development when the schema changes.	`False`
`--sample-n`	integer	Sample at most N variables from a dataset	`500`
`--append` / `--no-append`	boolean	Append anomalies to existing records instead of replacing them.	`False`
`--help`	boolean	Show this message and exit.	`False`

approve¶

Automatically approve chart diffs with identical data. This is done by taking their configs and replacing variable IDs with hashes of their data.

If the configs are then identical, the chart is approved.

The comparison process: 1. Fetches all pending chart diffs (not yet approved/rejected) 2. For each chart, compares the normalized config between environments 3. Approves charts where configs are identical 4. Reports results

Usage:

[1metl approve[0m [[1;36mOPTIONS[0m]

Options:

Name	Type	Description	Default
`--dry-run`	boolean	Preview which charts would be approved without actually approving them.	`False`
`--chart-id`	integer	Specific chart ID(s) to check. Can be specified multiple times. If not provided, checks all pending charts.	`Sentinel.UNSET`
`--verbose`, `-v`	boolean	Show detailed config differences for charts that differ between environments.	`False`
`--no-rounding`	boolean	Disable intelligent rounding before comparing data (require exact match).	`False`
`--use-max-year-hash`	boolean	Use only max year from each indicator for comparison (ignores data values).	`False`
`--help`	boolean	Show this message and exit.	`False`

archive¶

Archive one or more steps.

This tool lets you move one or more data steps from their active to their archive dag.

Examples:

Note: Remove the --dry-run if you want to actually write to the dag.

To archive a single step:
```
$ etl archive data://meadow/aviation_safety_network/2022-10-12/aviation_statistics --dry-run
```
Note that, since no steps are using this snapshot, the new snapshot will be added to the temporary dag.

To archive not only that step, but also the steps that use it:

$ etl archive data://meadow/aviation_safety_network/2022-10-12/aviation_statistics --include-usages --dry-run

Usage:

[1metl archive[0m [[1;36mOPTIONS[0m] [[1;36mSTEPS[0m]...

Options:

Name	Type	Description	Default
`--include-usages`	boolean	Archive also steps that are directly using the given steps. Default: False.	`False`
`--dry-run`	boolean	Do not write to dag. Default: False.	`False`
`--interactive` / `--non-interactive`	boolean	Skip user interactions (for confirmation and when there is ambiguity). Default: False.	`False`
`--help`	boolean	Show this message and exit.	`False`

autoupdate¶

Automatically update data snapshots and optionally create a pull request with the changes.

Main use case: Run all autoupdate-enabled snapshots, update their data if needed, and create a PR if there are changes.

Examples:

# Run all autoupdate snapshots, update data, and create PRs if needed
etl autoupdate --create-pr

# Run in dry-run mode (no changes will be made)
etl autoupdate --dry-run

# Only process snapshots matching a filter
etl autoupdate --filter "population"

Usage:

[1metl autoupdate[0m [[1;36mOPTIONS[0m]

Options:

Name	Type	Description	Default
`--dry-run`	boolean	Run in dry-run mode. No snapshot scripts will be executed and no files will be changed.	`False`
`--create-pr`	boolean	If there is an update, create a pull request with the changes.	`False`
`--filter`	text	Process only snapshots whose name includes the given substring.	`Sentinel.UNSET`
`--help`	boolean	Show this message and exit.	`False`

chart-sync¶

Sync Grapher charts and revisions from an environment to the main environment.

It syncs the charts and revisions from SOURCE to TARGET. This is especially useful for syncing work from staging servers to production.

SOURCE and TARGET can be either name of staging servers (e.g. "staging-site-mybranch") or paths to .env files or repo/commit hash if you want to get the branch name from merged pull request. Use ".env.prod.write" as TARGET to sync to live.

Note 1: The dataset (with the new chart's underlying indicators) from SOURCE must exist in TARGET. This means that you have to merge your work to master and wait for the ETL to finish running all steps.

Considerations on charts:

You get a notification if the chart has been modified on live after staging server was created.
If the chart is pending in chart-diff, you'll get a warning and Slack notification
Deleted charts are not synced.
Use --ignore-conflicts to sync approved charts ignoring conflicts. Useful when syncing between staging servers.

Considerations on tags:

Tags are synced for both new and existing charts.

Example 1: Run chart-sync in dry-run mode to see what charts will be updated

$ etl chart-sync staging-site-my-branch .env.prod.write --dry-run

Example 2: Run it for real

etl chart-sync staging-site-my-branch .env.prod.write

Example 3: Sync only one chart

etl chart-sync staging-site-my-branch .env.prod.write --chart-id 123 --dry-run

Example 4: Ignore conflicts when syncing between staging servers (useful if conflicts with master have already been dealt with in a subbranch server)

etl chart-sync staging-site-my-subbranch staging-site-baseline-branch --ignore-conflicts --dry-run

Example 5: Archive datasets that have no charts (dry-run)

etl chart-sync staging-site-my-branch .env.prod.write --archive --dry-run

Considerations on archiving:

Use --archive to automatically archive datasets in TARGET that have no charts using their indicators.
Only datasets that were part of the synced charts are considered for archiving.
Datasets are only archived if the corresponding ETL dataset is in the archive DAG (not in the active DAG).
Archived datasets are not deleted, just marked as archived.

Usage:

[1metl chart-sync[0m [[1;36mOPTIONS[0m] [1;36mSOURCE[0m [1;36mTARGET[0m

Options:

Name	Type	Description	Default
`--chart-id`	integer	Sync *only* the chart with this id.	`Sentinel.UNSET`
`--include`	text	Include only charts with variables whose catalogPath matches the provided string.	None
`--exclude`	text	Exclude charts with variables whose catalogPath matches the provided string.	None
`--dry-run` / `--no-dry-run`	boolean	Do not write to target database.	`False`
`--ignore-conflicts` / `--no-ignore-conflicts`	boolean	Sync approved charts even when conflicts are detected. Useful when syncing between staging servers.	`False`
`--archive` / `--no-archive`	boolean	Archive datasets in target that have no charts using their indicators.	`True`
`--skip-errors` / `--no-skip-errors`	boolean	Skip charts that fail during sync instead of aborting. Errors are logged.	`False`
`--help`	boolean	Show this message and exit.	`False`

compare¶

Compare two dataframes/tables/datasets in terms of their structure, values and metadata.

Usage:

[1metl compare[0m [[1;36mOPTIONS[0m] [1;36mCOMMAND[0m [[1;36mARGS[0m]...

Options:

Name	Type	Description	Default
`--absolute-tolerance`	float	The absolute tolerance for floating point comparisons.	`1e-08`
`--relative-tolerance`	float	The relative tolerance for floating point comparisons.	`0.05`
`--show-values` / `--hide-values`	boolean	Show a preview of the values where the dataframes are different.	`False`
`--show-shared` / `--hide-shared`	boolean	Show the structural overlap of the two dataframes (shared columns, index columns and index values).	`False`
`--truncate-lists-at`	integer	Print truncated lists if they are longer than the given length.	`20`
`--help`	boolean	Show this message and exit.	`False`

Subcommands

dataframes: Compare two DATAFRAME1 with DATAFRAME2.
grapher: Compare a dataset in the local database with the remote database.
table:

dataframes¶

Compare two DATAFRAME1 with DATAFRAME2.

It compares the columns, index columns and index values (row indices) as sets between the two dataframes and outputs the differences. Finally it compares the values of the overlapping columns and rows with the given threshold values for absolute and relative tolerance.

The exit code is: - 0 if the dataframes are equal - 1 if there is an error loading the dataframes - 2 if the dataframes are structurally equal but are otherwise different - 3 if the dataframes have different structure and/or different values.

Usage:

[1metl compare dataframes[0m [[1;36mOPTIONS[0m] [1;36mDATAFRAME1[0m [1;36mDATAFRAME2[0m

Options:

Name	Type	Description	Default
`--help`	boolean	Show this message and exit.	`False`

grapher¶

Compare a dataset in the local database with the remote database.

It loads the dataset from grapher/NAMESPACE/VERSION/DATASET. It compares dataset and variables metadata, and optionally the values from S3 with (use the --values flag for this). It does the comparison in the same way as the etl-catalog command.

The exit code is always 0 even if dataframes are different.

Examples:

compare  --show-values grapher ggdc 2020-10-01 ggdc_maddison__2020_10_01 --values

Usage:

[1metl compare grapher[0m [[1;36mOPTIONS[0m] [1;36mNAMESPACE[0m [1;36mVERSION[0m [1;36mDATASET[0m

Options:

Name	Type	Description	Default
`--remote-env`	path	Path to .env file with remote database credentials.	`.env.prod`
`--local-env`	path	Path to .env file with remote database credentials.	`.env`
`--values`	boolean	Compare values from S3 (can be both CPU and memory heavy!).	`False`
`--help`	boolean	Show this message and exit.	`False`

table¶

Compare a table in the local catalog with the analogous one in the remote catalog.

The table in the local catalog is loaded from CHANNEL/NAMESPACE/DATASET/{version}/TABLE. The value for {version} is given by the option --version. If not given, the latest local version of the dataset is compared with the latest remote version of the same dataset.

It compares the columns, index columns and index values (row indices) as sets between the two dataframes and outputs the differences. Finally it compares the values of the overlapping columns and rows with the given threshold values for absolute and relative tolerance.

The exit code is: - 0 if the tables are equal - 1 if there is an error loading the tables - 2 if the tables are structurally equal but are otherwise different - 3 if the tables have different structure and/or different values.

Usage:

[1metl compare table[0m [[1;36mOPTIONS[0m] [1;36mCHANNEL[0m [1;36mNAMESPACE[0m [1;36mDATASET[0m [1;36mTABLE[0m

Options:

Name	Type	Description	Default
`--version`	text	Version of catalog dataset to compare with.	None
`--debug`	boolean	Print debug information.	`False`
`--help`	boolean	Show this message and exit.	`False`

d¶

Run development tools.

Usage:

[1metl d[0m [[1;36mOPTIONS[0m] [1;36mCOMMAND[0m [[1;36mARGS[0m]...

Options:

Name	Type	Description	Default
`--help`	boolean	Show this message and exit.	`False`

Subcommands

housekeeper: Keep things in OWID catalog clean by regularly checking and reviewing content.
map-datasets: Temporary script that prints the grapher dataset pairs that need to be (manually) given to the chart upgrader, based
profile-cpu: This script runs certain step of the ETL pipeline and profiles memory or CPU usage of its run function line by line. You can additionally specify other functions to profile.
publish: Publish the generated data catalog to S3.
reindex: Create a catalog-[channel].feather file inside etl/data with all tables in each channel.
run-python-step: Import and run a specific step of the ETL.
scan-chart-diff: Scan all open PRs in the etl repository and run
version-tracker: Check that all DAG dependencies (e.g. make sure no step is missing).

housekeeper¶

Keep things in OWID catalog clean by regularly checking and reviewing content.

Usage:

[1metl d housekeeper[0m [[1;36mOPTIONS[0m]

Options:

Name	Type	Description	Default
`--review-type`, `-t`	choice (`chart`)	Type of the review	`Sentinel.UNSET`
`--channel`, `-c`	text	Name of the slack channel to send the message to. If None, #chart-reviews will be used	`Sentinel.UNSET`
`--dev`	boolean	Dev mode: replaces Slack mentions with code-formatted names so nobody gets pinged. Requires --channel.	`False`
`--help`	boolean	Show this message and exit.	`False`

map-datasets¶

Temporary script that prints the grapher dataset pairs that need to be (manually) given to the chart upgrader, based on the committed changes in your current git branch.

NOTE: * This script should eventually be part of the new indicator upgrader, but for now it can be helpful as a CLI. * The logic may be more complicated than needed. It may suffice to find the newly created grapher datasets that do not yet have charts, and then attempt to find their corresponding previous version. But some of the code can be useful for other reasons (for example in the new chart diff tool).

Usage:

[1metl d map-datasets[0m [[1;36mOPTIONS[0m]

Options:

Name	Type	Description	Default
`--help`	boolean	Show this message and exit.	`False`

profile-cpu¶

This script runs certain step of the ETL pipeline and profiles memory or CPU usage of its run function line by line. You can additionally specify other functions to profile.

Example: Profile CPU usage of run function of the step:

etl d profile --cpu garden/biodiversity/2024-01-25/cherry_blossom

Example: Profile specific functions (excludes run for cleaner output):

etl d profile --cpu garden/biodiversity/2024-01-25/cherry_blossom -f calculate_multiple_year_average
etl d profile --cpu garden/biodiversity/2024-01-25/cherry_blossom -f etl.helpers.PathFinder.load_dataset
etl d profile --cpu garden/biodiversity/2024-01-25/cherry_blossom -f etl.data_helpers.geo.RegionAggregator.__init__

To profile grapher upserts, it is better to use cProfile and run something like this:

ssh owid@staging-site-my-branch "cd etl && uv run python -m cProfile -s cumtime etl/command.py grapher://grapher/biodiversity/2024-01-25/cherry_blossom --grapher --only --force --workers 1" | head -n 100

Usage:

[1metl d profile-cpu[0m [[1;36mOPTIONS[0m] [1;36mSTEP[0m

Options:

Name	Type	Description	Default
`--cpu`	boolean	Enable CPU profiling.	`False`
`--mem`	boolean	Enable memory profiling.	`False`
`-f`, `--functions`	text	Specify functions to profile (step functions or full paths like etl.helpers.PathFinder.load_dataset). Excludes 'run' for cleaner output.	`Sentinel.UNSET`
`--help`	boolean	Show this message and exit.	`False`

publish¶

Publish the generated data catalog to S3.

Usage:

[1metl d publish[0m [[1;36mOPTIONS[0m]

Options:

Name	Type	Description	Default
`--dry-run`	boolean	Preview the datasets to sync without actually publishing them.	`False`
`--private`, `-p`	boolean	Publish private catalog.	`False`
`--bucket`, `-b`	text	Bucket name.	`owid-catalog`
`--channel`, `-c`	choice (`snapshot` \| `garden` \| `meadow` \| `grapher` \| `open_numbers` \| `examples` \| `explorers` \| `external` \| `multidim`)	Publish only selected channel (subfolder of data/), push all by default.	`('snapshot', 'garden', 'meadow', 'grapher', 'open_numbers', 'examples', 'explorers', 'external', 'multidim')`
`--help`	boolean	Show this message and exit.	`False`

reindex¶

Create a catalog-[channel].feather file inside etl/data with all tables in each channel.

This enables catalog.search to be aware of what datasets currently exists. So, if for example you create a new dataset locally, you won't be able to find it in your local catalog unless you re-run reindex.

Usage:

[1metl d reindex[0m [[1;36mOPTIONS[0m]

Options:

Name	Type	Description	Default
`--channel`, `-c`	choice (`snapshot` \| `garden` \| `meadow` \| `grapher` \| `open_numbers` \| `examples` \| `explorers` \| `external` \| `multidim`)	Reindex only the selected channel(s) (subfolders of data/)	`('snapshot', 'garden', 'meadow', 'grapher', 'open_numbers', 'examples', 'explorers', 'external', 'multidim')`
`--include`	text	Reindex only datasets matching pattern	`Sentinel.UNSET`
`--help`	boolean	Show this message and exit.	`False`

run-python-step¶

Import and run a specific step of the ETL.

Meant to be ran as a subprocess by the main etl command. There's a quite big overhead (~3s) from importing all packages again in the new subprocess.

Usage:

[1metl d run-python-step[0m [[1;36mOPTIONS[0m] [1;36mURI[0m [1;36mDEST_DIR[0m

Options:

Name	Type	Description	Default
`--ipdb`	boolean	N/A	`False`
`--help`	boolean	Show this message and exit.	`False`

scan-chart-diff¶

Scan all open PRs in the etl repository and run etl owidbot etl/branch --services chart-diff against them.

Usage:

[1metl d scan-chart-diff[0m [[1;36mOPTIONS[0m]

Options:

Name	Type	Description	Default
`--dry-run` / `--no-dry-run`	boolean	Print to console, do not post to Github.	`False`
`--help`	boolean	Show this message and exit.	`False`

version-tracker¶

Check that all DAG dependencies (e.g. make sure no step is missing).

Run all version tracker sanity checks.

Usage:

[1metl d version-tracker[0m [[1;36mOPTIONS[0m]

Options:

Name	Type	Description	Default
`--skip-db`	boolean	True to skip connecting to the database of the current environment. False to try to connect to DB, to get a better informed picture of what steps may be missing or archivable. If not connected, all checks will be based purely on the content of the ETL dag.	`False`
`--warn-on-archivable`	boolean	True to warn about archivable steps. By default this is False, because we currently have many archivable steps.	`False`
`--warn-on-unused`	boolean	True to warn about unused steps (i.e. steps that may be up-to-date, but not yet used anywhere, and hence can potentially be archived). By default this is False, because we currently have many unused steps.	`False`
`--help`	boolean	Show this message and exit.	`False`

diff¶

Compare all datasets from two catalogs and print out a summary of their differences.

Compare all the datasets from catalog in PATH_A with all the datasets in catalog PATH_B. The catalog paths link to the data/ folder with all the datasets (it contains a catalog.meta.json file)

You can also use a path to a dataset.

Note that you can use the keyword "REMOTE" as the path, if you want to run a comparison with the remote catalog.

This tool is useful as a quick way to see what has changed in the catalog and whether our updates don't have any unexpected side effects.

Note: This command differs from etl compare in that it compares all the datasets and not two specific ones.

How does it work?

It uses source checksums to find candidates for comparison. Source checksum includes all files used to generate the dataset and should be sufficient to find changed datasets, just note that we're not using checksum of the files themselves. So if you change core ETL code or some of the dependencies, e.g. change in owid-datautils-py, core ETL code or updating library version, the change won't be detected. In cases like these you should increment ETL version which is added to all source checksums (not implemented yet).

Example 1: Compare the remote catalog with a local one for changed files

$ etl diff REMOTE data/ --changed

Example 2: Compare the remote catalog with a local one

$ etl diff REMOTE data/ --include maddison

Example 3: Compare two local catalogs

$ etl diff other-data/ data/ --include maddison

Usage:

[1metl diff[0m [[1;36mOPTIONS[0m] [1;36mPATH_A[0m [1;36mPATH_B[0m

Options:

Name	Type	Description	Default
`--channel`, `-c`	choice (`snapshot` \| `garden` \| `meadow` \| `grapher` \| `open_numbers` \| `examples` \| `explorers` \| `external` \| `multidim`)	Compare only selected channel (subfolder of data/).	`['garden', 'meadow', 'grapher']`
`--changed`	boolean	Only compare datasets with changes in git. This can significantly speed it up.	`False`
`--include`	text	Compare only datasets matching pattern.	`Sentinel.UNSET`
`--cols`	text	Compare only columns matching pattern.	`Sentinel.UNSET`
`--tables`	text	Compare only tables matching pattern.	`Sentinel.UNSET`
`--exclude`, `-e`	text	Exclude datasets matching pattern.	`Sentinel.UNSET`
`--verbose`, `-v`	boolean	Print more detailed differences.	`False`
`--snippet`	boolean	Print code snippet for loading both tables, useful for debugging in notebook	`False`
`--country`	text	Filter tables by country if it is in the index.	`Sentinel.UNSET`
`--workers`, `-w`	integer	Use multiple threads.	`1`
`--help`	boolean	Show this message and exit.	`False`

graphviz¶

Generate a Graphviz DOT file to see all dependencies.

Saves the output as a file in OUTPUT_PATH.

Usage:

[1metl graphviz[0m [[1;36mOPTIONS[0m] [1;36mOUTPUT_FILE[0m

Options:

Name	Type	Description	Default
`--filter`	text	Filter the DAG by regex	`Sentinel.UNSET`
`--targets`	boolean	Show target nodes.	`False`
`--help`	boolean	Show this message and exit.	`False`

harmonize¶

Generate a dictionary with the mapping of country names to OWID's canonical names.

Harmonize the country names in COLUMN of a DATA_FILE (CSV or feather) and save the mapping to OUTPUT_FILE as a JSON file. The harmonization process is done according to OWID's canonical country names.

The harmonization process is done interactively, where the user is prompted with a list of ambiguous country names and asked to select the correct country name from a list of suggestions (ranked by similarity).

When the mapping is ambiguous, you can use:

Choose Option [custom] to enter a custom name.
Type Ctrl-C to exit and save the partially complete mapping

If a mapping file already exists, it will resume where the mapping file left off.

Usage:

[1metl harmonize[0m [[1;36mOPTIONS[0m] [1;36mDATA_FILE[0m [1;36mCOLUMN[0m [1;36mOUTPUT_FILE[0m

Options:

Name	Type	Description	Default
`--institution`, `-i`	text	Append '(institution)' to countries	None
`--num-suggestions`, `-n`	integer	Number of suggestions to show per entity. Default is 5	`5`
`--help`	boolean	Show this message and exit.	`False`

indicator-upgrade¶

Indicator upgrader CLI.

This CLI provides tools for managing indicator upgrades, including: - Matching variables between old and new datasets - Upgrading indicators in the database - Undoing indicator upgrades - Automatically detecting and upgrading dataset migrations

Usage:

[1metl indicator-upgrade[0m [[1;36mOPTIONS[0m] [1;36mCOMMAND[0m [[1;36mARGS[0m]...

Options:

Name	Type	Description	Default
`--help`	boolean	Show this message and exit.	`False`

Subcommands

auto: Automatically detect and upgrade dataset migrations.
match: Match variable IDs from an old dataset to a new dataset.
undo: Undo the last indicator upgrade.
upgrade: Upgrade indicators to use new variable mappings.

auto¶

Automatically detect and upgrade dataset migrations.

This command combines dataset detection, variable matching, and upgrade into a single workflow. It detects which datasets have been updated based on version tracker changes, matches variables between old and new versions, and applies the upgrades.

By default, only perfect matches (100% similarity) are processed automatically. Use --threshold to adjust the similarity threshold for automatic matching.

Usage:

[1metl indicator-upgrade auto[0m [[1;36mOPTIONS[0m]

Options:

Name	Type	Description	Default
`--dry-run`	boolean	Preview changes without applying them.	`False`
`--perfect-only`	boolean	Only match and upgrade indicators with perfect similarity (100%% match).	`False`
`--threshold`	float	Similarity threshold (0-100) for automatic mapping. Default: 100.0 (perfect match only)	`100.0`
`-s`, `--similarity-name`	text	Name of similarity function to use when fuzzy matching variables. Default: partial_ratio. Available methods: token_set_ratio, token_sort_ratio, partial_ratio, partial_token_set_ratio, partial_token_sort_ratio, ratio, quick_ratio, weighted_ratio.	`partial_ratio`
`--interactive`	boolean	Prompt for confirmation before processing each migration.	`False`
`--archived`	boolean	Include archived datasets in the detection.	`False`
`--help`	boolean	Show this message and exit.	`False`

match¶

Match variable IDs from an old dataset to a new dataset.

After a dataset has been uploaded to OWID's MySQL database, we need to pair new variable IDs with the old ones, so that all graphs update properly.

If the variable names are identical, the task is trivial: find indexes of old variables and map them to the indexes of the identical variables in the new dataset. However, if variable names have changed (or the number of variables have changed) the pairing may need to be done manually. This CLI helps in either scenario.

Usage:

[1metl indicator-upgrade match[0m [[1;36mOPTIONS[0m]

Options:

Name	Type	Description	Default
`--dry-run`	boolean	Print the mappings without applying them.	`False`
`-old`, `--old-dataset-id`	integer	Old dataset ID (as defined in grapher).	`Sentinel.UNSET`
`-new`, `--new-dataset-id`	integer	New dataset ID (as defined in grapher).	`Sentinel.UNSET`
`-s`, `--similarity-name`	text	Name of similarity function to use when fuzzy matching variables. Default: partial_ratio. Available methods: token_set_ratio, token_sort_ratio, partial_ratio, partial_token_set_ratio, partial_token_sort_ratio, ratio, quick_ratio, weighted_ratio.	`partial_ratio`
`-a`, `--add-identical-pairs`	boolean	If given, add variables with identical names in both datasets to the comparison. If not given, omit such variables and assume they should be paired.	`False`
`-m`, `--max-suggestions`	integer	Number of suggestions to show per old variable. That is, for every old variable at most [--max-suggestions] suggestions will be listed.	`10`
`--no-interactive`	boolean	Skip interactive prompts and automatically map variables based on similarity threshold. Best matches above the threshold will be selected automatically.	`False`
`--auto-threshold`	float	Similarity threshold (0-100) for automatic mapping when --no-interactive is used. Default: 100.0	`100.0`
`--quiet`	boolean	Minimal output - only show essential results	`False`
`--perfect-match-only`	boolean	Only match indicators with perfect similarity (100%% match). Automatically sets --no-interactive and --auto-threshold=100.0	`False`
`--help`	boolean	Show this message and exit.	`False`

undo¶

Undo the last indicator upgrade.

This command will reverse the most recent variable mapping operation, restoring charts and other references to use the previous variable IDs.

Usage:

[1metl indicator-upgrade undo[0m [[1;36mOPTIONS[0m]

Options:

Name	Type	Description	Default
`--dry-run`	boolean	Preview changes without applying them	`False`
`--help`	boolean	Show this message and exit.	`False`

upgrade¶

Upgrade indicators to use new variable mappings.

This command will apply the variable mappings stored in the database to update all charts and other references to use the new variable IDs.

The variable mappings must have been previously created using the 'match' command or through the Streamlit UI.

Usage:

[1metl indicator-upgrade upgrade[0m [[1;36mOPTIONS[0m]

Options:

Name	Type	Description	Default
`--dry-run`	boolean	Preview changes without applying them	`False`
`--help`	boolean	Show this message and exit.	`False`

inspector¶

Check explorer, multidim views, chart configs, and posts (including articles, topic pages, and data insights) for typos and semantic issues.

Examples:

# Check specific slug (with --slug or -s)
$ etl inspector -s natural-disasters

# For simplicity, the slug applies to all content types (explorers, mdims and posts); you can specify one (or more) content type (with --type or -t)
$ etl inspector -s natural-disasters -t post

# etl inspector -s natural-disasters -t post -t explorer

# You can check multiple slugs
$ etl inspector -s global-food -s natural-disasters

# Use different models (with --model or -m); currently available models are haiku=fast/cheap, sonnet=balanced, opus=best quality/more expensive
$ etl inspector -s global-food -m sonnet
$ etl inspector -s natural-disasters -m opus

# Save results to an output file (with --output-file or -o); it auto-resumes if interrupted, skipping already inspected content
$ etl inspector -o issues.csv
$ etl inspector -s global-food -o food_issues.csv

# Limit number of views (useful for testing/cost control)
$ etl inspector -l 10
$ etl inspector -s natural-disasters -l 5 -m haiku

# Skip specific checks
$ etl inspector --skip-issues  # Only run typo checks with codespell (skipping AI inspection); this option has no cost
$ etl inspector --skip-typos  # Only run semantic checks (skipping codespell inspection); this option has costs

# Estimate costs without running (dry run)
$ etl inspector --dry-run
$ etl inspector -s global-food -m opus --dry-run

# Enable experimental grouping/pruning (for large result sets)
$ etl inspector --enable-grouping

# Debug: see exact prompts sent to Claude
$ etl inspector -s energy -l 1 --display-prompt

Usage:

[1metl inspector[0m [[1;36mOPTIONS[0m]

Options:

Name	Type	Description	Default
`--slug`, `-s`	text	Filter by specific slug (explorer, multidim, chart, or post). Can be specified multiple times (e.g., '-s global-food -s covid-boosters')	`Sentinel.UNSET`
`--type`, `-t`	choice (`explorer` \| `multidim` \| `chart` \| `post`)	Filter by content type. Can be specified multiple times (e.g., '-t post -t explorer'). Useful when a slug exists in multiple types.	`Sentinel.UNSET`
`--model`, `-m`	choice (`haiku` \| `sonnet` \| `opus`)	Claude model to use for issue detection (default: sonnet). Haiku is fastest/cheapest, Sonnet is balanced, Opus is highest quality.	`Sentinel.UNSET`
`--skip-typos`	boolean	Skip typo checking (codespell)	`False`
`--skip-issues`	boolean	Skip semantic issue checking (Claude API)	`False`
`--enable-grouping`	boolean	Enable grouping and pruning of similar issues (EXPERIMENTAL: may not work well with large numbers of collections)	`False`
`--output-file`, `-o`	path	Save issues to CSV file	`Sentinel.UNSET`
`--limit`, `-l`	integer	Limit number of views to analyze (useful for testing to reduce API costs)	None
`--dry-run`	boolean	Estimate API costs without making actual API calls	`False`
`--display-prompt`	boolean	Print the exact prompts sent to Claude API	`False`
`--help`	boolean	Show this message and exit.	`False`

metadata-export¶

Export dataset, tables & indicator metadata in YAML format.

Given a DATASET_PATH, load the corresponding dataset and export its metadata in YAML format (including table and indicator metadata). The metadata file and can be later edited manually. If the output YAML already exists, it will be updated with new values.

When can this be useful? - This is useful when some metadata fields have been created dynamically in the code and you want to see the final result. - To prefill the YAML metadata file with the list of indicators and tables in the dataset. Note that, when first created, an ETL step is not yet aware of the columns of the tables of the dataset. It only knows that once you've executed th step.

Example 1: Save to YAML file etl/steps/data/garden/ggdc/2020-10-01/ggdc_maddison.meta.yml

etl metadata-export data/garden/ggdc/2020-10-01/ggdc_maddison

Example 2: Show output instead of saving the file

etl metadata-export data/garden/ggdc/2020-10-01/ggdc_maddison --show

Usage:

[1metl metadata-export[0m [[1;36mOPTIONS[0m] [1;36mDATASET_PATH[0m

Options:

Name	Type	Description	Default
`-o`, `--output`	path	Save output into YAML file. If not specified, save to *.meta.yml	`Sentinel.UNSET`
`--show` / `--no-show`	boolean	Show output instead of saving it into a file.	`False`
`--decimals`	text	Add display.numDecimalPlaces to all numeric variables. Use integer or `auto` for autodetection. Disable with `no`.	`auto`
`--help`	boolean	Show this message and exit.	`False`

owidbot¶

Post result of etl diff to Github PR.

Example:

$ etl owidbot etl/my-branch --services data-diff chart-diff --dry-run
$ etl owidbot owid-grapher/my-branch --services grapher --dry-run

Usage:

[1metl owidbot[0m [[1;36mOPTIONS[0m] [1;36mREPO_BRANCH[0m

Options:

Name	Type	Description	Default
`--services`	text	N/A	`Sentinel.UNSET`
`--include`	text	Include datasets matching this regex.	`garden`
`--dry-run` / `--no-dry-run`	boolean	Print to console, do not post to Github.	`False`
`--help`	boolean	Show this message and exit.	`False`

pr¶

This script creates a new draft pull request in GitHub, which starts a new staging server.

Arguments:

TITLE: The title of the PR. This must be given.

CATEGORY: The category of the PR. This is optional. If not given, the user will be prompted to choose one.

Main use case: Branch out from master to a temporary work_branch, and create a PR to merge work_branch -> master. You will be asked to choose a category. The value of work_branch will be auto-generated based on the title and the category.

# Without specifying a category (you will be prompted for a category)
etl pr "some title for the PR"

# With a category
etl pr "some title for the PR" data

# With private stating server
etl pr "some title for the PR" --private

Custom use case (1): Same as main use case, but with a specific branch name for the work_branch.

etl pr "some title for the PR" --work-branch "this-temporary-branch"
# Shorter
etl pr "some title for the PR" -w "this-temporary-branch"

Custom use case (2): Create a pull request from current_branch to master.

etl pr "some title for the PR" --direct

Custom use case (3): Create a pull request from branch this-temporary-branch -> develop.

etl pr "some title for the PR" --direct --base-branch "develop" --work-branch "this-temporary-branch"
# Shorter
etl pr "some title for the PR" --direct -b "develop" -w "this-temporary-branch"

Custom use case (4): Create the new branch in a sibling git worktree so you can keep editing your current branch in parallel.

etl pr "some title for the PR" --worktree
# Shorter
etl pr "some title for the PR" -t
# With a custom path
etl pr "some title for the PR" -t --worktree-path /tmp/etl-mybranch

The new working directory is printed at the end (default: ../etl-BRANCH); cd into it to start working there.

Custom use case (5): Share the original repo's data/ directory with the new worktree, so ETL steps don't have to recompute population, regions, etc.

etl pr "some title for the PR" -t --share-data

This makes the new worktree's data/ a shortcut (symlink) to the original repo's data/, so both worktrees share the same ETL outputs and you don't have to recompute them. Note that data/ is a symlink to the original repo's data/, so: - If you run the same steps in both worktrees, they may overwrite each other's output. - DO NOT use rm -rf data/; this would wipe both the symlink and the original data folder. Instead, use git worktree remove ../etl-[whatever-branch] to remove a worktree.

After the command finishes, uv sync has already run inside the worktree, so its .venv/ is ready to use. With a chpwd hook in your ~/.zshrc that sources .venv/bin/activate whenever present, cd ../etl-BRANCH is all that's needed — activation is automatic. Without the hook, also run source .venv/bin/activate after the cd. Skipping activation silently routes etl/etlr to the original repo's source code.

See the docs (Working on multiple branches in parallel) for full details and tips.

Usage:

[33m                                                                              [0m 
 [1metl pr                                                                        [0m 
 [[1;36mOPTIONS[0m] [1;36mTITLE[0m                                                                
 [[[1;36mdata[0m|[1;36mbug[0m|[1;36mrefactor[0m|[1;36menhance[0m|[1;36mfeature[0m|[1;36mdocs[0m|[1;36mchore[0m|[1;36mstyle[0m|[1;36mwip[0m|[1;36mtests[0m]]

Options:

Name	Type	Description	Default
`--scope`, `-s`	text	Scope of the PR (only relevant if --title is given). This text will be preprended to the PR title. Examples: 'demography' for data work on this field, 'etl.db' if working on specific modules, 'wizard', etc.	None
`--work-branch`, `-w`	text	The name of the work branch to create. It is auto-generated based on the title and the category. If --direct is used, this is the PR source branch and defaults to the current branch.	None
`--base-branch`, `-b`	text	Name of the base branch. This is the branch to branch out from and merge back into. If --direct is used, this is the PR target branch.	`master`
`--direct`, `-d`	boolean	Directly create a PR from the current branch to the target branch (default: master).	`False`
`--private`, `-p`	boolean	By default, staging server site (not admin) will be publicly accessible. Use --private to have it private instead. This does not apply when using --direct mode.	`False`
`--no-llm`, `-n`	boolean	We briefly use LLMs to simplify the title and use it in the branch name. Disable this by using -n flag.	`False`
`--worktree`, `-t`	boolean	Create the new branch in a sibling git worktree (default: ../etl-BRANCH) instead of mutating the current working tree. Useful for working on multiple branches in parallel.	`False`
`--worktree-path`	text	Override the worktree directory (only with --worktree). Defaults to ../etl-BRANCH.	None
`--share-data`	boolean	Symlink the new worktree's data/ to the original repo's data/ (only with --worktree). Avoids recomputing upstream ETL steps. Don't run heavy ETL ops in both worktrees concurrently, and never `rm -rf data/` in the worktree.	`False`
`--help`	boolean	Show this message and exit.	`False`

run¶

Generate datasets by running their corresponding ETL steps.

Run all ETL steps in the DAG matching the value of STEPS. A match is a dataset with an uri that contains the value of any of the words in STEPS.

Example 1: Run steps matching "mars" in the DAG file:

$ etl run mars

Example 2: Preview those steps that match "mars" or "prio" (i.e. don't run them):

$ etl run mars prio

Example 3: If you only want to preview what would be executed, use the --dry-run flag:

$ etl run mars prio --dry-run

Usage:

[1metl run[0m [[1;36mOPTIONS[0m] [[1;36mSTEPS[0m]...

Options:

Name	Type	Description	Default
`--dry-run`	boolean	Preview the steps without actually running them.	`False`
`--force`, `-f`	boolean	Re-run the steps even if they appear done and up-to-date	`False`
`--private`, `-p`	boolean	Run private steps.	`False`
`--grapher`, `-g` / `--no-grapher`, `-ng`	boolean	Upsert datasets from grapher channel to DB (OWID staff only, DB access required)	`False`
`--export` / `--no-export`	boolean	Run export steps like saving explorer (OWID staff only, access required)	`False`
`--ipdb`	boolean	Run the debugger on uncaught exceptions.	`False`
`--only`, `-o`	boolean	Only run the selected step (no upstream dependencies).	`False`
`--exact-match`, `-x`	boolean	Steps should exactly match the arguments (if so, pass the steps with their full name, e.g. 'data://garden/.../step_name').	`False`
`--exclude`, `-e`	text	Comma-separated patterns to exclude	`Sentinel.UNSET`
`--dag-path`	path	Path to DAG yaml file	`/home/runner/work/etl/etl/dag/main.yml`
`--workers`, `-w`	integer	Parallelize execution of steps.	`1`
`--use-threads`, `-t` / `--no-threads`, `-nt`	boolean	Use threads when checking dirty steps and upserting to MySQL. Turn off when debugging.	`True`
`--strict`, `-s` / `--no-strict`, `-ns`	boolean	Force strict or lax validation on DAG steps (e.g. checks for primary keys in data steps).	None
`--watch`	boolean	Run ETL infinitely and update changed files.	`False`
`--continue-on-failure`	boolean	Continue running remaining steps if a step fails (steps depending on failed step will be skipped).	`False`
`--force-upload`	boolean	Always upload grapher data & metadata JSON files even if checksums match.	`False`
`--prefer-download`	boolean	Prefer downloading datasets from catalog instead of building them.	`False`
`--subset`	text	Filter to speed up development - works as regex for both data processing and grapher upload.	`Sentinel.UNSET`
`--help`	boolean	Show this message and exit.	`False`

snapshot¶

Create snapshot from a snapshot script or .dvc file.

DATASET_PATH can be provided in several formats: - Full path: namespace/version/short_name (e.g., tourism/2024-08-17/unwto_gdp) - Partial path: version/short_name (e.g., 2024-08-17/unwto_gdp) - Short name only: short_name (e.g., unwto_gdp) - Full file path: snapshots/namespace/version/short_name.py

The command will automatically find the corresponding .py script or .dvc file in the snapshots directory. If multiple matches are found, you'll need to provide a more specific path.

Run snapshot scripts in a standardized way. Supports three scenarios: 1. Scripts with main() function - runs module directly 2. Scripts with run() function - wraps in CLI with upload/path-to-file args 3. Scripts with only .dvc file - runs snap.create_snapshot()

Examples:

etl snapshot tourism/2024-08-17/unwto_gdp
etls tourism/2024-08-17/unwto_gdp
etls 2024-08-17/unwto_gdp
etls snapshots/tourism/2024-08-17/unwto_gdp.py

etl snapshot abs/2024-08-06/employee_earnings_and_hours_australia_2008 --path-to-file data.csv
etls abs/2024-08-06/employee_earnings_and_hours_australia_2008 --path-to-file data.csv
etls 2024-08-06/employee_earnings_and_hours_australia_2008 --path-to-file data.csv

etl snapshot dataset_name --skip-upload
etls dataset_name --skip-upload

etl snapshot dataset_name --dry-run
etls dataset_name --dry-run

Usage:

[1metl snapshot[0m [[1;36mOPTIONS[0m] [1;36mDATASET_PATH[0m

Options:

Name	Type	Description	Default
`--upload` / `--skip-upload`	boolean	Upload dataset to Snapshot	`True`
`--path-to-file`	text	Path to local data file (for manual upload scenarios)	`Sentinel.UNSET`
`--dry-run`	boolean	Preview what would happen without creating/uploading the snapshot	`False`
`--help`	boolean	Show this message and exit.	`False`

update¶

Update one or more steps to their new version, if possible.

This tool lets you update one or more snapshots or data steps to a new version. It will:

Create new folders and files for each of the steps.
Add the new steps to the dag, with the same header comments as their current version.

Notes:

Keep in mind that:

If there is ambiguity, the user will be asked for confirmation before updating each step, and on situations where there is some ambiguity.
If new snapshots are created that are not used by any steps, they are added to a temporary dag (temp.yml). These steps are then removed from the temporary dag as soon as they are used by an active step.
All dependencies of new steps will be assumed to use their latest version possible.
Steps whose version is already equal to the new version will be skipped.
Wildcard patterns (*, ?, [abc]) are supported in step names for batch operations.

Examples:

Note: Remove the --dry-run if you want to actually execute the updates in the examples below (but then remember to revert changes).

To update a single snapshot to the new version:
```
$ etl update snapshot://animal_welfare/2023-10-24/fur_laws.xlsx --dry-run
```
Note that, since no steps are using this snapshot, the new snapshot will be added to the temporary dag.

To update not only that snapshot, but also the steps that use it:

$ etl update snapshot://animal_welfare/2023-10-24/fur_laws.xlsx --include-usages --dry-run

To update all dependencies of the climate change impacts explorer:
```
$ etl update data://explorers/climate/latest/climate_change_impacts --include-dependencies --dry-run
```
Note that the code of the explorers step itself will not be updated (since it has version "latest"), but its dependencies will be updated in the dag.
To update all snapshots in a namespace/version directory using wildcards:
```
$ etl update "snapshot://climate/2025-07-18/*" --dry-run
```
This will update all snapshot files matching the pattern in the climate/2025-07-18 directory.

Usage:

[1metl update[0m [[1;36mOPTIONS[0m] [[1;36mSTEPS[0m]...

Options:

Name	Type	Description	Default
`--step-version-new`	text	New version for step. Default: 2026-05-21.	`2026-05-21`
`--include-dependencies`	boolean	Update also steps that are direct dependencies of the given steps. Default: False.	`False`
`--include-usages`	boolean	Update also steps that are directly using the given steps. Default: False.	`False`
`--dry-run`	boolean	Do not write to dag or create step files. Default: False.	`False`
`--interactive` / `--non-interactive`	boolean	Skip user interactions (for confirmation and when there is ambiguity). Default: False.	`False`
`--direct-only`	boolean	When used with --include-usages, only include steps with the same namespace/version/short_name pattern. Default: False.	`False`
`--help`	boolean	Show this message and exit.	`False`