Structuring YAML
Working with YAML files in VS Code
Install the YAML extension (by Red Hat) for VS Code to get syntax highlighting and autocompletion for YAML files. This extension will validate your files and highlight any syntax errors.
Creating YAML Files
Metadata YAML files are typically stored within a garden step as my_dataset.meta.yml. Their content is applied at the very end of any ETL step. Therefore, YAML files have "the final word" on the metadata of any step. The conventional structure is as follows:
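As a rough illustration, the sketch below follows the layout used by the other examples on this page: a `tables:` section with `variables:` nested under each table. The `dataset:` block, the `unit` field, and all values are assumptions for illustration, not a complete schema.

```yaml
# Hypothetical my_dataset.meta.yml; names and values are placeholders.
dataset:
  title: My dataset  # assumed dataset-level field

tables:
  my_table:
    variables:
      my_var_1:
        title: My variable
        unit: people  # assumed field, shown for illustration only
        description_short: A one-sentence description of the indicator.
```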
To generate a metadata YAML file with pre-populated variable names for an existing garden dataset, use the etl metadata-export command.
Run uv run etl metadata-export --help to see the available options.
Handling Multi-line Strings and Whitespace
Multi-line strings are a common source of confusion. YAML supports two primary styles for writing them (literal and folded), and it's up to you which one to use.
In addition, the "strip" chomping indicator, denoted with - after | or > (that is, |- or >-), removes the line breaks at the end of the string. This is almost always what you want.
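The effect of the chomping indicator can be seen in a small sketch. The key names below are hypothetical; the parsed values in the comments follow standard YAML block-scalar rules.

```yaml
# Parsed value: "Some text.\n" (default "clip" keeps one trailing line break)
without_strip: |
  Some text.

# Parsed value: "Some text." (the "-" strips the trailing line break)
with_strip: |-
  Some text.
```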
Literal style |
It is denoted with the | block style indicator. Line breaks in the YAML file are treated as line breaks.
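A minimal sketch of the literal style, using a hypothetical variable name. Each line break in the file is kept in the parsed string.

```yaml
tables:
  my_table:
    variables:
      my_var_1:
        # Parsed value: "First line.\nSecond line."
        description_short: |-
          First line.
          Second line.
```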
Note
This implies that lines of text in the YAML file can become very long; to read them in a text editor without scrolling left and right, enable "Word wrap" (Option+Z in VS Code on Mac).
Folded style >
It is denoted with the > block style indicator. Line breaks in the YAML file are treated like spaces; to create a line break, you need a double line break in the YAML file.
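A minimal sketch of the folded style, using a hypothetical variable name. Single line breaks are folded into spaces, and the blank line produces one line break in the parsed string.

```yaml
tables:
  my_table:
    variables:
      my_var_1:
        # Parsed value:
        # "This sentence is split over two lines.\nThis sentence starts on a new line."
        description_short: >-
          This sentence is split
          over two lines.

          This sentence starts on a new line.
```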
Note
This avoids having lines of text that are too long in the YAML file. However, if you want to rephrase a paragraph, you may need to manually rearrange the line breaks afterwards.
Anchors & aliases
Anchors (&) and aliases (*) are native YAML features that can be used to reduce repetition.
You can define anchors anywhere in the YAML file, but we typically define a special section called definitions at the very top of the file, and then use aliases to refer to these definitions.
An example that reuses attribution and description_key:
definitions:
  attribution: &attribution Fishcount (2018)
  description_key_common: &description_key_common
    - First line.
    - Second line.
  description_key_individual:
    - &third_line Third line.

tables:
  my_table:
    variables:
      my_var_1:
        description_key: *description_key_common
        presentation:
          attribution: *attribution
      my_var_2:
        description_key:
          - *description_key_common
          - *third_line
Note that description_key is a special case: you can use anchors and aliases both for the entire list of bullet points and for individual points. We have implemented some logic so that the result is always a flat list of bullet points.
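Assuming the flattening logic described above, the nested aliases of my_var_2 in the example should end up equivalent to the list sketched below. This is an illustration of the expected result, not output copied from the ETL.

```yaml
# Expected effective description_key for my_var_2 (flattened into a single list)
description_key:
  - First line.
  - Second line.
  - Third line.
```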
Common fields for all indicators
To avoid repetition across indicators, you can use a special section called common: under definitions:. This section sets the default metadata for an indicator wherever no specific metadata is defined for it under tables:. Using it saves you from repeating the same aliases in every indicator. Note that it doesn't merge metadata: a field defined for an indicator overwrites the common value for that field. If you need merging, use the <<: merge key.
definitions:
  common:
    display:
      numDecimalPlaces: 1
    description_key:
      - First line.
      - Second line.
    presentation:
      grapher_config:
        selectedEntityNames:
          - Germany
          - Italy
      topic_tags:
        - Energy

tables:
  my_table:
    variables:
      my_var_1:
        # Description will be Third line
        description_key:
          - Third line.
        # Display won't be inherited from common!
        display:
          name: My var
        presentation:
          # Tag will be Internet
          topic_tags:
            - Internet
      my_var_2:
        # Description will be First line, Second line
        # Tag will be Energy
        # Display will be inherited from common
Specific metadata in variables overrides the common metadata. If you want to merge the two instead, use the <<: merge key; keys set explicitly in the variable take precedence over the merged ones.
definitions:
  display: &common-display
    numDecimalPlaces: 1

tables:
  my_table:
    variables:
      my_var_1:
        display:
          name: My var
          <<: *common-display
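Under standard YAML merge-key semantics, the display block of my_var_1 above should resolve to the mapping sketched here, with both the explicit and the merged key present.

```yaml
# Effective value of tables.my_table.variables.my_var_1.display after the merge
display:
  name: My var
  numDecimalPlaces: 1
```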
You can also specify common for individual tables that would overwrite the common section under definitions.
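A possible layout for a table-level common section is sketched below. The exact placement (directly under the table name, next to variables:) and the values are assumptions for illustration; check an existing step before relying on it.

```yaml
definitions:
  common:
    display:
      numDecimalPlaces: 1

tables:
  my_table:
    # Assumed placement: defaults that apply only to the variables of my_table
    common:
      display:
        numDecimalPlaces: 2
    variables:
      my_var_1:
        # No display defined here, so the table-level common would apply
        title: My var
```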
Dynamic YAML
Anchors and aliases have limitations. One of the main ones is that you cannot use them for in-line text. That's why we've added support for dynamic-yaml, which lets you write YAML files like this:
definitions:
  additional_info: |-
    You should also know this.

tables:
  my_table:
    variables:
      my_var_1:
        description_short: |-
          This is a description.
          {definitions.additional_info}
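Assuming the placeholder is replaced by plain text substitution, the description of my_var_1 should end up equivalent to the sketch below.

```yaml
# Expected value of description_short for my_var_1 after substitution
description_short: |-
  This is a description.
  You should also know this.
```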
Using Jinja Templates for Advanced Cases
Even more complex metadata can be generated with Jinja templates. This is especially useful for datasets in long format with multiple dimensions, because Jinja lets you dynamically generate text (titles, descriptions, ...) based on the dimension values.
Note
We use a slightly flavoured Jinja, where we use <% if ... %> and << var >> instead of the defaults {% if ... %} and {{ var }}.
Find below a more complex example with dimension conflict_type. In this example, we use Jinja combined with dynamic YAML. Note that the dimension values are available through variables with the same name.
definitions:
  conflict_type_estimate: |-
    <% if conflict_type == "all" %>
    The << estimate >> estimate of the number of deaths...
    <% elif conflict_type == "inter-state" %>
    ...
    <% endif %>

tables:
  ucdp:
    variables:
      number_deaths_ongoing_conflicts_high:
        description_processing: |-
          <% set estimate = "high" %>
          {definitions.conflict_type_estimate}
      number_deaths_ongoing_conflicts_low:
        description_processing: |-
          <% set estimate = "low" %>
          {definitions.conflict_type_estimate}
It's also possible to do the same with Jinja macros. See the Jinja Macros section below and pick your favorite.
Whitespaces
Line breaks and whitespace can be tricky when using Jinja templates. We use reasonable defaults and strip whitespace, so in most cases you should be fine with plain <% and %>. In more complex cases, you might have to experiment with more fine-grained whitespace control using the tags <%- and -%>. This is most often needed in if-else blocks like this (note the - after <% for all clauses except the first one):
age: |-
  <% if age_group == "ALLAges" %>
  ...
  <%- elif age_group == "Age-standardized" %>
  ...
  <%- else %>
  ...
  <%- endif %>
An alternative to whitespace control is using the if statements in a single line, like this:
age: |-
  <% if age_group == "ALLAges" %>
  ...<%- elif age_group == "Age-standardized" %>
  ...<%- else %>
  ...<%- endif %>
Empty Jinja Output
A Jinja template that renders to an empty string (e.g. <% if cond %>val<% endif %> when cond is false) keeps the field as "". This is intentional — it lets you force-empty fields like subtitle or note, which Grapher reads as "render no subtitle" instead of falling back to description_short:
# variant == "estimates" → subtitle stays as "" → Grapher renders no subtitle
grapher_config:
  subtitle: "<% if variant != 'estimates' %>Future projections under the << variant >> scenario.<% endif %>"
If you don't want an empty result, add an <% else %> branch with the alternative content.
as_value for typed numeric fields
A few Grapher schema fields are strict-typed (e.g. yAxis.min, yAxis.max, comparisonLines[].yEquals). Plain Jinja always returns a string, so <% if cond %>90<% endif %> would render to "90" and fail validation. Use the as_value filter to coerce to int/float:
grapher_config:
  yAxis:
    min: "<% if age == '0' %><< 90 | as_value >><% endif %>"
    max: "<% if age == '0' %><< 120 | as_value >><% endif %>"
When the conditional branch doesn't fire, the as_value-marked field is dropped (rather than left as "") so the schema doesn't see a string where it expected a number. This drop also propagates: if every numeric field in a list-of-dicts entry uses as_value and renders empty, the resulting empty {} is removed (avoids shipping comparisonLines: [{}]).
The filter only makes sense on numeric fields — don't use it on strings.
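The drop behaviour for list entries described above could look like the sketch below. The `age` dimension and the value 100 are illustrative assumptions; only the as_value pattern itself comes from this page.

```yaml
grapher_config:
  comparisonLines:
    # yEquals is the only field in this entry and uses as_value, so when
    # age != '0' the entry is expected to be removed rather than left as {}.
    - yEquals: "<% if age == '0' %><< 100 | as_value >><% endif %>"
```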
Checking Metadata
The most straightforward way to check your metadata is in Admin, although that means waiting for your step to finish. There's a faster way to check your YAML file directly. Create a playground.ipynb notebook in the same folder as your YAML file and copy this to the first cell:
import etl.grapher.helpers as gh
dim_dict = {"age_group": "YEARS0-4", "sex": "Male", "cause": "Drug use disorders"}
d = gh.render_yaml_file("ghe.meta.yml", dim_dict=dim_dict)
d["tables"]["ghe"]["variables"]["death_count"]
An alternative is to examine the rendered VariableMeta of an existing dataset:
from owid.catalog import Dataset
from etl import paths
ds = Dataset(paths.DATA_DIR / "garden/emissions/2025-02-12/ceds_air_pollutants")
tb = ds['ceds_air_pollutants']
tb.emissions.m.render({'pollutant': 'CO', 'sector': 'Transport'})
Jinja Macros
Jinja macros are often a good way to avoid repetition in your metadata. Define macros in the macros: field and then import them with {macros}. For example:
macros: |-
  <% macro conflict_type_estimate(conflict_type, estimate) %>
  <% if conflict_type == "all" %>
  The << estimate >> estimate of the number of deaths...
  <% elif conflict_type == "inter-state" %>
  ...
  <% endif %>
  <% endmacro %>

tables:
  ucdp:
    variables:
      number_deaths_ongoing_conflicts_high:
        description_processing: |-
          {macros}
          << conflict_type_estimate(conflict_type, "high") >>
      number_deaths_ongoing_conflicts_low:
        description_processing: |-
          {macros}
          << conflict_type_estimate(conflict_type, "low") >>
Reusing definitions across tables through shared.meta.yml
If you have multiple *.meta.yml files that share the same metadata, you can put the shared definitions: and macros: into a shared.meta.yml file. All other *.meta.yml files can then use them.
For example, define a macro for formatting sex in shared.meta.yml:
macros: |-
  <% macro format_sex(sex) %>
    <%- if sex == "Both" -%>
      individuals
    <%- elif sex == "Male" -%>
      males
    <%- elif sex == "Female" -%>
      females
    <%- endif -%>
  <% endmacro %>
Then, in your *.meta.yml files, call it as a function.
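A call to the shared macro might look like the sketch below. The table and variable names are hypothetical, and the {macros} import line mirrors the pattern from the Jinja Macros section; whether it is required for macros coming from shared.meta.yml is an assumption.

```yaml
tables:
  my_table:
    variables:
      my_var_1:
        description_short: |-
          {macros}
          Number of deaths among << format_sex(sex) >>.
```

For a row with sex equal to "Male", the macro branch above yields "males".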
Metadata Postprocessing After Dataset Creation
In some cases, you may need to programmatically modify metadata after creating a dataset with create_dataset(). This pattern is useful when metadata modifications require dynamic content that cannot be easily expressed in YAML files.
Pattern
- Create the dataset using create_dataset()
- Retrieve the table from the created dataset
- Modify metadata programmatically (e.g., using regex, calculations, or conditional logic)
- Re-add the modified table to the dataset using ds.add()
- Save the dataset with ds.save()
Example
import re

# Create a new grapher dataset
ds_grapher = create_dataset(
    dest_dir, tables=[tb], check_variables_metadata=True, default_metadata=ds_garden.metadata
)

# Retrieve the table from the created dataset for metadata modification
tb = ds_grapher["my_table"]

# Modify metadata programmatically
for col in tb.columns:
    m = tb[col].m
    # Example: Replace placeholder text with dynamic values
    if m.title and "ANSWER" in m.title:
        # Take the first number found in the column name; skip columns without one
        answer_match = re.search(r"(\d+)", col)
        if answer_match is None:
            continue
        answer_num = answer_match.group(1)
        m.title = m.title.replace("ANSWER", answer_num)
        # description_short may be unset, so only replace when it exists
        if m.description_short:
            m.description_short = m.description_short.replace("ANSWER", answer_num)

# Re-add the modified table to the dataset
ds_grapher.add(tb)

# Save the changes
ds_grapher.save()
This approach allows you to combine the declarative power of YAML metadata with programmatic flexibility when needed.