# Datasets and data catalogs
FollowTheMoney entities are almost always grouped into datasets — the unit of provenance, versioning, and access control in most systems that use FtM data. A dataset describes where the entities came from, who published them, what time range they cover, and how the reader can obtain them.
Entity objects record dataset membership as part of their structure. A ValueEntity carries the set of datasets it belongs to as an entity-level datasets field. A StatementEntity records dataset membership per statement, so a single entity can carry values attributed to different sources. See Entity representations for the full distinction.
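For illustration, a ValueEntity serialized to JSON might carry its dataset memberships in an entity-level field like this; the identifier, name, and dataset slugs here are invented for this sketch:

```json
{
  "id": "person-4f2a",
  "schema": "Person",
  "properties": {
    "name": ["Jane Doe"]
  },
  "datasets": ["us_ofac_sdn", "eu_sanctions_list"]
}
```

In the statement-based form, each individual value would instead carry its own dataset attribution, which is how one entity can merge values from several sources.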
## Data catalogs
Metadata is published in two forms: as a standalone dataset index file, or as a data catalog. A catalog combines the metadata for multiple datasets into one document, with per-dataset metadata contained in a top-level datasets array.
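To make the catalog form concrete, a minimal catalog document could look like the following sketch; the dataset names, titles, and tags are invented for illustration:

```yaml
# Hypothetical catalog file with two dataset entries
datasets:
  - name: us_ofac_sdn
    title: US Sanctions List (SDN)
    tags:
      - list.sanction
      - issuer.us
  - name: eu_sanctions_list
    title: EU Consolidated Sanctions
    tags:
      - list.sanction
      - issuer.eu
```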
## Collections
A collection is a dataset whose purpose is to group other datasets. Its metadata looks like any other dataset — it has a name, title, description, and so on — but it also lists its member datasets under children (or the alias datasets: in catalog YAML). When a query or an export targets a collection, it resolves transitively to the leaf datasets the collection contains.
There is no separate schema or class for collections; they are regular Dataset instances that happen to contain others. The children field can be nested, so collections of collections are valid. A dataset can belong to multiple collections without duplication.
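Under those rules, a collection entry in catalog YAML might look like this sketch; the names are invented, and listing members by their name slugs is an assumption for illustration:

```yaml
# Hypothetical collection entry grouping two source datasets
- name: sanctions
  title: Consolidated Sanctions
  children:
    - us_ofac_sdn
    - eu_sanctions_list
```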
## Dataset metadata

The most important piece of metadata for any dataset is its `name`. Names are lowercase, underscore-separated short identifiers (e.g. `us_ofac_sdn`) used in the actual entity data to reference a data source. The metadata document (or a catalog entry) can contain the following fields:
| Section | Field | Description |
|---|---|---|
| `dataset` | `name` | The dataset's unique identifier |
| | `title` | Human-readable title |
| | `summary` | Short summary string |
| | `description` | Detailed description of the dataset, in Markdown syntax |
| | `tags` | List of tags assigned to this dataset |
| | `index_url` | URL of the dataset metadata file |
| | `version` | Latest dataset version. Each data update produces a new version ID, and version IDs can be relied on to be sortable strings. |
| | `last_change` | Timestamp when any entity in the dataset last changed. This marks when the system discovered the change, not when it was published at the source. Changes to the data cleaning tools may also be reflected here. |
| | `last_export` | Timestamp of the most recent dataset crawl and export. This is the time the process was started, not when the resulting data was uploaded to the public archive. |
| | `datasets` | All data sources (and enrichment datasets) included in this collection |
| | `resources` | Array of objects describing associated files, including exports and source data |
| | `updated_at` | Deprecated; use `last_export` instead |
| `coverage` | | Coverage metadata object |
| | `start` | Date when the dataset was first included in the database |
| | `countries` | List of the countries covered by this dataset |
| | `frequency` | One of: `never`, `hourly`, `daily`, `weekly`, `monthly`, `annually` |
| | `schedule` | A more precise (cron-style) specification of the update frequency |
| `publisher` | `name` | Publishing source name |
| | `acronym` | Publishing source acronym (e.g. OFAC) |
| | `description` | Detailed description of the publishing source, in Markdown |
| | `url` | Link to the publisher's home page |
| | `country` | Originating country (code) of the publishing source |
| | `country_label` | Originating country (name) of the publishing source |
| | `official` | `true` if the publisher is a government or inter-governmental organization |
| `resources` | `name` | Identifier for this export |
| | `url` | Direct download URL for the resource file |
| | `checksum` | SHA1 hash of the resource contents |
| | `mime_type` | The MIME type of the resource (e.g. `text/csv`) |
| | `mime_type_label` | Human-readable label for the MIME type |
| | `title` | Title of the resource |
| | `size` | Size of the resource in bytes |
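Putting the fields together, a dataset metadata document could look like this sketch; every value below is invented for illustration:

```json
{
  "name": "us_ofac_sdn",
  "title": "US Sanctions List (SDN)",
  "summary": "Illustrative metadata document",
  "tags": ["list.sanction", "issuer.us"],
  "version": "20240101120000-abc",
  "last_change": "2024-01-01T10:22:00",
  "last_export": "2024-01-01T12:00:00",
  "coverage": {
    "start": "2021-06-01",
    "countries": ["us"],
    "frequency": "daily"
  },
  "publisher": {
    "name": "Office of Foreign Assets Control",
    "acronym": "OFAC",
    "url": "https://ofac.treasury.gov/",
    "country": "us",
    "official": true
  },
  "resources": [
    {
      "name": "entities.ftm.json",
      "url": "https://example.com/us_ofac_sdn/entities.ftm.json",
      "mime_type": "application/json",
      "size": 1048576
    }
  ]
}
```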
## Dataset query DSL

The Python library includes a small query DSL for filtering datasets by name, collection membership, or tags. It is available as `followthemoney.dataset.evaluate_query`:
```python
from followthemoney.dataset import DataCatalog, Dataset, evaluate_query

catalog = DataCatalog.from_path(Dataset, "catalog.yml")
results = evaluate_query(catalog, {"and": ["#issuer.eu", "#list.sanction"]})
```
### Grammar

A query is a recursive structure with three operators and string leaves:

```
DatasetQuery = str | list[DatasetQuery]
             | {"or": list[DatasetQuery]}
             | {"and": list[DatasetQuery]}
             | {"not": DatasetQuery}
```
#### Leaf values

| Pattern | Meaning | Example |
|---|---|---|
| `"datasetname"` | A specific dataset by slug | `"us_ofac_sdn"` |
| `"collectionname"` | A collection, expanded to its leaf datasets | `"sanctions"` |
| `"#tag"` | All datasets matching a tag | `"#issuer.eu"` |

The `#` prefix is query syntax only — the stored tag value is `issuer.eu`, not `#issuer.eu`. Collections are always expanded to their leaf datasets.
#### Operators

- `or` — union of all sub-queries: `{"or": ["a", "b"]}`
- `and` — intersection of all sub-queries: `{"and": ["a", "b"]}`
- `not` — complement (all datasets in the catalog except the matched ones): `{"not": "a"}`

A bare list is shorthand for `or`, so `["a", "b"]` is equivalent to `{"or": ["a", "b"]}`.
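To make these semantics concrete, here is a minimal, self-contained sketch of how such a query could be evaluated against an in-memory catalog. The `SimpleDataset` class and `evaluate` function are illustrative stand-ins written for this page, not the library's implementation (that is `followthemoney.dataset.evaluate_query`):

```python
from dataclasses import dataclass, field

@dataclass
class SimpleDataset:
    """Illustrative stand-in for a catalog dataset or collection."""
    name: str
    tags: set[str] = field(default_factory=set)
    children: list["SimpleDataset"] = field(default_factory=list)

    def leaves(self) -> set[str]:
        # A collection expands transitively to its leaf datasets.
        if not self.children:
            return {self.name}
        out: set[str] = set()
        for child in self.children:
            out |= child.leaves()
        return out

def evaluate(datasets: list[SimpleDataset], query) -> set[str]:
    """Return the names of the leaf datasets matched by the query."""
    universe = {d.name for d in datasets if not d.children}
    if isinstance(query, str):
        if query.startswith("#"):
            # Tag leaf: the '#' prefix is query syntax, not part of the tag.
            tag = query[1:]
            return {d.name for d in datasets if tag in d.tags and not d.children}
        for d in datasets:
            if d.name == query:
                return d.leaves() & universe
        return set()
    if isinstance(query, list):
        # A bare list is shorthand for "or".
        query = {"or": query}
    if "or" in query:
        return set().union(*(evaluate(datasets, q) for q in query["or"]))
    if "and" in query:
        return set.intersection(*(evaluate(datasets, q) for q in query["and"]))
    if "not" in query:
        return universe - evaluate(datasets, query["not"])
    raise ValueError(f"Unknown query: {query!r}")
```

The real implementation works on `DataCatalog` objects, but the set-algebra core is the same idea: leaves resolve to sets of dataset names, and the operators combine those sets.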
### Examples

A single dataset: `"us_ofac_sdn"`

Multiple datasets (bare-list shorthand for `or`): `["us_ofac_sdn", "lt_fiu_sanctions"]`

All datasets tagged as sanctions: `"#list.sanction"`

EU-issued sanctions lists: `{"and": ["#issuer.eu", "#list.sanction"]}`

Western sanctions and debarments, excluding a specific dataset and all Russian-issued sources:
```json
{"and": [
  {"or": ["#issuer.west", "#list.sanction", "#list.debarment"]},
  {"not": "lt_fiu_sanctions"},
  {"not": "#issuer.ru"}
]}
```
Exclude multiple datasets at once using nested `not`/`or`: `{"and": ["#list.sanction", {"not": {"or": ["a", "b"]}}]}`
### String syntax

For use in URLs, CLI arguments, or configuration files, queries can also be written as compact strings using `parse_query`:

```python
from followthemoney.dataset import parse_query, evaluate_query

query = parse_query("(#issuer.west|#list.sanction)-lt_fiu-#issuer.ru")
results = evaluate_query(catalog, query)
```
The string syntax uses three operators, listed by precedence (high to low):

| Operator | Meaning | JSON equivalent |
|---|---|---|
| `()` | Grouping | Nesting |
| `&` | Intersection | `and` |
| `-` | Subtraction | `and` + `not` |
| `\|` | Union | `or` |

`&` and `-` bind tighter than `|`, so `a|b&c` is parsed as `a|(b&c)`.
Subtraction desugars into `and` + `not`:

| String | JSON AST |
|---|---|
| `sanctions` | `"sanctions"` |
| `a\|b\|c` | `{"or": ["a", "b", "c"]}` |
| `a&b` | `{"and": ["a", "b"]}` |
| `a-b` | `{"and": ["a", {"not": "b"}]}` |
| `(a\|b)-c` | `{"and": [{"or": ["a", "b"]}, {"not": "c"}]}` |
| `(#issuer.west\|#list.sanction)-lt_fiu-#issuer.ru` | `{"and": [{"or": ["#issuer.west", "#list.sanction"]}, {"not": "lt_fiu"}, {"not": "#issuer.ru"}]}` |
## Relevant standards

The dataset specification in FtM is largely based on schema.org/Dataset, which also allows for SEO-friendly markup on dataset pages. Various similar specifications exist, for example the W3C's Data Catalog Vocabulary (DCAT) and the Frictionless Data Package.
All of these specifications are roughly compatible, and it should be easy to import or export FtM metadata into any of them.
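As an illustration of that compatibility, the mapping onto schema.org page markup is fairly direct. This JSON-LD fragment is a hand-written sketch, not the output of any FtM tool, and all values are invented:

```json
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "name": "US Sanctions List (SDN)",
  "description": "Illustrative mapping of FtM dataset metadata",
  "dateModified": "2024-01-01",
  "publisher": {
    "@type": "Organization",
    "name": "Office of Foreign Assets Control"
  },
  "distribution": {
    "@type": "DataDownload",
    "contentUrl": "https://example.com/us_ofac_sdn/entities.ftm.json",
    "encodingFormat": "application/json"
  }
}
```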
## Python API

For programmatic construction, loading, and filtering of catalogs, see the dataset API reference. It covers `DataCatalog` (including `from_path()` for loading catalog YAML), `Dataset`, the `DataResource`/`DataPublisher`/`DataCoverage` sub-objects, and the `evaluate_query` / `match_datasets` / `parse_query` / `validate_query` query helpers.