Datasets and data catalogs

FollowTheMoney entities are almost always grouped into datasets — the unit of provenance, versioning, and access control in most systems that use FtM data. A dataset describes where the entities came from, who published them, what time range they cover, and how the reader can obtain them.

Entity objects record dataset membership as part of their structure. A ValueEntity carries the set of datasets it belongs to as an entity-level datasets field. A StatementEntity records dataset membership per statement, so a single entity can carry values attributed to different sources. See Entity representations for the full distinction.
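To make the distinction concrete, here is an illustrative sketch of the two shapes. The field names follow the description above, but the library's exact serialization may differ:

```python
# Illustrative sketch only: shapes follow the description in the text,
# not the library's exact serialization.

# A ValueEntity records membership once, in an entity-level "datasets" field:
value_entity = {
    "id": "person-1",
    "schema": "Person",
    "properties": {"name": ["Jane Doe"]},
    "datasets": ["us_ofac_sdn", "eu_fsf"],
}

# A StatementEntity attributes each statement to its source dataset, so one
# entity can carry values from different sources:
statement_entity = {
    "id": "person-1",
    "schema": "Person",
    "statements": [
        {"prop": "name", "value": "Jane Doe", "dataset": "us_ofac_sdn"},
        {"prop": "birthDate", "value": "1970-01-01", "dataset": "eu_fsf"},
    ],
}

# The entity-level view is the union of the per-statement attributions:
datasets = {stmt["dataset"] for stmt in statement_entity["statements"]}
```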

Data catalogs

Metadata is published in two forms: as a standalone dataset index file, or as a data catalog. A catalog combines the metadata for multiple datasets into one document, with per-dataset metadata contained in a top-level datasets array.

Collections

A collection is a dataset whose purpose is to group other datasets. Its metadata looks like any other dataset — it has a name, title, description, and so on — but it also lists its member datasets under children (or the alias datasets: in catalog YAML). When a query or an export targets a collection, it resolves transitively to the leaf datasets the collection contains.

There is no separate schema or class for collections; they are regular Dataset instances that happen to contain others. The children field can be nested, so collections of collections are valid. A dataset can belong to multiple collections without duplication.
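The transitive resolution described above can be sketched in a few lines, assuming datasets are plain dicts with an optional `children` list (the names and shapes here are illustrative, not the library's API):

```python
# Toy catalog: two collections (one nested) and three leaf datasets.
CATALOG = {
    "sanctions": {"children": ["eu_lists", "us_ofac_sdn"]},
    "eu_lists": {"children": ["eu_fsf", "eu_travel_bans"]},  # nested collection
    "us_ofac_sdn": {},
    "eu_fsf": {},
    "eu_travel_bans": {},
}

def leaves(name: str) -> set[str]:
    """Resolve a dataset name to the leaf datasets it transitively contains."""
    children = CATALOG[name].get("children")
    if not children:
        return {name}  # a leaf dataset resolves to itself
    result: set[str] = set()
    for child in children:
        result |= leaves(child)  # set union de-duplicates shared members
    return result
```

Because membership resolves to a set, a dataset reachable through multiple collections appears only once in the result.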

Dataset metadata

The most important piece of metadata for any dataset is its name. Names are lowercase, underscore-separated short identifiers (e.g. us_ofac_sdn) used in the actual entity data to reference a data source. Inside the dataset metadata (or a catalog entry), the following fields can be found:

| Section | Field | Description |
| --- | --- | --- |
| dataset | `name` | The dataset's unique identifier |
| | `title` | Human-readable title |
| | `summary` | Short summary string |
| | `description` | Detailed description of the dataset, in Markdown syntax |
| | `tags` | List of tags assigned to this dataset |
| | `index_url` | URL of the dataset metadata file |
| | `version` | Latest dataset version. Each data update produces a new version ID, and version IDs can be relied on to be sortable strings. |
| | `last_change` | Timestamp when any entity in the dataset last changed. This marks when the system discovered the change, not when it was published at the source. Changes to our data cleaning tools may also be reflected here. |
| | `last_export` | Timestamp of the most recent dataset crawl and export. This is the time at which the process was started, not when the resulting data was uploaded to our public archive. |
| | `datasets` | All data sources (and enrichment datasets) included in this collection |
| | `resources` | Array of objects describing associated files, including exports and source data |
| | `updated_at` | Deprecated; use `last_export` instead |
| | `coverage` | Coverage metadata object (see below) |
| coverage | `start` | Date the dataset was first included in the database |
| | `countries` | List of the countries covered by this dataset |
| | `frequency` | One of: `never`, `hourly`, `daily`, `weekly`, `monthly`, `annually` |
| | `schedule` | A more precise (cron-style) specification of the update frequency |
| publisher | `name` | Publishing source name |
| | `acronym` | Publishing source acronym (e.g. OFAC) |
| | `description` | Detailed description of the publishing source, in Markdown syntax |
| | `url` | Link to the publisher's home page |
| | `country` | Originating country (code) of the publishing source |
| | `country_label` | Originating country (name) of the publishing source |
| | `official` | `true` if the publisher is a government or inter-governmental organization |
| resources | `name` | Identifier for this export |
| | `url` | Direct download URL where the resource file is fetched |
| | `checksum` | SHA-1 checksum of the resource contents |
| | `mime_type` | The MIME type of the resource (e.g. `text/csv`) |
| | `mime_type_label` | Human-readable label for the MIME type |
| | `title` | Title of the resource |
| | `size` | Size of the resource in bytes |
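As an illustration, here is a hypothetical metadata document combining the fields from the table above. All values are invented, and only a subset of fields is shown:

```python
# Hypothetical dataset metadata document (invented values, illustrative only).
metadata = {
    "name": "us_ofac_sdn",
    "title": "US OFAC Specially Designated Nationals List",
    "summary": "Sanctions targets designated by the US Treasury.",
    "tags": ["list.sanction", "issuer.us"],
    "coverage": {
        "countries": ["us"],
        "frequency": "daily",
    },
    "publisher": {
        "name": "Office of Foreign Assets Control",
        "acronym": "OFAC",
        "country": "us",
        "official": True,
    },
    "resources": [
        {
            "name": "source.csv",
            "mime_type": "text/csv",
            "size": 123456,
        }
    ],
}
```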

Dataset query DSL

The Python library includes a small query DSL for filtering datasets by name, collection membership, or tags. It is available as `followthemoney.dataset.evaluate_query`:

```python
from followthemoney.dataset import DataCatalog, Dataset, evaluate_query

catalog = DataCatalog.from_path(Dataset, "catalog.yml")
results = evaluate_query(catalog, {"and": ["#issuer.eu", "#list.sanction"]})
```

Grammar

A query is a recursive structure with three operators and string leaves:

```
DatasetQuery = str | list[DatasetQuery]
             | {"or": list[DatasetQuery]}
             | {"and": list[DatasetQuery]}
             | {"not": DatasetQuery}
```

Leaf values

| Pattern | Meaning | Example |
| --- | --- | --- |
| `"datasetname"` | A specific dataset by slug | `"us_ofac_sdn"` |
| `"collectionname"` | A collection, expanded to its leaf datasets | `"sanctions"` |
| `"#tag"` | All datasets matching a tag | `"#issuer.eu"` |

The # prefix is query syntax only — the stored tag value is issuer.eu, not #issuer.eu. Collections are always expanded to their leaf datasets.

Operators

`or` — union of all sub-queries:

```
{"or": ["#list.sanction", "#list.debarment"]}
```

`and` — intersection of all sub-queries:

```
{"and": ["#issuer.eu", "#list.sanction"]}
```

`not` — complement (all datasets in the catalog except the matched ones):

```
{"not": "lt_fiu_sanctions"}
```

A bare list is shorthand for `or`, so `["a", "b"]` is equivalent to `{"or": ["a", "b"]}`.
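The three operators and the bare-list shorthand can be sketched with a small recursive evaluator. The library ships the real implementation as `followthemoney.dataset.evaluate_query`; the catalog shape and tag storage below are illustrative only, and collection expansion is omitted for brevity:

```python
# Toy catalog: dataset slugs mapped to their (prefix-less) tags.
DATASETS = {
    "us_ofac_sdn": {"tags": {"list.sanction", "issuer.us"}},
    "eu_fsf": {"tags": {"list.sanction", "issuer.eu"}},
    "ru_fedsfm": {"tags": {"list.sanction", "issuer.ru"}},
}

def evaluate(query) -> set[str]:
    """Recursively resolve a DatasetQuery to a set of dataset names."""
    if isinstance(query, str):
        if query.startswith("#"):  # "#tag" leaf: all datasets with that tag
            tag = query[1:]
            return {name for name, d in DATASETS.items() if tag in d["tags"]}
        return {query} if query in DATASETS else set()
    if isinstance(query, list):  # bare list is shorthand for "or"
        query = {"or": query}
    if "or" in query:  # union of all sub-queries
        return set().union(*(evaluate(q) for q in query["or"]))
    if "and" in query:  # intersection of all sub-queries
        return set.intersection(*(evaluate(q) for q in query["and"]))
    if "not" in query:  # complement within the catalog
        return set(DATASETS) - evaluate(query["not"])
    raise ValueError(f"unsupported query: {query!r}")
```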

Examples

A single dataset:

```
"us_ofac_sdn"
```

Multiple datasets:

```
["us_ofac_sdn", "eu_fsf", "gb_hmt_sanctions"]
```

All datasets tagged as sanctions:

```
"#list.sanction"
```

EU-issued sanctions lists:

```
{"and": ["#issuer.eu", "#list.sanction"]}
```

Western sanctions and debarments, excluding a specific dataset and all Russian-issued sources:

```
{"and": [
    {"or": ["#issuer.west", "#list.sanction", "#list.debarment"]},
    {"not": "lt_fiu_sanctions"},
    {"not": "#issuer.ru"}
]}
```

Exclude multiple datasets at once using nested `not`/`or`:

```
{"and": [
    "#list.sanction",
    {"not": {"or": ["lt_fiu_sanctions", "ru_fedsfm"]}}
]}
```

String syntax

For use in URLs, CLI arguments, or configuration files, queries can also be written as compact strings using `parse_query`:

```python
from followthemoney.dataset import parse_query, evaluate_query

query = parse_query("(#issuer.west|#list.sanction)-lt_fiu-#issuer.ru")
results = evaluate_query(catalog, query)
```

The string syntax uses three operators, listed by precedence (high to low):

| Operator | Meaning | JSON equivalent |
| --- | --- | --- |
| `()` | Grouping | Nesting |
| `&` | Intersection | `and` |
| `-` | Subtraction | `and` + `not` |
| `\|` | Union | `or` |

`&` and `-` bind tighter than `|`, so `a|b&c` is parsed as `a|(b&c)`.

Subtraction desugars into `and` + `not`:

| String | JSON AST |
| --- | --- |
| `sanctions` | `"sanctions"` |
| `a\|b\|c` | `{"or": ["a", "b", "c"]}` |
| `a&b` | `{"and": ["a", "b"]}` |
| `a-b` | `{"and": ["a", {"not": "b"}]}` |
| `(a\|b)-c` | `{"and": [{"or": ["a", "b"]}, {"not": "c"}]}` |
| `(#issuer.west\|#list.sanction)-lt_fiu-#issuer.ru` | `{"and": [{"or": ["#issuer.west", "#list.sanction"]}, {"not": "lt_fiu"}, {"not": "#issuer.ru"}]}` |
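The precedence and desugaring rules above can be sketched with a toy recursive-descent parser that produces the JSON-style AST. The library's `parse_query` is the real implementation; the tokenizer here is a simplification that assumes well-formed input and slug-style names:

```python
import re

# One token per match: a leaf ("#tag", dataset, or collection slug) or an
# operator/parenthesis character.
TOKEN = re.compile(r"[#\w.]+|[()&|-]")

def parse(text: str):
    tokens = TOKEN.findall(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def atom():
        if peek() == "(":
            take()
            node = union()
            take()  # consume the closing ")" (input assumed well-formed)
            return node
        return take()  # a dataset, collection, or "#tag" leaf

    def inter():  # "&" and "-" bind tighter than "|"
        parts = [atom()]
        while peek() in ("&", "-"):
            op = take()
            right = atom()
            # subtraction desugars into "and" + "not"
            parts.append({"not": right} if op == "-" else right)
        return parts[0] if len(parts) == 1 else {"and": parts}

    def union():
        parts = [inter()]
        while peek() == "|":
            take()
            parts.append(inter())
        return parts[0] if len(parts) == 1 else {"or": parts}

    return union()
```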

Relevant standards

The dataset specification in FtM is largely based on schema.org/Dataset, which allows for SEO-friendly markup on dataset pages. Various similar specifications exist, for example the W3C's Data Catalog Vocabulary (DCAT) and the Frictionless Data Package.

All of these specifications are roughly compatible, and it should be easy to map FtM metadata to or from any of them.

Python API

For programmatic construction, loading, and filtering of catalogs, see the dataset API reference. It covers `DataCatalog` (including `from_path()` for loading catalog YAML), `Dataset`, the `DataResource`/`DataPublisher`/`DataCoverage` sub-objects, and the `evaluate_query` / `match_datasets` / `parse_query` / `validate_query` query helpers.