Datasets and catalogs

Python API for loading, constructing, and filtering FtM datasets and data catalogs. For the conceptual model and the YAML structure, see Datasets and data catalogs.

A typical usage pattern is to load a catalog from a YAML file, look up a dataset by name, and iterate its resources:

from followthemoney.dataset import DataCatalog, Dataset

catalog = DataCatalog.from_path(Dataset, "catalog.yml")
ofac = catalog.require("us_ofac_sdn")
for resource in ofac.resources:
    print(resource.name, resource.url)

To filter a catalog against a dataset query, use evaluate_query; match_datasets performs the same test against plain dataset names:

from followthemoney.dataset import evaluate_query, parse_query

query = parse_query("#list.sanction & #issuer.eu")
matched = evaluate_query(catalog, query)

Catalog

followthemoney.dataset.DataCatalog

Bases: Generic[DS]

A data catalog is a collection of datasets. It provides methods for retrieving or creating datasets, and for checking if a dataset exists in the catalog.

add(dataset)

Add a dataset to the catalog. If the dataset already exists, it will be updated.

get(name)

Get a dataset by name. Returns None if the dataset does not exist.

has(name)

Check if a dataset exists in the catalog.

make_dataset(data)

Create a new dataset from the given data. If a dataset with the same name already exists, it will be updated.

names()

Get the names of all datasets in the catalog.

require(name)

Get a dataset by name. Raises MetadataException if the dataset does not exist.
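
The difference between has, get, and require matters most when names come from user input. The sketch below splits a list of names into resolved datasets and missing names, relying only on the get contract described above (None for an unknown name). StubCatalog and resolve_names are hypothetical helpers written so the example runs without a catalog file; StubCatalog is not the library's DataCatalog.

```python
from typing import Any, Dict, Iterable, List, Tuple


class StubCatalog:
    """Hypothetical stand-in exposing the has()/get() contract described above."""

    def __init__(self, datasets: Dict[str, Any]) -> None:
        self._datasets = dict(datasets)

    def has(self, name: str) -> bool:
        return name in self._datasets

    def get(self, name: str) -> Any:
        # Mirrors the documented behaviour: None when the name is unknown.
        return self._datasets.get(name)


def resolve_names(catalog: Any, names: Iterable[str]) -> Tuple[List[Any], List[str]]:
    """Split names into (found datasets, missing names) without raising."""
    found: List[Any] = []
    missing: List[str] = []
    for name in names:
        dataset = catalog.get(name)
        if dataset is None:
            missing.append(name)
        else:
            found.append(dataset)
    return found, missing


catalog = StubCatalog({"us_ofac_sdn": {"name": "us_ofac_sdn"}})
found, missing = resolve_names(catalog, ["us_ofac_sdn", "no_such_dataset"])
print(found, missing)
```

When a missing name should abort processing instead, require is the better fit, since it raises MetadataException rather than returning None.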

Dataset

followthemoney.dataset.Dataset

A container for entities, often from one source or related to one topic. As the W3C puts it, a dataset is a set of data.

leaves()

Return all contained datasets that are not collections. The result may include the dataset itself.

to_dict()

Convert the dataset to a dictionary representation.

Dataset metadata sub-objects

followthemoney.dataset.DataResource

Bases: BaseModel

A downloadable resource that is part of a dataset.

followthemoney.dataset.DataPublisher

Bases: BaseModel

Publisher information, e.g. the government authority.

followthemoney.dataset.DataCoverage

Bases: BaseModel

Details on the temporal and geographic scope of a dataset.

Query functions

followthemoney.dataset.evaluate_query(catalog, query)

Evaluate a query AST against a catalog, returning matching leaf datasets.

The query is a dictionary-like structure using "or", "and", and "not" operators. Leaf strings are dataset names, collection names, or "#tag" selectors. A bare list is treated as an implicit "or".
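
To make the operator semantics concrete, here is a minimal sketch of how such an AST could be evaluated with set operations. It is not the library's implementation: the key layout ({"or": [...]}, {"and": [...]}, {"not": ...}), the INDEX mapping from leaf strings to leaf dataset names, and the dataset names themselves are assumptions made for illustration.

```python
from typing import Any, Dict, Set

# Hypothetical index: each leaf string (dataset name, collection name, or
# "#tag" selector) maps to the set of leaf dataset names it stands for.
INDEX: Dict[str, Set[str]] = {
    "us_ofac_sdn": {"us_ofac_sdn"},
    "eu_fsf": {"eu_fsf"},
    "sanctions": {"us_ofac_sdn", "eu_fsf"},  # a collection
    "#issuer.eu": {"eu_fsf"},  # a tag selector
}
UNIVERSE: Set[str] = {"us_ofac_sdn", "eu_fsf"}


def evaluate(query: Any) -> Set[str]:
    """Evaluate an assumed-shape query AST to a set of leaf dataset names."""
    if isinstance(query, str):
        return set(INDEX.get(query, set()))
    if isinstance(query, list):
        # A bare list is an implicit "or".
        return evaluate({"or": query})
    if "or" in query:
        result: Set[str] = set()
        for part in query["or"]:
            result |= evaluate(part)
        return result
    if "and" in query:
        result = set(UNIVERSE)
        for part in query["and"]:
            result &= evaluate(part)
        return result
    if "not" in query:
        # Negation is taken relative to all leaf datasets in the index.
        return UNIVERSE - evaluate(query["not"])
    raise ValueError(f"unsupported query node: {query!r}")


only_non_eu = evaluate({"and": ["sanctions", {"not": "#issuer.eu"}]})
either = evaluate(["us_ofac_sdn", "#issuer.eu"])
print(sorted(only_non_eu), sorted(either))
```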

followthemoney.dataset.match_datasets(query, datasets)

Test whether a set of dataset names matches a query.

Like evaluate_query but works against plain name strings instead of a full catalog. Tag selectors (#...) are not supported.

The caller is responsible for validating the query once upfront via validate_query or parse_query (which produces valid ASTs by construction). This function skips validation so it can be used in tight loops over millions of entities.
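
The validate-once, match-many pattern described above can be sketched as follows. check_query and matches are hypothetical stand-ins written for this example, assuming an "or"/"and"/"not" dictionary shape and a simple boolean reading of "matches"; they are not the library's validate_query and match_datasets. The point is only where the validation cost is paid.

```python
from typing import Any, FrozenSet, Iterable, List


def check_query(query: Any) -> None:
    """Hypothetical validator: walk the whole tree once and reject bad nodes."""
    if isinstance(query, str):
        return
    if isinstance(query, list):
        for part in query:
            check_query(part)
        return
    if isinstance(query, dict) and len(query) == 1:
        op, arg = next(iter(query.items()))
        if op in ("or", "and") and isinstance(arg, list):
            for part in arg:
                check_query(part)
            return
        if op == "not":
            check_query(arg)
            return
    raise ValueError(f"invalid query node: {query!r}")


def matches(query: Any, names: FrozenSet[str]) -> bool:
    """Hypothetical matcher: assumes the query was already checked."""
    if isinstance(query, str):
        return query in names
    if isinstance(query, list):
        return any(matches(part, names) for part in query)
    op, arg = next(iter(query.items()))
    if op == "or":
        return any(matches(part, names) for part in arg)
    if op == "and":
        return all(matches(part, names) for part in arg)
    return not matches(arg, names)  # op == "not"


def filter_entities(query: Any, entities: Iterable[dict]) -> List[dict]:
    check_query(query)  # pay the validation cost once, outside the loop
    return [e for e in entities if matches(query, frozenset(e["datasets"]))]


entities = [
    {"id": "a", "datasets": ["us_ofac_sdn"]},
    {"id": "b", "datasets": ["us_ofac_sdn", "lt_fiu"]},
    {"id": "c", "datasets": ["eu_fsf"]},
]
query = {"and": ["us_ofac_sdn", {"not": "lt_fiu"}]}
kept = [e["id"] for e in filter_entities(query, entities)]
print(kept)
```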

followthemoney.dataset.parse_query(text)

Parse a string query into a DatasetQuery AST.

Syntax: (#issuer.west|#list.sanction)-lt_fiu-#issuer.ru

Operators by precedence (high to low):

- () grouping
- & intersection, - subtraction (same precedence, left-to-right)
- | union
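
As a worked example of these precedence rules, the query from the syntax line can be mirrored with plain Python sets. The memberships below (including the ru_example_list name) are invented for illustration, and the parentheses are written out explicitly because Python's own operator precedence is not the same as the DSL's.

```python
# Hypothetical memberships: which leaf datasets each selector resolves to.
issuer_west = {"us_ofac_sdn", "eu_fsf", "lt_fiu"}
list_sanction = {"us_ofac_sdn", "eu_fsf", "ru_example_list"}
issuer_ru = {"ru_example_list"}
lt_fiu = {"lt_fiu"}

# (#issuer.west|#list.sanction)-lt_fiu-#issuer.ru
# Grouping binds first, then the two subtractions apply left to right.
result = ((issuer_west | list_sanction) - lt_fiu) - issuer_ru

# a|b&c groups as a|(b&c), because & binds tighter than |.
a, b, c = {"x"}, {"y"}, {"z"}
tight = a | (b & c)
loose = (a | b) & c  # what explicit grouping would produce instead
print(sorted(result), tight, loose)
```

The two groupings give different results here (tight keeps "x", loose is empty), which is why parentheses are worth adding whenever a query mixes | with & or -.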

followthemoney.dataset.validate_query(query)

Check that a query conforms to the DSL grammar. Raises InvalidDatasetQuery on failure.