Datasets and catalogs

Python API for loading, constructing, and filtering FtM datasets and data catalogs. For the conceptual model and the YAML structure, see Datasets and data catalogs.

A typical usage pattern is to load a catalog from a YAML file, look up a dataset by name, and iterate its resources:

from followthemoney.dataset import DataCatalog, Dataset

catalog = DataCatalog.from_path(Dataset, "catalog.yml")
ofac = catalog.require("us_ofac_sdn")
for resource in ofac.resources:
    print(resource.name, resource.url)

To filter a catalog against a dataset query, use evaluate_query. To test plain dataset names without a catalog, use match_datasets:

from followthemoney.dataset import evaluate_query, parse_query

query = parse_query("#list.sanction & #issuer.eu")
matched = evaluate_query(catalog, query)

Catalog

`followthemoney.dataset.DataCatalog`

Bases: Generic[DS]

A data catalog is a collection of datasets. It provides methods for retrieving or creating datasets, and for checking if a dataset exists in the catalog.

`add(dataset)`

Add a dataset to the catalog. If the dataset already exists, it will be updated.

`get(name)`

Get a dataset by name. Returns None if the dataset does not exist.

`has(name)`

Check if a dataset exists in the catalog.

`make_dataset(data)`

Create a new dataset from the given data. If a dataset with the same name already exists, it will be updated.

`names()`

Get the names of all datasets in the catalog.

`require(name)`

Get a dataset by name. Raises MetadataException if the dataset does not exist.
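The difference between `get`, `has`, and `require` can be illustrated with a small stand-in. This is a toy model and not the library's implementation: `ToyCatalog` is hypothetical, datasets are plain dictionaries, `MetadataException` is redefined locally, and "updated" is modelled as simple replacement, which is an assumption.

```python
from typing import Dict, List, Optional


class MetadataException(Exception):
    """Local stand-in for the library exception of the same name."""


class ToyCatalog:
    """Hypothetical in-memory catalog keyed by dataset name."""

    def __init__(self) -> None:
        self._datasets: Dict[str, dict] = {}

    def add(self, dataset: dict) -> None:
        # Assumption: adding an existing name replaces the stored entry.
        self._datasets[dataset["name"]] = dataset

    def has(self, name: str) -> bool:
        return name in self._datasets

    def get(self, name: str) -> Optional[dict]:
        # Lenient lookup: a missing dataset yields None.
        return self._datasets.get(name)

    def require(self, name: str) -> dict:
        # Strict lookup: a missing dataset is an error.
        dataset = self.get(name)
        if dataset is None:
            raise MetadataException(f"No such dataset: {name}")
        return dataset

    def names(self) -> List[str]:
        return list(self._datasets)


toy = ToyCatalog()
toy.add({"name": "us_ofac_sdn", "title": "first version"})
toy.add({"name": "us_ofac_sdn", "title": "second version"})
```

Use the lenient form when a missing dataset is an expected case, and the strict form when it indicates a configuration mistake that should stop processing.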

Dataset

`followthemoney.dataset.Dataset`

A container for entities, often from one source or related to one topic. As the W3C puts it, a dataset is a set of data.

`leaves()`

All contained datasets that are not collections. This can include the dataset itself.
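As a rough illustration of what "leaves" means, the sketch below flattens a nested collection into its non-collection members. `ToyDataset` and its `children` field are hypothetical and only model the idea. The real method may return a different container type.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyDataset:
    """Hypothetical dataset: a collection is a dataset that has children."""

    name: str
    children: List["ToyDataset"] = field(default_factory=list)

    @property
    def is_collection(self) -> bool:
        return len(self.children) > 0

    def leaves(self) -> List["ToyDataset"]:
        # A non-collection dataset is its own single leaf.
        if not self.is_collection:
            return [self]
        found: List["ToyDataset"] = []
        seen = set()
        for child in self.children:
            for leaf in child.leaves():
                # Skip leaves reachable through more than one collection.
                if leaf.name not in seen:
                    seen.add(leaf.name)
                    found.append(leaf)
        return found


a = ToyDataset("a")
b = ToyDataset("b")
inner = ToyDataset("inner", [a, b])
outer = ToyDataset("outer", [inner, a])
leaf_names = [d.name for d in outer.leaves()]
```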

`to_dict()`

Convert the dataset to a dictionary representation.

Dataset metadata sub-objects

`followthemoney.dataset.DataResource`

Bases: BaseModel

A downloadable resource that is part of a dataset.

`followthemoney.dataset.DataPublisher`

Bases: BaseModel

Publisher information, e.g. the government authority.

`followthemoney.dataset.DataCoverage`

Bases: BaseModel

Details on the temporal and geographic scope of a dataset.

Query functions

`followthemoney.dataset.evaluate_query(catalog, query)`

Evaluate a query AST against a catalog, returning matching leaf datasets.

The query is a dictionary-like structure using "or", "and", and "not" operators. Leaf strings are dataset names, collection names, or "#tag" selectors. A bare list is treated as an implicit "or".
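To make the evaluation rules concrete, here is a minimal sketch over a toy catalog. Several details are assumptions and not taken from the library: each operator node is a single-key dictionary, "and" and "or" take a list, "not" takes one sub-query and is complemented against all leaves, and the `LEAF_TAGS` and `COLLECTIONS` sample data (including the tags given to each name) are invented.

```python
from typing import Any, Dict, List, Set, Union

Query = Union[str, List[Any], Dict[str, Any]]

# Invented sample data: leaf dataset name -> tags.
LEAF_TAGS: Dict[str, Set[str]] = {
    "us_ofac_sdn": {"list.sanction", "issuer.west"},
    "example_eu_list": {"list.sanction", "issuer.eu", "issuer.west"},
    "example_ru_peps": {"list.pep", "issuer.ru"},
}
# Invented collection: collection name -> leaf dataset names.
COLLECTIONS: Dict[str, Set[str]] = {
    "sanctions": {"us_ofac_sdn", "example_eu_list"},
}


def evaluate(query: Query) -> Set[str]:
    """Return the names of the leaf datasets selected by the query."""
    if isinstance(query, str):
        if query.startswith("#"):
            tag = query[1:]
            return {name for name, tags in LEAF_TAGS.items() if tag in tags}
        if query in COLLECTIONS:
            return set(COLLECTIONS[query])
        return {query} if query in LEAF_TAGS else set()
    if isinstance(query, list):
        # A bare list behaves like an "or" over its items.
        return set().union(*(evaluate(item) for item in query))
    if len(query) != 1:
        raise ValueError("operator nodes must have exactly one key")
    op, arg = next(iter(query.items()))
    if op == "or":
        return evaluate(list(arg))
    if op == "and":
        parts = [evaluate(item) for item in arg]
        return set.intersection(*parts) if parts else set()
    if op == "not":
        # Assumption: negation is relative to every leaf in the catalog.
        return set(LEAF_TAGS) - evaluate(arg)
    raise ValueError(f"unknown operator: {op!r}")


eu_sanctions = evaluate({"and": ["#list.sanction", "#issuer.eu"]})
implicit_or = evaluate(["us_ofac_sdn", "#issuer.ru"])
non_eu_sanctions = evaluate({"and": ["sanctions", {"not": "#issuer.eu"}]})
```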

`followthemoney.dataset.match_datasets(query, datasets)`

Test whether a set of dataset names matches a query.

Like evaluate_query, but works against plain name strings instead of a full catalog. Tag selectors (#...) are not supported.

The caller is responsible for validating the query once upfront via validate_query or parse_query (which produces valid ASTs by construction). This function skips validation so it can be used in tight loops over millions of entities.
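The validate-once, match-many pattern described above might look like the following sketch. `validate` and `matches` are hypothetical stand-ins written for this example, the query shape (single-key dictionaries with "and", "or", "not") is an assumption, and rejecting tag selectors during validation is a choice made for this toy only.

```python
from typing import Any, FrozenSet, List


class InvalidQuery(Exception):
    """Hypothetical stand-in for the library's InvalidDatasetQuery."""


def validate(query: Any) -> None:
    """Walk the query once and reject anything the matcher cannot handle."""
    if isinstance(query, str):
        if query.startswith("#"):
            raise InvalidQuery("tag selectors need a catalog")
        return
    if isinstance(query, list):
        for item in query:
            validate(item)
        return
    if isinstance(query, dict) and len(query) == 1:
        op, arg = next(iter(query.items()))
        if op in ("and", "or") and isinstance(arg, list):
            for item in arg:
                validate(item)
            return
        if op == "not":
            validate(arg)
            return
    raise InvalidQuery(f"malformed query node: {query!r}")


def matches(query: Any, names: FrozenSet[str]) -> bool:
    """Hot-path check. Performs no validation of the query."""
    if isinstance(query, str):
        return query in names
    if isinstance(query, list):
        return any(matches(item, names) for item in query)
    op, arg = next(iter(query.items()))
    if op == "or":
        return any(matches(item, names) for item in arg)
    if op == "and":
        return all(matches(item, names) for item in arg)
    return not matches(arg, names)  # only "not" remains after validation


query = {"and": ["sanctions", {"not": "lt_fiu"}]}
validate(query)  # paid once, outside the loop

# Invented per-entity dataset name sets.
entity_datasets: List[FrozenSet[str]] = [
    frozenset({"sanctions", "us_ofac_sdn"}),
    frozenset({"sanctions", "lt_fiu"}),
    frozenset({"peps"}),
]
kept = [names for names in entity_datasets if matches(query, names)]
```

Keeping validation out of the inner function means each per-entity call only does membership tests and boolean combination.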

`followthemoney.dataset.parse_query(text)`

Parse a string query into a DatasetQuery AST.

Syntax: (#issuer.west|#list.sanction)-lt_fiu-#issuer.ru

Operators by precedence (high to low):

- () grouping
- & intersection, - subtraction (same precedence, left-to-right)
- | union
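The precedence rules can be checked with a small recursive-descent sketch. It is not the library's parser: `toy_parse` is hypothetical, it returns nested tuples for readability (not the DatasetQuery representation), and its tokenizer assumes names consist only of letters, digits, `_`, `.`, and `#`.

```python
import re
from typing import List, Optional, Tuple, Union

Node = Union[str, Tuple[str, "Node", "Node"]]
TOKEN = re.compile(r"\s*([#\w.]+|[()&|-])")
OPERATORS = ("(", ")", "&", "|", "-")


def tokenize(text: str) -> List[str]:
    text = text.rstrip()
    tokens: List[str] = []
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def toy_parse(text: str) -> Node:
    tokens = tokenize(text)
    pos = 0

    def peek() -> Optional[str]:
        return tokens[pos] if pos < len(tokens) else None

    def take() -> str:
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def atom() -> Node:
        if peek() is None:
            raise ValueError("unexpected end of query")
        token = take()
        if token == "(":
            node = union()  # grouping restarts at the lowest precedence
            if peek() != ")":
                raise ValueError("expected ')'")
            take()
            return node
        if token in OPERATORS:
            raise ValueError(f"unexpected {token!r}")
        return token

    def term() -> Node:
        # '&' and '-' share one level and associate left-to-right.
        node = atom()
        while peek() in ("&", "-"):
            op = take()
            node = (op, node, atom())
        return node

    def union() -> Node:
        # '|' binds loosest, so it combines whole terms.
        node = term()
        while peek() == "|":
            take()
            node = ("|", node, term())
        return node

    result = union()
    if pos != len(tokens):
        raise ValueError(f"unexpected {tokens[pos]!r}")
    return result


tree = toy_parse("(#issuer.west|#list.sanction)-lt_fiu-#issuer.ru")
```

In this sketch, `a|b&c` groups as `a|(b&c)` and `a-b&c` groups as `(a-b)&c`, which matches the precedence order listed above.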

`followthemoney.dataset.validate_query(query)`

Check that a query conforms to the DSL grammar. Raises InvalidDatasetQuery on failure.