Datasets and catalogs
Python API for loading, constructing, and filtering FtM datasets and data catalogs. For the conceptual model and the YAML structure, see Datasets and data catalogs.
A typical usage pattern is to load a catalog from a YAML file, look up a dataset by name, and iterate its resources:
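from followthemoney.dataset import DataCatalog, Dataset

# Load the catalog from a YAML file.
catalog = DataCatalog.from_path(Dataset, "catalog.yml")
# require() raises MetadataException if the dataset is not in the catalog.
ofac = catalog.require("us_ofac_sdn")
for resource in ofac.resources:
    print(resource.name, resource.url)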
For filtering a catalog against a dataset query, use evaluate_query or match_datasets:
from followthemoney.dataset import evaluate_query, parse_query
query = parse_query("#list.sanction & #issuer.eu")
matched = evaluate_query(catalog, query)
Catalog
followthemoney.dataset.DataCatalog
Bases: Generic[DS]
A data catalog is a collection of datasets. It provides methods for retrieving or creating datasets, and for checking if a dataset exists in the catalog.
add(dataset)
Add a dataset to the catalog. If the dataset already exists, it will be updated.
get(name)
Get a dataset by name. Returns None if the dataset does not exist.
has(name)
Check if a dataset exists in the catalog.
make_dataset(data)
Create a new dataset from the given data. If a dataset with the same name already exists, it will be updated.
names()
Get the names of all datasets in the catalog.
require(name)
Get a dataset by name. Raises MetadataException if the dataset does not exist.
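The difference between the lenient and strict lookups above can be sketched with a small stand-in class. ToyCatalog and the locally defined MetadataException are hypothetical illustrations, not the library's implementation; only the method names and the None-versus-exception behaviour are taken from the descriptions above, and the toy reads "updated" as "replaced".

```python
from typing import Dict, List, Optional


class MetadataException(Exception):
    """Local stand-in for the exception named above (illustration only)."""


class ToyCatalog:
    """Hypothetical minimal catalog keyed by dataset name."""

    def __init__(self) -> None:
        self._datasets: Dict[str, dict] = {}

    def add(self, dataset: dict) -> None:
        # Adding under an existing name replaces the previous entry.
        self._datasets[dataset["name"]] = dataset

    def has(self, name: str) -> bool:
        return name in self._datasets

    def names(self) -> List[str]:
        return list(self._datasets)

    def get(self, name: str) -> Optional[dict]:
        # Lenient lookup: a missing dataset yields None.
        return self._datasets.get(name)

    def require(self, name: str) -> dict:
        # Strict lookup: a missing dataset is an error.
        dataset = self.get(name)
        if dataset is None:
            raise MetadataException(f"No such dataset: {name}")
        return dataset


catalog = ToyCatalog()
catalog.add({"name": "us_ofac_sdn", "title": "OFAC SDN"})
```

In this sketch, get suits optional lookups where a missing dataset is a normal outcome, while require suits code paths that cannot continue without the dataset.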
Dataset
followthemoney.dataset.Dataset
A container for entities, often from one source or related to one topic. In the W3C's words, a dataset is a set of data.
leaves()
All contained datasets that are not collections. This can include the dataset itself.
to_dict()
Convert the dataset to a dictionary representation.
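The idea behind leaves() can be sketched as a recursive walk over nested collections. ToyDataset, its children field, and the dataset name eu_fsf are hypothetical and chosen for illustration; the sketch returns names rather than dataset objects and does not reflect how the library stores collection membership.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ToyDataset:
    """Hypothetical stand-in: a dataset with children acts as a collection."""

    name: str
    children: List["ToyDataset"] = field(default_factory=list)

    def leaves(self) -> Set[str]:
        # A dataset without children is its own (only) leaf.
        if not self.children:
            return {self.name}
        # A collection contributes the leaves of everything nested inside it.
        found: Set[str] = set()
        for child in self.children:
            found |= child.leaves()
        return found


ofac = ToyDataset("us_ofac_sdn")
eu = ToyDataset("eu_fsf")
sanctions = ToyDataset("sanctions", [ofac, eu])
everything = ToyDataset("all", [sanctions, ToyDataset("lt_fiu")])
```

Here ofac.leaves() contains only ofac's own name, while everything.leaves() flattens both levels of nesting and omits the intermediate collection names.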
Dataset metadata sub-objects
followthemoney.dataset.DataResource
Bases: BaseModel
A downloadable resource that is part of a dataset.
followthemoney.dataset.DataPublisher
Bases: BaseModel
Publisher information, e.g. the government authority.
followthemoney.dataset.DataCoverage
Bases: BaseModel
Details on the temporal and geographic scope of a dataset.
Query functions
followthemoney.dataset.evaluate_query(catalog, query)
Evaluate a query AST against a catalog, returning matching leaf datasets.
The query is a dictionary-like structure using "or", "and", and "not" operators. Leaf strings are dataset names, collection names, or "#tag" selectors. A bare list is treated as an implicit "or".
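As a rough illustration of how such a structure could be evaluated, the sketch below walks a query and returns a set of leaf dataset names. The exact shape is an assumption: it uses single-key dictionaries such as {"and": [...]} and {"not": ...}, treats negation as the complement within all catalog leaves, and replaces the catalog with two hypothetical lookup tables. The dataset name eu_fsf is also made up for the example.

```python
from typing import Dict, Set

# Hypothetical toy catalog: leaf datasets with their tags, plus one collection.
LEAF_TAGS: Dict[str, Set[str]] = {
    "us_ofac_sdn": {"list.sanction"},
    "eu_fsf": {"list.sanction", "issuer.eu"},
    "lt_fiu": {"issuer.eu"},
}
COLLECTIONS: Dict[str, Set[str]] = {"sanctions": {"us_ofac_sdn", "eu_fsf"}}


def resolve(selector: str) -> Set[str]:
    """Turn a leaf string of the query into a set of leaf dataset names."""
    if selector.startswith("#"):
        tag = selector[1:]
        return {name for name, tags in LEAF_TAGS.items() if tag in tags}
    if selector in COLLECTIONS:
        return set(COLLECTIONS[selector])
    return {selector} if selector in LEAF_TAGS else set()


def evaluate(query) -> Set[str]:
    """Evaluate the assumed {"or" | "and" | "not": ...} structure."""
    if isinstance(query, str):
        return resolve(query)
    if isinstance(query, list):
        # A bare list behaves like an implicit "or".
        return evaluate({"or": query})
    if len(query) != 1:
        raise ValueError("expected exactly one operator")
    op, arg = next(iter(query.items()))
    if op == "or":
        return set().union(*(evaluate(q) for q in arg))
    if op == "and":
        parts = [evaluate(q) for q in arg]
        return set.intersection(*parts) if parts else set()
    if op == "not":
        # Assumption: negation is the complement within all catalog leaves.
        return set(LEAF_TAGS) - evaluate(arg)
    raise ValueError(f"unknown operator: {op}")


both = evaluate({"and": ["#list.sanction", "#issuer.eu"]})
outside = evaluate({"and": ["#issuer.eu", {"not": "sanctions"}]})
```

With the toy data, both contains only eu_fsf, and outside contains only lt_fiu, because the collection name expands to its member leaves before the negation is applied.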
followthemoney.dataset.match_datasets(query, datasets)
Test whether a set of dataset names matches a query.
Like evaluate_query but works against plain name strings instead of a full catalog. Tag selectors (#...) are not supported.
The caller is responsible for validating the query once upfront via validate_query or parse_query (which produces valid ASTs by construction). This function skips validation so it can be used in tight loops over millions of entities.
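The "validate once, match many times" pattern described above might look like the following sketch. The functions check and matches are hypothetical stand-ins written for this example, not the library's validate_query and match_datasets; the query shape (single-key dictionaries with "or", "and", "not") and the entity records are assumptions, and rejecting tag selectors in check is the toy's own choice.

```python
from typing import Set


def check(query) -> None:
    """Toy validator: reject anything outside the assumed query shape."""
    if isinstance(query, str):
        if query.startswith("#"):
            raise ValueError("tag selectors need a catalog")
        return
    if isinstance(query, list):
        for item in query:
            check(item)
        return
    if isinstance(query, dict) and len(query) == 1:
        op, arg = next(iter(query.items()))
        if op in ("or", "and") and isinstance(arg, list):
            for item in arg:
                check(item)
            return
        if op == "not":
            check(arg)
            return
    raise ValueError(f"invalid query: {query!r}")


def matches(query, names: Set[str]) -> bool:
    """Toy matcher: performs no validation, so each call stays cheap."""
    if isinstance(query, str):
        return query in names
    if isinstance(query, list):
        return any(matches(q, names) for q in query)
    op, arg = next(iter(query.items()))
    if op == "or":
        return any(matches(q, names) for q in arg)
    if op == "and":
        return all(matches(q, names) for q in arg)
    return not matches(arg, names)  # op == "not", guaranteed by check()


query = {"and": ["us_ofac_sdn", {"not": "lt_fiu"}]}
check(query)  # validate once, before the loop

# Hypothetical entity records, each carrying the names of its datasets.
entities = [
    {"id": "a", "datasets": {"us_ofac_sdn"}},
    {"id": "b", "datasets": {"us_ofac_sdn", "lt_fiu"}},
    {"id": "c", "datasets": {"eu_fsf"}},
]
kept = [e["id"] for e in entities if matches(query, e["datasets"])]
```

Keeping validation out of matches means a malformed query fails early and loudly in check, rather than adding per-entity overhead inside the loop.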
followthemoney.dataset.parse_query(text)
Parse a string query into a DatasetQuery AST.
Syntax: (#issuer.west|#list.sanction)-lt_fiu-#issuer.ru
Operators by precedence (high to low):
- () grouping
- & intersection, - subtraction (same precedence, left-to-right)
- | union
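To make the precedence rules concrete, here is a minimal recursive-descent sketch that follows them. It is not the library's parser: the function names are hypothetical, the token pattern for names is an assumption, and the output is plain nested tuples of the form (operator, left, right) rather than a DatasetQuery AST.

```python
import re
from typing import List, Optional, Tuple, Union

Node = Union[str, Tuple[str, "Node", "Node"]]
# Assumption: names and tags consist of word characters, dots and '#'.
TOKEN = re.compile(r"\s*([()&|-]|[#\w.]+)")
OPERATORS = ("(", ")", "&", "|", "-")


def tokenize(text: str) -> List[str]:
    tokens: List[str] = []
    text = text.strip()
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character: {text[pos]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(text: str) -> Node:
    tokens = tokenize(text)
    pos = 0

    def peek() -> Optional[str]:
        return tokens[pos] if pos < len(tokens) else None

    def take() -> str:
        nonlocal pos
        token = peek()
        if token is None:
            raise ValueError("unexpected end of query")
        pos += 1
        return token

    def atom() -> Node:
        token = take()
        if token == "(":
            # Grouping binds tightest: parse a full expression inside.
            node = union()
            if take() != ")":
                raise ValueError("expected ')'")
            return node
        if token in OPERATORS:
            raise ValueError(f"unexpected {token!r}")
        return token

    def term() -> Node:
        # '&' and '-' share one level and fold left-to-right.
        node = atom()
        while peek() in ("&", "-"):
            op = take()
            node = (op, node, atom())
        return node

    def union() -> Node:
        # '|' binds loosest, so it is handled at the outermost level.
        node = term()
        while peek() == "|":
            take()
            node = ("|", node, term())
        return node

    tree = union()
    if peek() is not None:
        raise ValueError(f"unexpected {peek()!r}")
    return tree


tree = parse("(#issuer.west|#list.sanction)-lt_fiu-#issuer.ru")
```

For the example query, the two subtractions nest to the left around the parenthesised union, so lt_fiu is removed first and #issuer.ru second. Without parentheses, an input such as a|b&c groups as a|(b&c) in this sketch.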
followthemoney.dataset.validate_query(query)
Check that a query conforms to the DSL grammar. Raises InvalidDatasetQuery on failure.