Mappings
Mappings are a mechanism for generating entities from structured data sources, including tabular data and SQL databases. Mappings are defined in YAML files that describe how information from a table can be projected to FollowTheMoney entities.
Getting started
In order to map data to the FollowTheMoney model, you will need the following:
- a source data table
- a tool to process the mapping
- and a mapping file (to direct the tool).
Source data can be either a CSV (comma-separated values) file using the UTF-8 character encoding, or a valid connection string to connect to a SQL database. Using SQL as a source also lets you perform JOINs within the database while mapping data.
In order to execute a mapping, you need to install the ftm command-line utility.
To write a mapping file, you will first need to identify:
- the types of entities included in the dataset (e.g.
People
,Companies
,Directorships
) - the properties that describe each entity (e.g. the
name
of aCompany
, or thebirthDate
of aPerson
) - and the the field or combination of fields that can be used to generate a
key
(this used to uniquely identify each entity in the dataset). Find more details on these requirements below.
A simple mapping example
Writing a mapping file is often an iterative process, which we can gradually expand upon to refine the data model.
Below is a simple mapping file. It downloads a list of British members of parliament and transforms them into Person
entities.
gb_parliament_57:
queries:
- csv_url: http://bit.ly/uk-mps-csv
entities:
member:
schema: Person
keys:
- id
properties:
name:
column: name
The mapping file specifies a dataset name (gb_parliament_57
) and uses a single query to pull data from a CSV file (the dataset is from the excellent EveryPolitician project). The query generates a Person
entity, which maps the CSV’s id
column to a key, and the CSV’s name
column to the property name
Try saving this file to your computer and executing it with the ftm command-line tool:
ftm map brexitonians.yml
The command will output a Person
entity (formatted as a JSON object) for every unique record in the source table.
Assigning additional properties
However, the source CSV file has far more detail on each MP, from e-mail addresses to political party affiliation. To include this data in gb_parliament_57
, we need to map each CSV column to the respective property as defined in the FollowTheMoney schema. The properties vary based on the type of entity (a Person
will have different properties from a Company
).
To find out what properties exist for a particular schema, you can check out the YAML-based schema definitions on GitHub or the Model Explorer.
Here’s an updated mapping file, which maps additional columns from the CSV file to properties in the Person
schemata (email
, nationality
, and alias
).
gb_parliament_57:
queries:
- csv_url: http://bit.ly/uk-mps-csv
entities:
member:
schema: Person
keys:
- id
properties:
name:
column: name
alias:
column: sort_name
email:
column: email
nationality:
literal: GB
Generating multiple entities
Now that we’ve generated a detailed record for each MP, we might want to add their party membership. First, let’s map a party entity (Line 12 onwards):
gb_parliament_57:
queries:
- csv_url: http://bit.ly/uk-mps-csv
entities:
member:
schema: Person
keys:
- id
properties:
name:
column: name
party:
schema: Organization
keys:
- group_id
properties:
name:
column: group
When run this will create twice as many entities as before: the MPs, and parties. Note how each party is generated multiple times (once for each of its members). When you’re using the command-line, you will need to perform entity aggregation to merge these duplicates.
Creating relationships between entities
What this does not yet do, however, is explicitly create a link between each MP and their party. In FollowTheMoney parlance, links (or relationships) are just another entity type. Note how, on lines 5 and 12 in the above mapping, we are assigning a temporary name for the member
and the party
. We can use these references when generating a third entity, the Membership
:
gb_parliament_57:
queries:
- csv_url: http://bit.ly/uk-mps-csv
entities:
member:
schema: Person
keys:
- id
properties:
name:
column: name
party:
schema: Organization
keys:
- group_id
properties:
name:
column: group
membership:
schema: Membership
keys:
- id
- group_id
properties:
organization:
entity: party
member:
entity: member
When loaded into a FollowTheMoney-compatible tool such as Aleph, this mapping would now show browsable entities for the member and each party, and list the memberships on each of their profile pages. You can also export this data to a more conventional node-graph data model for use in Neo4J or Gephi.
A more realistic complex mapping
The companies registry of the Republic of Moldova is an open dataset that consists of three separate source files that, taken together, produce a graph of company information, ownership and management:
companies.csv
with companies’ details like name, id, address, incorporation date;directors.csv
with names of directors and their details;founders.csv
also with names and details of the founding entities (i.e. major shareholders).
The mapping example given below describes the relationship between the companies stored in companies.csv
and directors and founders, stored in directors.csv
and founders.csv
respectively.
md_companies:
queries:
- csv_url: http://assets.data.occrp.org/tools/aleph/fixtures/md_companies/companies.csv
# Entity definition section.
entities:
# This is an arbitrary entity name that will be used throughout this query
# section of the mapping.
company:
# Entity schema type from FollowTheMoney model.
schema: Company
# List of columns that are used as unique identifiers for each record.
# Could also be viewed as record aggregation when there are several
# records for the same company that differ only in, for example, address
# field. In this case the resulting entity will contain address values
# merged from different source data records.
keys:
- IDNO
- Denumirea_completă
# A set of properties that describe the chosen schema type.
# For each property one or several columns can be used to get value from.
# A literal string value could be given instead of a column value,
# e.g. for a country code.
properties:
name:
column: Denumirea_completă
registrationNumber:
column: IDNO
incorporationDate:
column: Data_înregistrării
address:
column: Adresa
jurisdiction:
literal: MD
legalForm:
column: Forma_org
status:
column: Statutul
- csv_url: http://assets.data.occrp.org/tools/aleph/fixtures/md_companies/directors.csv
# With this query Director records are loaded and the Directorship
# relation is defined between Directors and Companies.
entities:
# Again a Company entity is constructed using the same set of keys
# as in the query above in order to be referred to in the Directorship
# event definition.
company:
schema: Company
keys:
- Company_IDNO
- Company_Name
director:
schema: LegalEntity
keys:
- Company_Name
- Company_IDNO
- Director
properties:
name:
column: Director
# To only include records that have a non-empty `Director` column.
required: true
directorship:
schema: Directorship
# To avoid key collision between directors and directorships an additional
# literal string value is given with `key_literal`.
key_literal: Directorship
keys:
- Company_Name
- Company_IDNO
- Director
properties:
# Linking together directors and companies, where the director and
# organization properties of the Directorship interval contain references
# to the director and company entities that were constructed previously.
director:
entity: director
required: true
organization:
entity: company
required: true
role:
literal: director
# Similar to directors, in order to link founders to companies through
# an ownership event th company and founder entities have to be declared
# again in each query sectio.
- csv_url: http://assets.data.occrp.org/tools/aleph/fixtures/md_companies/founders.csv
entities:
company:
schema: Company
keys:
- Company_IDNO
- Company_Name
founder:
schema: LegalEntity
keys:
- Company_Name
- Company_IDNO
- Founder
properties:
name:
column: Founder
required: true
ownership:
schema: Ownership
key_literal: Ownership
keys:
- Company_Name
- Company_IDNO
- Founder
properties:
owner:
entity: founder
required: true
asset:
entity: company
required: true
role:
literal: founder
# In case there're extra tables with data that has to be linked companies,
# the the same set of keys is repeated and the relevant properties
# are declared.
- csv_url: http://assets.data.occrp.org/tools/aleph/fixtures/md_companies/licensed.csv
entities:
company:
schema: Company
keys:
- Company_IDNO
- Company_Name
properties:
sector:
column: Denumire
- csv_url: http://assets.data.occrp.org/tools/aleph/fixtures/md_companies/unlicensed.csv
entities:
company:
schema: Company
keys:
- Company_IDNO
- Company_Name
properties:
sector:
column: Denumire
caemCode:
column: Cod_CAEM
Generating unique keys
When creating entities from a dataset, each generated entity must be assigned a unique ID. This ID is computed from the keys
defined in the mapping file. When writing the file, it is therefore necessary to understand what combination of source columns from the original table can be used to uniquely identify an entity in the context of a dataset. Failing to do so will result in “key collisions”, a problem that results in variety or errors which are sometimes hard to diagnose:
- Entities’ properties contain values from different unrelated records (e.g. addresses, dates of birth);
- Wrong entity types (
Persons
are generated asLegalEntities
instead) - Related entities are merged together in various ways;
- Error messages are shown when trying to load the mapping (e.g.
Cannot index abstract schema
orNo common ancestor: ...
)
For example, given a table of people with their personal details the mapping below might not always be valid, because different people can have the same first and last name (and thus a key collision will happen).
entities:
person:
schema: Person
keys:
- FirstName
- LastName
properties:
firstName:
column: FirstName
lastName:
column: LastName
birthDate:
column: DoB
The solution is to include in the list of keys as many properties as is necessary and sufficient to eliminate any intersection between unrelated entities of the same type.
entities:
person:
schema: Person
keys:
- FirstName
- LastName
- DoB
properties:
firstName:
column: FirstName
lastName:
column: LastName
birthDate:
column: DoB
Keys for events (Ownership, Sanction, Family) will usually be a product of keys of the entities that such an event links together.
entities:
company:
schema: Company
keys:
- company_name
owner:
schema: Person
keys:
- owner_name
ownership:
schema: Ownership
keys:
- company_name
- owner_name
Loading a mapping from a SQL database
In the examples shown above, data has been loaded from CSV files. The mapping system can also connect to a SQL database using SQLAlchemy. Depending on the database system used, further Python drivers (such as psycopg2
or mysqlclient
) might be required for specific backends.
When loading from a SQL database, you can begin your query with a specification of the tables you wish to access, and how they should be joined:
za_cipc:
queries:
- database: postgresql://localhost/cipc
tables:
- table: za_cipc_companies
alias: companies
- table: za_cipc_directors
alias: directors
joins:
- left: companies.regno
right: directors.company_regno
Please note that when you query more than one table at the same time, all the column names used in the mapping need to be qualified with the table name, ie. companies.name
or directors.name
instead of just name
.
Mappings support substitution of environment variables. Instead of storing your database credentials to a mapping file, you might want to reference an environment variable like ${DATABASE_URI}
in the mapping file, and define the username and password externally.
Filtering source data
When loading data from a mapping, you may sometimes want to filter the data so that only part of a table is imported. FollowTheMoney mappings will only let you do equality filters; anything more complex than that should be considered data cleaning and be done prior to executing the mapping.
gb_parliament_57:
queries:
- csv_url: http://bit.ly/uk-mps-csv
filters:
group: 'Conservative'
filters_not:
gender: 'male'
entities:
member:
schema: Person
keys:
- id
properties:
name:
column: name
Extra functions for property values
Mapping a column value to a property is normally a straight copy operation:
[...]
properties:
name:
column: person_name
There are some tricks available, however:
# Setting multiple values for a property:
properties:
name:
columns:
- person_name
- maiden_name
# Merging values ad-hoc:
properties:
name:
columns:
- first_name
- last_name
join: " "
# Setting a constant value:
properties:
country:
literal: "SS"
# Defining a date format:
properties:
birthDate:
column: dob
format: "%d.%m.%Y"
In general, we are not seeking to incorporate further data cleaning functionality into the mapping process. It’s generally a good idea to design your data pipeline such that loading entities via a mapping is preceded by data cleanup step in which necessary normalisations are applied.