Mappings

Mappings are a mechanism for generating entities from structured data sources, including tabular data and SQL databases. Mappings are defined in YAML files that describe how information from a table can be projected to FollowTheMoney entities.

Getting started

In order to map data to the FollowTheMoney model, you will need the following:

a source data table
a tool to process the mapping
and a mapping file (to direct the tool).

Source data can be either a CSV (comma-separated values) file using the UTF-8 character encoding, or a valid connection string to connect to a SQL database. Using SQL as a source also lets you perform JOINs within the database while mapping data.

In order to execute a mapping, you need to install the ftm command-line utility.

To write a mapping file, you will first need to identify:

the types of entities included in the dataset (e.g. People, Companies, Directorships)
the properties that describe each entity (e.g. the name of a Company, or the birthDate of a Person)
and the the field or combination of fields that can be used to generate a key (this used to uniquely identify each entity in the dataset). Find more details on these requirements below.

A simple mapping example

Writing a mapping file is often an iterative process, which we can gradually expand upon to refine the data model.

Below is a simple mapping file. It downloads a list of British members of parliament and transforms them into Person entities.

brexitonians.yml

gb_parliament_57:
  queries:
    - csv_url: http://bit.ly/uk-mps-csv
      entities:
        member:
          schema: Person
          keys:
            - id
          properties:
            name:
              column: name

The mapping file specifies a dataset name (gb_parliament_57) and uses a single query to pull data from a CSV file (the dataset is from the excellent EveryPolitician project). The query generates a Person entity, which maps the CSV's id column to a key, and the CSV's namecolumn to the property name

Try saving this file to your computer and executing it with the ftm command-line tool:

ftm map brexitonians.yml

The command will output a Person entity (formatted as a JSON object) for every unique record in the source table.

Assigning additional properties

However, the source CSV file has far more detail on each MP, from e-mail addresses to political party affiliation. To include this data in gb_parliament_57 , we need to map each CSV column to the respective property as defined in the FollowTheMoney schema. The properties vary based on the type of entity (a Person will have different properties from a Company).

To find out what properties exist for a particular schema, you can check out the YAML-based schema definitions on GitHub or the Model Explorer.

Here's an updated mapping file, which maps additional columns from the CSV file to properties in the Person schemata (email, nationality, and alias).

brexitoids.yml

gb_parliament_57:
  queries:
    - csv_url: http://bit.ly/uk-mps-csv
      entities:
        member:
          schema: Person
          keys:
            - id
          properties:
            name:
              column: name
            alias:
              column: sort_name
            email:
              column: email
            nationality:
              literal: GB

Generating multiple entities

Now that we've generated a detailed record for each MP, we might want to add their party membership. First, let's map a party entity (Line 12 onwards):

brexicels.yml

gb_parliament_57:
  queries:
    - csv_url: http://bit.ly/uk-mps-csv
      entities:
        member:
          schema: Person
          keys:
            - id
          properties:
            name:
              column: name
        party:
          schema: Organization
          keys:
            - group_id
          properties:
            name:
              column: group

When run this will create twice as many entities as before: the MPs, and parties. Note how each party is generated multiple times (once for each of its members). When you're using the command-line, you will need to perform entity aggregation to merge these duplicates.

Creating relationships between entities

What this does not yet do, however, is explicitly create a link between each MP and their party. In FollowTheMoney parlance, links (or relationships) are just another entity type. Note how, on lines 5 and 12 in the above mapping, we are assigning a temporary name for the member and the party. We can use these references when generating a third entity, the Membership:

brexosaurs.yml

gb_parliament_57:
  queries:
    - csv_url: http://bit.ly/uk-mps-csv
      entities:
        member:
          schema: Person
          keys:
            - id
          properties:
            name:
              column: name
        party:
          schema: Organization
          keys:
            - group_id
          properties:
            name:
              column: group
        membership:
          schema: Membership
          keys:
            - id
            - group_id
          properties:
            organization:
              entity: party
            member:
              entity: member

When loaded into a FollowTheMoney-compatible tool such as OpenAleph, this mapping would now show browsable entities for the member and each party, and list the memberships on each of their profile pages. You can also export this data to a more conventional node-graph data model for use in Neo4J or Gephi.

A more realistic complex mapping

The companies registry of the Republic of Moldova is an open dataset that consists of three separate source files that, taken together, produce a graph of company information, ownership and management:

companies.csv with companies' details like name, id, address, incorporation date;
directors.csv with names of directors and their details;
founders.csv also with names and details of the founding entities (i.e. major shareholders).

The mapping example given below describes the relationship between the companies stored in companies.csv and directors and founders, stored in directors.csv and founders.csv respectively.

moldova.yml

md_companies:
  queries:
    - csv_url: http://assets.data.occrp.org/tools/aleph/fixtures/md_companies/companies.csv
      # Entity definition section.
      entities:
        # This is an arbitrary entity name that will be used throughout this query
        # section of the mapping.
        company:
          # Entity schema type from FollowTheMoney model.
          schema: Company
          # List of columns that are used as unique identifiers for each record.
          # Could also be viewed as record aggregation when there are several
          # records for the same company that differ only in, for example, address
          # field. In this case the resulting entity will contain address values
          # merged from different source data records.
          keys:
            - IDNO
            - Denumirea_completă
          # A set of properties that describe the chosen schema type.
          # For each property one or several columns can be used to get value from.
          # A literal string value could be given instead of a column value,
          # e.g. for a country code.
          properties:
            name:
              column: Denumirea_completă
            registrationNumber:
              column: IDNO
            incorporationDate:
              column: Data_înregistrării
            address:
              column: Adresa
            jurisdiction:
              literal: MD
            legalForm:
              column: Forma_org
            status:
              column: Statutul
    - csv_url: http://assets.data.occrp.org/tools/aleph/fixtures/md_companies/directors.csv
      # With this query Director records are loaded and the Directorship
      # relation is defined between Directors and Companies.
      entities:
        # Again a Company entity is constructed using the same set of keys
        # as in the query above in order to be referred to in the Directorship
        # event definition.
        company:
          schema: Company
          keys:
            - Company_IDNO
            - Company_Name
        director:
          schema: LegalEntity
          keys:
            - Company_Name
            - Company_IDNO
            - Director
          properties:
            name:
              column: Director
              # To only include records that have a non-empty `Director` column.
              required: true
        directorship:
          schema: Directorship
          # To avoid key collision between directors and directorships an additional
          # literal string value is given with `key_literal`.
          key_literal: Directorship
          keys:
            - Company_Name
            - Company_IDNO
            - Director
          properties:
            # Linking together directors and companies, where the director and
            # organization properties of the Directorship interval contain references
            # to the director and company entities that were constructed previously.
            director:
              entity: director
              required: true
            organization:
              entity: company
              required: true
            role:
              literal: director
    # Similar to directors, in order to link founders to companies through
    # an ownership event th company and founder entities have to be declared
    # again in each query sectio.
    - csv_url: http://assets.data.occrp.org/tools/aleph/fixtures/md_companies/founders.csv
      entities:
        company:
          schema: Company
          keys:
            - Company_IDNO
            - Company_Name
        founder:
          schema: LegalEntity
          keys:
            - Company_Name
            - Company_IDNO
            - Founder
          properties:
            name:
              column: Founder
              required: true
        ownership:
          schema: Ownership
          key_literal: Ownership
          keys:
            - Company_Name
            - Company_IDNO
            - Founder
          properties:
            owner:
              entity: founder
              required: true
            asset:
              entity: company
              required: true
            role:
              literal: founder
    # In case there're extra tables with data that has to be linked companies,
    # the the same set of keys is repeated and the relevant properties
    # are declared.
    - csv_url: http://assets.data.occrp.org/tools/aleph/fixtures/md_companies/licensed.csv
      entities:
        company:
          schema: Company
          keys:
            - Company_IDNO
            - Company_Name
          properties:
            sector:
              column: Denumire
    - csv_url: http://assets.data.occrp.org/tools/aleph/fixtures/md_companies/unlicensed.csv
      entities:
        company:
          schema: Company
          keys:
            - Company_IDNO
            - Company_Name
          properties:
            sector:
              column: Denumire
            caemCode:
              column: Cod_CAEM

Generating unique keys

When creating entities from a dataset, each generated entity must be assigned a unique ID. This ID is computed from the keys defined in the mapping file. When writing the file, it is therefore necessary to understand what combination of source columns from the original table can be used to uniquely identify an entity in the context of a dataset. Failing to do so will result in "key collisions", a problem that results in variety or errors which are sometimes hard to diagnose:

Entities' properties contain values from different unrelated records (e.g. addresses, dates of birth);
Wrong entity types (Persons are generated as LegalEntities instead)
Related entities are merged together in various ways;
Error messages are shown when trying to load the mapping (e.g. Cannot index abstract schema or No common ancestor: ... )

For example, given a table of people with their personal details the mapping below might not always be valid, because different people can have the same first and last name (and thus a key collision will happen).

bad.yml

entities:
  person:
    schema: Person
    keys:
      - FirstName
      - LastName
    properties:
      firstName:
        column: FirstName
      lastName:
        column: LastName
      birthDate:
        column: DoB

The solution is to include in the list of keys as many properties as is necessary and sufficient to eliminate any intersection between unrelated entities of the same type.

good.yml

entities:
  person:
    schema: Person
    keys:
      - FirstName
      - LastName
      - DoB
    properties:
      firstName:
        column: FirstName
      lastName:
        column: LastName
      birthDate:
        column: DoB

Keys for events (Ownership, Sanction, Family) will usually be a product of keys of the entities that such an event links together.

combined.yml

entities:
  company:
    schema: Company
    keys:
      - company_name
  owner:
    schema: Person
    keys:
      - owner_name
  ownership:
    schema: Ownership
    keys:
      - company_name
      - owner_name

Loading a mapping from a SQL database

In the examples shown above, data has been loaded from CSV files. The mapping system can also connect to a SQL database using SQLAlchemy. Depending on the database system used, further Python drivers (such as psycopg2 or mysqlclient) might be required for specific backends.

When loading from a SQL database, you can begin your query with a specification of the tables you wish to access, and how they should be joined:

database.yml

za_cipc:
  queries:
    - database: postgresql://localhost/cipc
      tables:
        - table: za_cipc_companies
          alias: companies
        - table: za_cipc_directors
          alias: directors
      joins:
        - left: companies.regno
          right: directors.company_regno

Please note that when you query more than one table at the same time, all the column names used in the mapping need to be qualified with the table name, ie. companies.name or directors.name instead of just name.

Mappings support substitution of environment variables. Instead of storing your database credentials to a mapping file, you might want to reference an environment variable like ${DATABASE_URI} in the mapping file, and define the username and password externally.

Filtering source data

When loading data from a mapping, you may sometimes want to filter the data so that only part of a table is imported. FollowTheMoney mappings will only let you do equality filters; anything more complex than that should be considered data cleaning and be done prior to executing the mapping.

filters.yml

gb_parliament_57:
  queries:
    - csv_url: http://bit.ly/uk-mps-csv
      filters:
        group: 'Conservative'
      filters_not:
        gender: 'male'
      entities:
        member:
          schema: Person
          keys:
            - id
          properties:
            name:
              column: name

Extra functions for property values

Mapping a column value to a property is normally a straight copy operation:

[...]
properties:
  name:
    column: person_name

There are some tricks available, however:

hacks.yml

# Setting multiple values for a property:
properties:
  name:
    columns:
    - person_name
    - maiden_name

# Merging values ad-hoc:
properties:
  name:
    columns:
    - first_name
    - last_name
    join: " "

# Setting a constant value:
properties:
  country:
    literal: "SS"

# Defining a date format:
properties:
  birthDate:
    column: dob
    format: "%d.%m.%Y"

In general, we are not seeking to incorporate further data cleaning functionality into the mapping process. It's generally a good idea to design your data pipeline such that loading entities via a mapping is preceded by data cleanup step in which necessary normalisations are applied.