Identifiers

Used for registration numbers and other codes assigned by an authority to identify an entity. This might include tax identifiers and statistical codes.

| Attribute | Value | Detail |
| --- | --- | --- |
| name | identifier | Used in schema definitions |
| label | Identifier | plural: Identifiers |
| group | identifiers | Used in search indexing to query all properties of a given type |
| matchable | yes | Suitable for use in entity matching |
| pivot | yes | Suitable for use as a pivot point for connecting to other entities |

Identifier formats

Identifier properties can specify a format, which names a stricter validator for the values assigned to that property. Validators enforce constraints on value length or verify a checksum (as defined in rigour.ids). Some formats are considered strong, meaning they belong to a well-defined, often global, numbering scheme.

| Code | Label | Strong | Description |
| --- | --- | --- | --- |
| bic, swift | BIC | | BIC (ISO 9362 Business identifier codes). |
| cnpj | CNPJ | | Cadastro Nacional de Pessoas Jurídicas, Brazilian national companies identifier |
| cpf | CPF | | Cadastro de Pessoas Físicas, Brazilian national identifier |
| figi, openfigi | FIGI | | A FIGI number for a security, as managed by OpenFIGI. |
| generic, null | Generic identifier | | Base class for identifier types. |
| iban | IBAN | | An IBAN number for a bank account. |
| imo | IMO | | An IMO number for a ship or shipping company |
| inn | INN | | Russian tax identification number. |
| isin | ISIN | | An ISIN number for a security. |
| lei | LEI | | Legal Entity Identifier (ISO 17442) |
| npi | NPI | | National Provider Identifier. |
| uei | UEI | | US GSA Unique Entity ID. |
| ogrn | OGRN | | Primary State Registration Number (Russian company registration). |
| ssn | SSN | | US Social Security Number |
| strict | Strict identifier | | A generic identifier type that applies harsh normalization. |
| uscc | USCC | | Unified Social Credit Identifier, a Chinese national identifier |
| wikidata, qid | Wikidata QID | | A Wikidata item identifier. |
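
To illustrate what a checksum-based format validator does, the sketch below checks IMO ship numbers, whose seventh digit is a check digit computed from the first six. It is a standalone illustration using only the standard library: `imo_checksum_ok` and `normalize_imo` are hypothetical names, the `IMO` prefix in the output is an assumed convention, and the validators in rigour.ids may normalize and validate differently. IMO company numbers use a different scheme and are not covered.

```python
import re
from typing import Optional


def imo_checksum_ok(digits: str) -> bool:
    """Check a 7-digit IMO ship number against its check digit."""
    if re.fullmatch(r"[0-9]{7}", digits) is None:
        return False
    # Weight the first six digits by 7, 6, 5, 4, 3, 2 and sum the products.
    total = sum(int(d) * w for d, w in zip(digits[:6], range(7, 1, -1)))
    # The last digit of that sum must equal the seventh digit.
    return total % 10 == int(digits[6])


def normalize_imo(text: str) -> Optional[str]:
    """Hypothetical normalizer: return a canonical form, or None if invalid."""
    compact = re.sub(r"[^0-9A-Za-z]", "", text).upper()
    if compact.startswith("IMO"):
        compact = compact[3:]
    if not imo_checksum_ok(compact):
        return None
    return f"IMO{compact}"


print(normalize_imo("imo 9074729"))  # IMO9074729
print(normalize_imo("9074728"))      # None (check digit does not match)
print(normalize_imo("12345"))        # None (wrong length)
```

Returning `None` for values that fail validation lets a caller decide whether to drop the value or keep the raw text.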

followthemoney.types.IdentifierType

Bases: PropertyType

Used for registration numbers and other codes assigned by an authority to identify an entity. This might include tax identifiers and statistical codes.

Since identifiers are high-value criteria when comparing two entities, numbers should only be modelled as identifiers if they are long enough to be meaningful. Four- or five-digit industry classifiers create more noise than value.
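
The effect of length can be made concrete. In the source listing for this class, `_specificity` calls `dampen(4, 10, value)`. The sketch below assumes that `dampen` is a linear ramp clamped to the range 0 to 1 between those two lengths; that behaviour is an assumption, not something this page confirms, and `length_specificity` is a hypothetical stand-in rather than the library helper.

```python
def length_specificity(value: str, short: int = 4, long: int = 10) -> float:
    """Assumed linear ramp: 0.0 at `short` characters or fewer, 1.0 at `long` or more."""
    span = max(1, long - short)
    return max(0.0, min(1.0, (len(value) - short) / span))


# Short classifier-like codes score near zero; longer registration numbers score high.
for code in ("6201", "62010", "1234567", "123456789012"):
    print(code, round(length_specificity(code), 2))
```

Under this assumption, a four-digit code contributes nothing and a five-digit code very little, which is consistent with the advice not to model short industry classifiers as identifiers.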

Source code in followthemoney/types/identifier.py
class IdentifierType(PropertyType):
    """Used for registration numbers and other codes assigned by an authority
    to identify an entity. This might include tax identifiers and statistical
    codes.

    Since identifiers are high-value criteria when comparing two entities, numbers
    should only be modelled as identifiers if they are long enough to be meaningful.
    Four- or five-digit industry classifiers create more noise than value."""

    COMPARE_CLEAN = re.compile(r"[\W_]+")
    name = const("identifier")
    group = const("identifiers")
    label = _("Identifier")
    plural = _("Identifiers")
    matchable = True
    pivot = True
    max_length = 64

    def clean_text(
        self,
        text: str,
        fuzzy: bool = False,
        format: Optional[str] = None,
        proxy: Optional["EntityProxy"] = None,
    ) -> Optional[str]:
        if format in get_identifier_format_names():
            format_ = get_identifier_format(format)
            return format_.normalize(text)
        return text

    def clean_compare(self, value: str) -> str:
        # TODO: should this be used for normalization?
        value = self.COMPARE_CLEAN.sub("", value)
        return value.lower()

    def compare(self, left: str, right: str) -> float:
        left = self.clean_compare(left)
        right = self.clean_compare(right)
        if left == right:
            return 1.0
        elif left in right or right in left:
            return len(shortest(left, right)) / len(longest(left, right))
        return 0.0

    def _specificity(self, value: str) -> float:
        return dampen(4, 10, value)

    def node_id(self, value: str) -> str:
        return f"id:{value}"

    def caption(self, value: str, format: Optional[str] = None) -> str:
        if format in get_identifier_format_names():
            format_ = get_identifier_format(format)
            return format_.format(value)
        return value
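
The comparison logic in `compare` can be tried out in isolation. The following sketch re-implements it as a plain function using only the standard library; `compare_identifiers` is a hypothetical name, and `sorted(..., key=len)` stands in for the `shortest` and `longest` helpers used in the listing above.

```python
import re

# Same pattern as COMPARE_CLEAN in the class: strips non-word characters and underscores.
COMPARE_CLEAN = re.compile(r"[\W_]+")


def clean_compare(value: str) -> str:
    """Remove punctuation, whitespace and underscores, then lowercase."""
    return COMPARE_CLEAN.sub("", value).lower()


def compare_identifiers(left: str, right: str) -> float:
    """Score two identifiers: 1.0 if equal, a length ratio if one contains the other."""
    left, right = clean_compare(left), clean_compare(right)
    if left == right:
        return 1.0
    if left in right or right in left:
        shorter, longer = sorted((left, right), key=len)
        return len(shorter) / len(longer)
    return 0.0


print(compare_identifiers("DE-123 456", "de123456"))  # 1.0
print(compare_identifiers("123456", "DE123456"))      # 0.75
print(compare_identifiers("123456", "654321"))        # 0.0
```

The containment branch gives partial credit when one value is the other plus a prefix or suffix, such as a country code, while unrelated values score zero.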