Who Are You?!?!
Entity resolution is the process of identifying and linking disparate data records that refer to the same person or organization (collectively referred to as an “entity”) across multiple data sources. The difficulty is that data sources often lack a common framework, schema, or data structure. Data is heterogeneous and can be structured, unstructured, or semi-structured. So how do we solve this?
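To make the idea concrete, here is a minimal sketch of entity resolution over records with mismatched schemas. It links records that share a normalized identifier (email or phone) using a union-find structure. The sample records, field names, and the `identifiers` and `resolve` helpers are all hypothetical and chosen for illustration; real-world resolution usually needs fuzzier matching than this.

```python
from collections import defaultdict

# Hypothetical records from three sources, each with its own schema.
records = [
    {"id": "crm-1", "name": "Jane Doe", "email": "Jane.Doe@Example.com"},
    {"id": "court-7", "full_name": "DOE, JANE", "phone": "+1 (555) 010-0199"},
    {"id": "social-3", "handle": "jdoe", "email": "jane.doe@example.com",
     "phone": "15550100199"},
    {"id": "crm-2", "name": "John Roe", "email": "john.roe@example.com"},
]


def identifiers(record):
    """Return the set of normalized (kind, value) identifiers on a record."""
    ids = set()
    if "email" in record:
        ids.add(("email", record["email"].strip().lower()))
    if "phone" in record:
        digits = "".join(ch for ch in record["phone"] if ch.isdigit())
        ids.add(("phone", digits))
    return ids


def resolve(records):
    """Group record ids that share at least one identifier (union-find)."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        # Walk up to the root, compressing the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    first_seen = {}  # identifier -> id of the first record carrying it
    for r in records:
        for ident in identifiers(r):
            if ident in first_seen:
                union(r["id"], first_seen[ident])
            else:
                first_seen[ident] = r["id"]

    clusters = defaultdict(set)
    for r in records:
        clusters[find(r["id"])].add(r["id"])
    return sorted(sorted(c) for c in clusters.values())


entities = resolve(records)
```

Here the court record and the CRM record never share a field directly. They end up in the same entity only because the social record carries both the email and the phone number, which is the kind of transitive link that a shared identifier makes possible.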
Data, Information, & Intelligence Relationship
At Prae, as former Intelligence Officers, we LOVE the simplicity of the intelligence model in Joint Publication 2-0 (Joint Intelligence). This model is the core of our philosophy and technical approach. Here, our operational environment is the public domain, where data is either voluntarily disclosed publicly (e.g. social media posts) or legally accessible by anyone (e.g. public court or property records). These raw records and/or observations are collected as data and processed into information. Information can be analyzed and interpreted by humans and/or machines for meaning. This interpreted meaning ultimately produces intelligence, which can then be actioned by humans and organizations to manage risk. Now let’s dive deeper into the technology…
Data Engineering & ETL
To handle both the volume and the complexity of processing and exploiting publicly available information (PAI), we needed an architecture that processes PAI at scale with repeatable, predictable results. The diagram below illustrates our end-to-end dataflow and data engineering processes.
Step 1: Defining Data Sources & Inputs
Data sources are ingested and stored in alignment with the medallion architecture: Bronze (raw), Silver (cleaned and validated), and Gold (aggregated and enriched). At Prae, we employ 2x Bronze Layers. Bronze Layer-1 is for the storage and archiving of all truly raw data, whereas Bronze Layer-2 holds lightly processed data that is staged for pickup by our ETL automation. Data is also organized by the type of collection operation: opportunistic (highlighted in red), hybrid (highlighted in pink), and targeted (highlighted in dark red).
Opportunistic is the collection of data in which records and/or observations related to any and all entities can be exploited for consumption and use. Sources typically include databases and flat files such as CSV, JSON, etc.
Hybrid is the collection of data in which records and/or observations can be both opportunistic and targeted. These sources are typically from third-party subscriptions such as data brokers and sensitive closed-door relationships.
Targeted is the collection of data in which records and/or observations are directed at a specific entity for consumption and use. Sources typically include tables and an entity’s postings such as social media, blogs, etc.
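One way to picture the layers and collection types described above is as a simple storage layout. The sketch below is only an illustration: the `Layer` and `Collection` enums, the `/lake` root, and the path convention are hypothetical names, not a description of the actual storage layout.

```python
from enum import Enum
from pathlib import PurePosixPath


class Layer(Enum):
    BRONZE_1 = "bronze_1"  # truly raw, archived data
    BRONZE_2 = "bronze_2"  # lightly processed, staged for ETL pickup
    SILVER = "silver"      # cleaned and validated
    GOLD = "gold"          # aggregated and enriched


class Collection(Enum):
    OPPORTUNISTIC = "opportunistic"
    HYBRID = "hybrid"
    TARGETED = "targeted"


# Order in which data is promoted through the layers.
PROMOTION_ORDER = [Layer.BRONZE_1, Layer.BRONZE_2, Layer.SILVER, Layer.GOLD]


def layer_path(root, layer, collection, source):
    """Build a storage path of the form root/layer/collection/source."""
    return PurePosixPath(root) / layer.value / collection.value / source


def next_layer(layer):
    """Return the layer that follows `layer`, or None after Gold."""
    i = PROMOTION_ORDER.index(layer)
    return PROMOTION_ORDER[i + 1] if i + 1 < len(PROMOTION_ORDER) else None


raw_path = layer_path("/lake", Layer.BRONZE_1, Collection.TARGETED, "blog_posts")
```

Keeping the collection type in the path (or in a partition column) means a downstream job can process, say, only targeted sources without scanning everything else.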
Step 2: Extraction of Metadata & Unique Identifiers
As previously mentioned, raw data is lightly processed and staged in Bronze Layer-2 for pickup. This enables us to kick off our ETL architecture and the automated bulk extraction of metadata and any relevant unique identifiers (UIDs), which are stored in a Silver Layer. Metadata is data that defines and describes the characteristics of other data so that it can be searched, retrieved, used, and managed more efficiently.
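As a rough illustration of metadata extraction, the sketch below derives a few descriptive fields from a staged payload. The field names (`source`, `size_bytes`, `sha256`, `format`, `collected_at`) and the format-detection heuristic are assumptions made for this example, not the production schema.

```python
import hashlib
import json
from datetime import datetime, timezone


def detect_format(payload: bytes) -> str:
    """Guess the payload format with a deliberately simple heuristic."""
    try:
        json.loads(payload)
        return "json"
    except ValueError:  # covers JSON and Unicode decoding errors
        pass
    first_line = payload.split(b"\n", 1)[0]
    return "csv" if b"," in first_line else "text"


def extract_metadata(source_name, payload: bytes, collected_at=None):
    """Describe a payload without interpreting its contents."""
    when = collected_at or datetime.now(timezone.utc)
    return {
        "source": source_name,
        "size_bytes": len(payload),
        # A content hash lets later stages spot exact duplicates cheaply.
        "sha256": hashlib.sha256(payload).hexdigest(),
        "format": detect_format(payload),
        "collected_at": when.isoformat(),
    }


meta = extract_metadata(
    "demo.json",
    b'{"a": 1}',
    collected_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
```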
Within metadata are identifiers, which are unique strings of numbers, letters, and/or symbols assigned to an entity, object, or data record. Ideally, your universally unique identifiers (UUIDs) are distinct enough both to dedupe records and to distinguish entities that share a common value such as a name or username. At present, we have defined and implemented 174x different UUID domains as part of our proprietary data schema and ontology model. This translates downstream to greater entity resolution accuracy and precision.
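The sketch below shows, under simplifying assumptions, how identifiers might be pulled out of free text and normalized so that they can separate entities that share a name. It covers only two hypothetical identifier domains (`email` and `phone`), and the regular expressions are intentionally loose; they are not the patterns behind the 174x domains mentioned above.

```python
import re

# Loose patterns for illustration only; real-world patterns need more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{8,}\d")


def extract_uids(text):
    """Return a set of normalized (domain, value) identifiers found in text."""
    uids = set()
    for match in EMAIL_RE.findall(text):
        uids.add(("email", match.lower()))
    for match in PHONE_RE.findall(text):
        uids.add(("phone", re.sub(r"\D", "", match)))
    return uids


# Two records with the same name but different identifiers.
smith_a = extract_uids("John Smith <jsmith@corp.example>")
smith_b = extract_uids("John Smith <john@other.example>")
same_entity_hint = bool(smith_a & smith_b)
```

Because the two "John Smith" records share no identifier, `same_entity_hint` is false, and nothing here suggests they should be merged on the name alone.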
Step 3: Transformation of Data to Information
First, all data is transformed into Apache Parquet files, a columnar data format well suited to compression and read-heavy workloads. These files are then wrapped in a storage and table management layer via Databricks Delta Tables, which allows us to take advantage of full ACID transactions. ACID is an acronym for the set of 4x key properties that define a data transaction: Atomicity, Consistency, Isolation, and Durability. If a database operation has all four of these properties, it can be called an ACID transaction. ACID transactions guarantee that each read, write, and/or modification to a table has the following properties:
Atomicity - ensures that each read, write, modify, or delete (transaction) statement is treated as a single unit. Either the entire statement is executed or none of it is. This property prevents data loss and corruption from occurring.
Consistency - ensures that transactions only make changes to tables in predefined, predictable ways. This prevents data corruption or errors from creating unintended consequences for the integrity of your tables.
Isolation - ensures that concurrent user transactions do NOT interfere with or affect one another when multiple users are reading and writing from the same table all at once. Each request can occur as though it were isolated, occurring one by one, despite actually occurring simultaneously.
Durability - ensures that changes to your data made by successfully executed transactions will be saved and persistent, even in the event of system failure.
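Atomicity and consistency are easiest to see in a small experiment. The sketch below uses Python's built-in `sqlite3` module as a stand-in, since it also provides ACID transactions; it is not Delta Lake, and the `entities` table and `load_batch` helper are hypothetical. A batch containing one row that violates a constraint is rolled back as a whole.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entities (uid TEXT PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute("INSERT INTO entities VALUES ('email:jane@example.com', 'Jane Doe')")
conn.commit()


def load_batch(conn, rows):
    """Insert every row in the batch, or none of them."""
    try:
        # As a context manager, the connection commits on success
        # and rolls back if an exception is raised inside the block.
        with conn:
            conn.executemany("INSERT INTO entities VALUES (?, ?)", rows)
        return True
    except sqlite3.IntegrityError:
        return False


# The second row reuses an existing primary key, so the batch must fail.
bad_batch = [
    ("email:john@example.com", "John Roe"),
    ("email:jane@example.com", "Duplicate Jane"),
]
ok = load_batch(conn, bad_batch)
count = conn.execute("SELECT COUNT(*) FROM entities").fetchone()[0]
```

After the failed batch, the table still holds only the original row: the valid first row of the batch was not kept (atomicity), and the primary-key rule was never broken (consistency). An in-memory database cannot demonstrate durability, which would require a file on disk.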
In addition to the Parquet/Delta Table transform, we also transform each of the 174x UUIDs into our proprietary entity schema, ontology model, and its appropriate Delta Table. Strings resembling a name are aggregated and enriched according to whether the entity is a person or an organization. Similarly, strings, numbers, and symbols such as emails, phone numbers, usernames, etc. are aggregated and enriched into UIDs.
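As a toy example of the person-versus-organization split, the sketch below labels a name string by checking for a trailing legal-entity suffix. The `ORG_SUFFIXES` list and the `classify_name` helper are hypothetical, and this heuristic is far simpler than what a proprietary ontology model would use; it is only meant to show the shape of the decision.

```python
import re

# A small, illustrative set of legal-entity suffixes (not exhaustive).
ORG_SUFFIXES = {"inc", "llc", "ltd", "corp", "co", "gmbh", "plc", "lp", "llp"}


def classify_name(raw):
    """Label a name string as 'organization', 'person', or 'unknown'."""
    tokens = re.findall(r"[A-Za-z0-9&]+", raw.lower())
    if not tokens:
        return "unknown"
    if tokens[-1] in ORG_SUFFIXES:
        return "organization"
    # Default assumption for this sketch: anything else is a person.
    return "person"


labels = [classify_name(n) for n in ("Acme Holdings, LLC", "Jane Q. Doe")]
```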
Step 4: Loading Information into a Data Lakehouse
The last step of the ETL process is the loading of information into a Gold Layer for our analysts, machine learning (ML) models, and customers to consume and interpret for meaning. For this, we leverage a Data Lakehouse, a modern data architecture that combines the flexibility, low cost, and scalability of a data lake with the structure, reliability, and governance of a data warehouse. Some key characteristics of this architecture are ACID transactions, schema enforcement, unified governance (of data lineage, version control, and access controls), and integrated, scalable storage and compute resources.
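Schema enforcement, one of the characteristics listed above, means a write is rejected when it does not match the table's declared columns and types. The sketch below imitates that behavior in plain Python; `GOLD_SCHEMA`, `SchemaError`, and `enforce_schema` are hypothetical names, and a real lakehouse performs this check inside the table format rather than in application code.

```python
# Hypothetical Gold Layer schema: column name -> expected Python type.
GOLD_SCHEMA = {"entity_id": str, "entity_type": str, "uid_count": int}


class SchemaError(ValueError):
    """Raised when a row does not match the declared schema."""


def enforce_schema(row, schema=GOLD_SCHEMA):
    """Return the row unchanged if it matches the schema, else raise."""
    missing = schema.keys() - row.keys()
    extra = row.keys() - schema.keys()
    if missing or extra:
        raise SchemaError(
            f"missing={sorted(missing)} unexpected={sorted(extra)}"
        )
    for column, expected in schema.items():
        # Note: bool is a subclass of int in Python; a stricter check
        # would reject it explicitly.
        if not isinstance(row[column], expected):
            raise SchemaError(f"{column} must be {expected.__name__}")
    return row


good_row = enforce_schema(
    {"entity_id": "e-001", "entity_type": "person", "uid_count": 3}
)
```

A row with a misspelled column or a string where a count belongs raises `SchemaError` before it can reach the table, which is the point of enforcing the schema at write time.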
It’s also worth noting that a Data Lakehouse contains a few key technology advancements and advantages, namely: metadata layers, new query engine designs providing high-performance SQL execution, and optimized access for data science and ML models. In turn, metadata layers enable other features such as streaming I/O support (which can reduce the need for message buses like Apache Kafka), time travel to old table versions, schema enforcement & evolution, and data validation.
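Time travel follows naturally once every write is recorded as a new table version. The sketch below is a simplified, in-memory illustration of that idea using an append-only commit log; the `VersionedTable` class is hypothetical and does not reflect how any particular lakehouse product stores its metadata.

```python
class VersionedTable:
    """An append-only table that can be read as of any past version."""

    def __init__(self):
        self._log = []  # one entry per commit, each a tuple of rows

    def append(self, rows):
        """Commit a batch of rows and return the new version number."""
        self._log.append(tuple(rows))
        return len(self._log) - 1

    def read(self, version=None):
        """Return all rows as of `version` (default: the latest)."""
        latest = len(self._log) - 1
        if version is None:
            version = latest
        if not -1 <= version <= latest:
            raise IndexError(f"no such version: {version}")
        rows = []
        for commit in self._log[: version + 1]:
            rows.extend(commit)
        return rows


table = VersionedTable()
v0 = table.append([{"entity_id": "e-001"}])
v1 = table.append([{"entity_id": "e-002"}, {"entity_id": "e-003"}])
as_of_v0 = table.read(version=v0)
latest = table.read()
```

Reading at `v0` returns only the first commit even after later writes, because earlier log entries are never modified. The same log is also what makes auditing and rollback possible.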