Data extraction and cleaning¶
ETL is handled by drake
, which is expected to be called from the top (or
root) directory of this repository. All paths assume that drake was invoked
from the root of the repository.
There are four conceptual steps: extract, clean, deduplicate, and process.
Extract raw data¶
The scripts here expect to find a zipped data dump in a data
directory in
the repository root. It will extract that, and restore into the database. The
database configuration is specified within a config
directory in the
repository root.
Clean relevant tables¶
The raw tables, as restored from the database dump, get placed into the public
schema of the database. Before being used by the pipeline, all tables go through
a cleaning and normalization process. All scripts contained within the
etl/data_cleaning
directory are executed at this point.
Deduplicate identities¶
Deduplication is handled by superdeduper.
A SQL script creates a master “entries” table by combining all relevant columns
from all the data sources, and then superdeduper is called with the saved
configuration in etl/dedupe/config.yaml
. After a dedupe_id
is appended
to the entries table, the apply_results.py
script goes back to the
specified tables in the clean schema to append a dedupe_id column.
Further processing¶
Finally, a few SQL scripts are used to create computed tables for convenience.
These include things like a timeline of events. All scripts contained within the
etl/data_processing
directory are executed at this point.