Configuration

A number of configuration files are used to set up database credentials and specify various names and parameters.

Credentials

Several configuration files contain confidential information (such as credentials) and cannot be committed to the repository. Example files are provided in the config directory for both database and S3 credentials. To use them, remove the example_ prefix from each file name and fill in your own values.
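The prefix-removal step can be scripted. The following is a minimal sketch, not part of the pipeline: the function name activate_example_configs is hypothetical, and it assumes every example file in the config directory starts with example_. It copies rather than renames, so the tracked example files stay in place, and it never overwrites a file that is already populated.

```python
import shutil
from pathlib import Path


def activate_example_configs(config_dir, prefix="example_"):
    """Copy each example_* file to the same name without the prefix.

    Hypothetical helper: existing targets are left untouched so that
    credentials already filled in are never overwritten.
    """
    created = []
    for example in sorted(Path(config_dir).glob(prefix + "*")):
        if not example.is_file():
            continue
        # Strip the prefix from the file name, keeping the directory.
        target = example.with_name(example.name[len(prefix):])
        if target.exists():
            continue
        shutil.copy(example, target)
        created.append(target)
    return created
```

Calling activate_example_configs("config") from the repository root would return the list of newly created files, which then still need to be populated by hand.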

Constants

The file pipeline/default_profile.yaml contains a number of constants that are used throughout the pipeline codebase, including the paths to the credential files above as well as specific table and column names. It generally does not need to be modified for use with Johnson County’s data.

Experiment configuration

The yamls folder contains the experiment configurations that get passed to the pipeline preprocessing and modeling scripts. Each yaml file within that folder specifies a single experimental configuration. See the yamls/default_sample.yaml file for an example. The sample configuration breaks out several important categories as block comments:

  • Type of Experiment rarely changes; the unit entity is a ‘person’.
  • Temporal parameters specify the time blocking, including start dates, the prediction window (in days), and the “fake today” (the date on which the experiment is simulated to run).
  • Labeling details specify which labels to consider as outcomes. The current labels are all based on interactions with a given data provider. Label names are underscore-delimited, with each component naming a data provider. All but the final component specify the population of interest, which currently excludes anyone who has interacted with any of the other providers (this is something that we want to change). The final component specifies the outcome of interest. For example, jims_mh_jims covers the population that has previously interacted with criminal justice and mental health (but not EMS), and labels as positive those who have a new interaction with the criminal justice system.
  • Feature Selection specifies the feature groups that should be included in the models.
  • Model selection specifies the models to run and the parameter values to try for each. Any model parameter may be a list; the lists are combined as a cross-product so that every possible combination gets tested.
  • Output file details specify where the outputs should go.
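Three of the categories above can be illustrated in a few lines of Python. This is a sketch under stated assumptions, not the pipeline's own code: the function names are hypothetical, the outcome window is assumed to run from the fake today to that date plus the prediction window, and the parameter names in the usage note are illustrative only.

```python
from datetime import date, timedelta
from itertools import product


def label_window(fake_today, prediction_window_days):
    """Return (start, end) of the outcome window.

    Assumption: outcomes are counted from the fake today up to
    prediction_window_days later; the pipeline may define it differently.
    """
    return fake_today, fake_today + timedelta(days=prediction_window_days)


def parse_label(label):
    """Split a label such as 'jims_mh_jims' into (population, outcome).

    All but the final underscore-delimited component name the population
    of interest; the final component names the outcome of interest.
    """
    *population, outcome = label.split("_")
    if not population:
        raise ValueError(f"label {label!r} needs a population and an outcome")
    return tuple(population), outcome


def expand_param_grid(params):
    """Yield one dict per combination in the cross-product of parameters.

    Scalar values are treated as single-element lists.
    """
    names = list(params)
    value_lists = [v if isinstance(v, list) else [v] for v in params.values()]
    for combo in product(*value_lists):
        yield dict(zip(names, combo))
```

For instance, parse_label("jims_mh_jims") separates the population components ("jims", "mh") from the outcome "jims", and expand_param_grid({"n_estimators": [10, 100], "max_depth": [3, 5]}) yields four parameter sets, one per combination.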

Evaluation configuration

The variables of interest for evaluation may be specified in the pipeline/evaluation/eval_profile.yaml file. These determine which metrics get calculated for each model that was run.