pipeline.modeling package

Submodules

pipeline.modeling.feature_model_grabber module

class pipeline.modeling.feature_model_grabber.FeatureModelGrabber(fake_today, prediction_window, config, feature_timestamp, s3_profile, discard_model)[source]
add_labels_to_feature_sets(feature_sets, labels)[source]
combine_models_labels_features(models, labelled_features)[source]
compile_results(res, bulk_model_list, force_write=False)[source]

After a model is run, compile the model information and the predictions and temporarily stash them in CSV files. If more than 49 models have been stashed, or if this is the last model to be run, copy the CSV files to the models and predictions tables, remove the files, and return an empty list.

Parameters:
  • self (FeatureModelGrabber) – inherit object properties
  • res (dict) – dictionary of model information
  • bulk_model_list (list) – list of model info to be saved to database
  • force_write (bool) – should the stashed info be saved to the database regardless of length?
Returns:

list of models run since last write

Return type:

list
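
The stash-then-flush behaviour described for compile_results can be sketched roughly as follows. This is a minimal illustration, not the package's implementation: the function name compile_results_sketch, the flush callback, and the max_stashed parameter are hypothetical stand-ins, and the CSV stashing and database copy are reduced to a single callback.

```python
def compile_results_sketch(res, bulk_model_list, flush, force_write=False, max_stashed=49):
    """Stash one model's results; flush when the stash is full or a write is forced.

    `flush` is a hypothetical callback standing in for "copy the CSVs to the
    database and remove them". `max_stashed` mirrors the documented threshold
    of 49 stashed models.
    """
    bulk_model_list.append(res)
    if len(bulk_model_list) > max_stashed or force_write:
        flush(bulk_model_list)
        return []  # nothing is pending after a write
    return bulk_model_list  # models run since the last write
```

A caller would pass the returned list back in on the next call and set force_write=True for the last model, so that a partial batch is still written.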

connect_to_s3()[source]

Open a connection to s3 and return the resource objects and a dictionary of s3 configuration details.

Returns:s3 resource and s3_config
Return type:boto3 resource and dict
csv_to_database(file_name, table_name)[source]

Given a csv name and a database table name, append the contents of the csv to the database table and remove the csv.

Parameters:
  • file_name (str) – name of csv to save to database
  • table_name (str) – name of the database table to copy to
Returns:

None

Return type:

None
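
As an illustration of the append-then-remove pattern, the sketch below uses the standard-library sqlite3 module. The database backend used by the package is not stated here, so sqlite3, the explicit conn argument, and the assumption that the csv starts with a header row are all choices made for this example only.

```python
import csv
import os
import sqlite3


def csv_to_database_sketch(conn, file_name, table_name):
    """Append the rows of a csv to an existing table, then delete the csv.

    Assumes (for this sketch) that the first csv row holds column names that
    match columns of `table_name`, and that `conn` is a sqlite3 connection.
    """
    with open(file_name, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        rows = list(reader)
    columns = ", ".join('"{}"'.format(name) for name in header)
    placeholders = ", ".join("?" for _ in header)
    statement = 'INSERT INTO "{}" ({}) VALUES ({})'.format(table_name, columns, placeholders)
    conn.executemany(statement, rows)
    conn.commit()
    os.remove(file_name)  # the csv is only a temporary stash
```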

export_data_table(table, end_date, label, feature_names)[source]

Save a data set as an HDF table for later reuse.

Parameters:
  • table (pandas DataFrame) – the DataFrame to save
  • end_date (a date format of some kind) – end of labeling period
  • label (str) – name of the column containing labels
  • feature_names (list) – names of the columns containing features
Returns:

the prefix of the HDF filename

Return type:

str

export_metadata(end_date, label, feature_names)[source]

Construct and export metadata for a matrix. Return a unique identifier based on this metadata to be used as a filename.

Parameters:
  • end_date (str) – the end date of the labeling period for the matrix
  • label (str) – name of the column containing labels
  • feature_names (list) – names of the columns containing features
Returns:

unique identifier for the matrix

Return type:

str

extract_train_x(feature_set, full_feature_table)[source]
generate_feature_group(feature_sets)[source]
generate_feature_group_combinations(feature_groups)[source]
generate_model_parameter_list()[source]
generate_uuid(metadata)[source]

Generate a unique identifier given a dictionary of matrix metadata.

Parameters:metadata (dict) – metadata for the matrix
Returns:unique name for the file
Return type:str
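
One common way to derive a stable identifier from a metadata dictionary is to hash a canonical serialization of it. The sketch below assumes that approach. The use of JSON with sorted keys and an MD5 hex digest is an assumption for illustration and may differ from what generate_uuid actually does.

```python
import hashlib
import json


def generate_uuid_sketch(metadata):
    """Return a deterministic identifier for a metadata dictionary.

    Keys are sorted so that two dictionaries with the same content always
    produce the same identifier; non-JSON values (e.g. dates) are stringified.
    """
    canonical = json.dumps(metadata, sort_keys=True, default=str)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()
```

Because the identifier depends only on the metadata content, rebuilding a matrix with the same end date, label, and feature names would map to the same filename under this scheme.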
get_feature_sets(feature_names_dict)[source]
iterate_train_test(iterable)[source]

Iterate over prediction window start dates, returning the start dates for train and test data for the current model.

Parameters:prediction_window_start_dates (list) – list of prediction window start dates
Returns:train date and test date
Return type:
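
The exact pairing rule is not documented above. As one plausible reading, the sketch below pairs each start date with the next one, so that a model is trained on one window and tested on the following window. Both the pairing rule and the function name are assumptions.

```python
def iterate_train_test_sketch(prediction_window_start_dates):
    """Yield (train_date, test_date) pairs from consecutive start dates.

    Assumption for this sketch: the test window is the one immediately
    after the training window.
    """
    ordered = sorted(prediction_window_start_dates)
    for train_date, test_date in zip(ordered, ordered[1:]):
        yield train_date, test_date
```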
load_feature_name_dictionary()[source]
load_table(train_or_test, feature_timestamp)[source]
load_test_table()[source]
load_train_table()[source]
parameter_generator(params_lst)[source]
pickle_results(res_dict, clf)[source]

Pickle the model object locally, upload it to s3, and delete the local copy.

Parameters:
  • self (FeatureModelGrabber) – inherit object properties
  • res_dict (dict) – dictionary of model information
  • clf (model) – model object
Returns:

path to pickle file

Return type:

str
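
The pickle, upload, and delete sequence can be sketched as below. The upload callback is a hypothetical stand-in for the s3 upload step, and the "model_id" key used to name the file is an assumed field of res_dict, not a documented one.

```python
import os
import pickle


def pickle_results_sketch(res_dict, clf, upload, directory="."):
    """Pickle `clf` locally, hand it to `upload`, and remove the local file.

    `upload(local_path, key_name)` is a hypothetical callback; in the real
    pipeline this step would copy the file to s3.
    """
    key_name = "{}.pkl".format(res_dict["model_id"])  # assumed naming scheme
    local_path = os.path.join(directory, key_name)
    with open(local_path, "wb") as handle:
        pickle.dump(clf, handle)
    try:
        upload(local_path, key_name)
    finally:
        os.remove(local_path)  # clean up even if the upload fails
    return key_name
```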

run(labels)[source]
upload_file_to_s3(key_name, bucket, local_file_path)[source]
write_matrix_pairs(train_test_combos)[source]

Given a list of train-test pairs, write them locally, check s3 for an existing set, combine the two sets, remove duplicates, and upload the new copy to s3.

Parameters:train_test_combos (list) – list of dictionaries with keys ‘train’ and ‘test’ with filenames of HDF matrices as values
Returns:None
Return type:None
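
The combine-and-deduplicate step can be illustrated in isolation, leaving out the local write and the s3 round trip. The helper name merge_matrix_pairs is hypothetical. The dictionaries follow the documented shape, with 'train' and 'test' keys holding HDF matrix filenames.

```python
def merge_matrix_pairs(existing_pairs, new_pairs):
    """Combine two lists of train-test pairs, keeping the first copy of each pair."""
    seen = set()
    merged = []
    for pair in existing_pairs + new_pairs:
        key = (pair["train"], pair["test"])
        if key not in seen:
            seen.add(key)
            merged.append(pair)
    return merged
```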
write_to_csv(df, column_order, file_name)[source]

Given a dataframe, a specific column order, and a csv filename, enforce the column order on the dataframe and then append the data to the specified csv file.

Parameters:
  • self (FeatureModelGrabber) – inherit object properties
  • df – data output by the modeling process containing either model information or predictions
  • column_order (list) – the order of columns in the relevant database table
  • file_name (str) – the name of the csv file to write to
Returns:

None

Return type:

None
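
The idea of enforcing a column order before appending can be sketched with the standard-library csv module. For simplicity this example takes a list of row dictionaries instead of a DataFrame, and it writes a header row when the file is new. Both choices, and the function name, are assumptions made for illustration.

```python
import csv
import os


def write_rows_to_csv(rows, column_order, file_name):
    """Append `rows` (a list of dicts) to `file_name` with columns in `column_order`."""
    is_new_file = not os.path.exists(file_name) or os.path.getsize(file_name) == 0
    with open(file_name, "a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=column_order)
        if is_new_file:
            writer.writeheader()  # header only once, on the first write
        writer.writerows(rows)
```

Fixing the column order at write time matters because a bulk copy into a database table typically matches values to columns by position.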

pipeline.modeling.feature_model_grabber.chunker(seq, size)[source]
pipeline.modeling.feature_model_grabber.write_dataframe_chunks(df_name, df)[source]

pipeline.modeling.models module

class pipeline.modeling.models.ConfigError[source]
class pipeline.modeling.models.Model(model_name, model_params, label, training_data, testing_data, cols_to_use, config)[source]
compute_confusion_matrix(predicted_labels, labels)[source]
define_model(model, parameters, n_cores=0)[source]
get_data(df, undersample=False)[source]
get_feature_importance(clf, model_name)[source]
get_test_data()[source]
get_training_data()[source]
run()[source]

pipeline.modeling.run module

pipeline.modeling.run.main(config_file_name, feature_timestamp, discard_models)[source]

Replaces template placeholder with values.

Parameters:config_file_name (str) – path to config yaml file
Returns:None – always returns None as default
Return type:None

Module contents