pipeline.modeling package
Submodules
pipeline.modeling.feature_model_grabber module
-
class pipeline.modeling.feature_model_grabber.FeatureModelGrabber(fake_today, prediction_window, config, feature_timestamp, s3_profile, discard_model)
-
compile_results(res, bulk_model_list, force_write=False) After a model is run, compile the model information and the predictions and temporarily stash them in csvs. If more than 49 models have been stashed or this is the last model to be run, copy the csvs to the models and predictions tables, remove the csvs, and return an empty list.
Parameters: - self (FeatureModelGrabber) – inherit object properties
- res (dict) – dictionary of model information
- bulk_model_list (list) – list of model info to be saved to database
- force_write (bool) – should the stashed info be saved to the database regardless of length?
Returns: list of models run since last write
Return type: list
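The stash-and-flush behaviour described for compile_results could be sketched roughly as below. This is an illustration only, not the package's implementation: stash_result, the flush callback, and max_stashed are hypothetical names, and the sketch assumes the count of 49 includes the model that was just run.

```python
def stash_result(res, bulk_model_list, flush, force_write=False, max_stashed=49):
    """Add res to the stash; flush when the stash exceeds max_stashed or when forced.

    `flush` is a hypothetical callable standing in for the step that copies the
    stashed csvs into the database tables and removes them.
    """
    stashed = bulk_model_list + [res]
    if len(stashed) > max_stashed or force_write:
        flush(stashed)
        return []  # everything has been written, so nothing remains stashed
    return stashed
```

The caller keeps passing the returned list back in, and sets force_write=True on the last model so that a partially filled stash is still written.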
-
connect_to_s3() Open a connection to s3 and return the resource objects and a dictionary of s3 configuration details.
Returns: s3 resource and s3_config
Return type: boto3 resource and dict
-
csv_to_database(file_name, table_name) Given a csv name and a database table name, append the contents of the csv to the database table and remove the csv.
Parameters: - file_name (str) – name of csv to save to database
- table_name (str) – name of the database table to copy to
Returns: None
Return type: None
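As a rough sketch of the append-then-remove pattern, the example below uses the standard library's sqlite3 as a stand-in database. The function name csv_to_table, the explicit connection argument, and the assumption that the csv has no header row and matches the table's column order are all illustrative; the real method may use a different database and a bulk-copy command.

```python
import csv
import os
import sqlite3


def csv_to_table(conn, file_name, table_name):
    """Append every row of a headerless csv to table_name, then delete the csv."""
    # Table names cannot be bound as SQL parameters, so validate before interpolating.
    if not table_name.isidentifier():
        raise ValueError(f"unsafe table name: {table_name!r}")
    with open(file_name, newline="") as handle:
        rows = list(csv.reader(handle))
    if rows:
        placeholders = ", ".join("?" for _ in rows[0])
        conn.executemany(f"INSERT INTO {table_name} VALUES ({placeholders})", rows)
        conn.commit()
    # Remove the csv only after the rows have been committed.
    os.remove(file_name)
```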
-
export_data_table(table, end_date, label, feature_names) Save a data set as an HDF table for later reuse.
Parameters: - table (pandas DataFrame) – the DataFrame to save
- end_date (a date format of some kind) – end of labeling period
- label (str) – name of the column containing labels
- feature_names (list) – names of the columns containing features
Returns: the prefix of the HDF filename
Return type: str
-
export_metadata(end_date, label, feature_names) Construct and export metadata for a matrix. Return a unique identifier based on this metadata to be used as a filename.
Parameters: - end_date (str) – the end date of the labeling period for the matrix
- label (str) – name of the column containing labels
- feature_names (list) – names of the columns containing features
Returns: unique identifier for the matrix
Return type: str
-
generate_uuid(metadata) Generate a unique identifier given a dictionary of matrix metadata.
Parameters: metadata (dict) – metadata for the matrix
Returns: unique name for the file
Return type: str
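One common way to derive a stable identifier from a metadata dictionary is to hash a canonical serialization of it. The sketch below assumes that approach; metadata_to_identifier is a hypothetical name, and the choice of JSON and SHA-256 is an assumption rather than what generate_uuid is known to do.

```python
import hashlib
import json


def metadata_to_identifier(metadata):
    """Return a hex digest that is identical for equal metadata dictionaries."""
    # sort_keys makes the serialization independent of key insertion order;
    # default=str lets dates and other non-JSON values be serialized as text.
    canonical = json.dumps(metadata, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the digest depends only on the metadata, two matrices built with the same end date, label, and features would map to the same filename under this scheme.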
-
iterate_train_test(iterable) Iterate over prediction window start dates, returning the start dates for train and test data for the current model.
Parameters: iterable (list) – list of prediction window start dates
Returns: train date and test date
Return type:
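The documentation does not say how train and test dates are paired. Purely as an illustration, the sketch below assumes each start date serves as the train date and the next one as the test date; pair_train_test is a hypothetical name and the pairing rule is an assumption.

```python
def pair_train_test(start_dates):
    """Yield (train_date, test_date) pairs from consecutive start dates.

    Assumption: dates are sortable (e.g. ISO-format strings or date objects)
    and each window is tested on the window that immediately follows it.
    """
    ordered = sorted(start_dates)
    for train_date, test_date in zip(ordered, ordered[1:]):
        yield train_date, test_date
```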
-
pickle_results(res_dict, clf) Pickle the model object locally, upload it to s3, and delete the local copy.
Parameters: - self (FeatureModelGrabber) – inherit object properties
- res_dict (dict) – dictionary of model information
- clf (model) – model object
Returns: path to pickle file
Return type: str
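The pickle, upload, delete sequence might look like the sketch below. The names pickle_and_upload and upload are hypothetical: upload stands in for whatever s3 call the class makes, and is passed in as a callable so the example stays independent of any s3 client.

```python
import os
import pickle


def pickle_and_upload(clf, local_path, upload):
    """Pickle clf to local_path, hand the file to upload, then delete it."""
    with open(local_path, "wb") as handle:
        pickle.dump(clf, handle)
    try:
        # `upload` is an assumed callable that takes the local file path.
        upload(local_path)
    finally:
        # Remove the local copy even if the upload raises.
        os.remove(local_path)
    return local_path
```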
-
write_matrix_pairs(train_test_combos) Given a list of train-test pairs, write them locally, check s3 for an existing set, combine the sets, remove duplicates, and upload new copy to s3.
Parameters: train_test_combos (list) – list of dictionaries with keys ‘train’ and ‘test’ with filenames of HDF matrices as values
Returns: None
Return type: None
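The combine-and-deduplicate step can be sketched without any s3 access. The function name merge_matrix_pairs is hypothetical, and the sketch assumes two pairs are duplicates when both their 'train' and 'test' filenames match.

```python
def merge_matrix_pairs(existing, new):
    """Combine two lists of {'train': ..., 'test': ...} dicts, dropping duplicates."""
    seen = set()
    merged = []
    for pair in existing + new:
        key = (pair["train"], pair["test"])
        if key not in seen:  # keep only the first occurrence of each pair
            seen.add(key)
            merged.append(pair)
    return merged
```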
-
write_to_csv(df, column_order, file_name) Given a dataframe, a specific column order, and a csv filename, enforce the column order on the dataframe and then append the data to the specified csv file.
Parameters: - self (FeatureModelGrabber) – inherit object properties
- df (pandas DataFrame) – data output by the modeling process containing either model information or predictions
- column_order (list) – the order of columns in the relevant database table
- file_name (str) – the name of the csv file to write to
Returns: None
Return type: None
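To show the enforce-order-then-append idea without depending on pandas, the sketch below uses the standard library's csv module with rows given as dictionaries; append_rows_to_csv is a hypothetical name and the dict rows are a stand-in for the dataframe. With pandas, a comparable call would be df[column_order].to_csv(file_name, mode="a", header=False, index=False).

```python
import csv


def append_rows_to_csv(rows, column_order, file_name):
    """Append dict rows to file_name with columns written in column_order."""
    # Mode "a" appends, so repeated calls accumulate rows in the same file.
    with open(file_name, "a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=column_order, extrasaction="ignore")
        # No header is written; the column order is fixed by the target table.
        writer.writerows(rows)
```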
-