save progress of experiment steps
This applies only to `RunEnvironment` and its subclasses.
- save datastore as pickle (on `__del__`): naming like `checkpoint_<class_name>.pkl`
- query if class is already executed (on `__init__`)
- skip, if already executed (directly go from `__init__` to `__del__`)
- force button for rerun (regardless of whether a checkpoint is available or not)
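The behaviour listed above could be sketched roughly as follows. This is only a minimal illustration, not the actual `RunEnvironment` code: the class `CheckpointedStage`, its `run()` hook, the `close()` helper, and the `checkpoint_dir`, `force` and `skipped` names are all hypothetical. Saving is routed through an explicit `close()` that `__del__` merely calls, because relying on `__del__` alone is fragile at interpreter shutdown.

```python
import pickle
from pathlib import Path


class CheckpointedStage:
    """Hypothetical stand-in for a RunEnvironment-like base class."""

    def __init__(self, checkpoint_dir, force=False):
        # File name follows the proposed scheme checkpoint_<class_name>.pkl
        name = f"checkpoint_{type(self).__name__}.pkl"
        self.checkpoint_path = Path(checkpoint_dir) / name
        self.data_store = {}
        self.skipped = False
        self._closed = False
        if self.checkpoint_path.exists() and not force:
            # Stage was already executed: restore the data store and skip run()
            with open(self.checkpoint_path, "rb") as f:
                self.data_store = pickle.load(f)
            self.skipped = True
        else:
            self.run()

    def run(self):
        """Subclasses place the actual work of the stage here."""

    def close(self):
        # getattr guards against a partially initialised object
        if getattr(self, "_closed", True):
            return
        self._closed = True
        if not self.skipped:
            # Only a stage that really ran writes a new checkpoint
            with open(self.checkpoint_path, "wb") as f:
                pickle.dump(self.data_store, f)

    def __del__(self):
        self.close()


class Preprocessing(CheckpointedStage):
    """Example stage; counts how often its work is actually executed."""

    calls = 0

    def run(self):
        Preprocessing.calls += 1
        self.data_store["n_samples"] = 100
```

With this sketch, a second `Preprocessing(checkpoint_dir)` in the same directory would load the pickle instead of running again, while `Preprocessing(checkpoint_dir, force=True)` would rerun regardless of an existing checkpoint.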
Background: This feature is needed when mlt runs on HPC systems with different partitions. For example, experiment setup and preprocessing should run on CPU nodes (and on login nodes, because of the required internet connection), whereas the training step should run on the GPU partition. The post-processing can then run on CPU again; it has not yet been evaluated whether a GPU is required for the bootstrap prediction or whether it is actually faster.
- implement checkpoint saving on local disk
- implement loading of checkpoints
- implement skipping of execution if checkpoint was loaded
- implement `RunEnvironment` behaviour (when not called as inheritance): add force button, clean-up button
- check speed of postprocessing depending on partition (not really related, but interesting for the final setup: if postprocessing is run on CPU or GPU)
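One possible reading of the force and clean-up buttons for the directly used `RunEnvironment` is sketched below. The names `clean_checkpoints` and `RunControl`, and the interpretation that "force" removes existing checkpoints before the run while "clean-up" removes them after a successful run, are assumptions made for illustration and are not taken from the existing code.

```python
from pathlib import Path


def clean_checkpoints(checkpoint_dir, pattern="checkpoint_*.pkl"):
    """Delete all checkpoint files in checkpoint_dir; return their names."""
    removed = []
    for path in sorted(Path(checkpoint_dir).glob(pattern)):
        path.unlink()
        removed.append(path.name)
    return removed


class RunControl:
    """Hypothetical outer context wrapping all stages of one experiment."""

    def __init__(self, checkpoint_dir, force=False, clean_up=False):
        self.checkpoint_dir = checkpoint_dir
        self.force = force
        self.clean_up = clean_up

    def __enter__(self):
        if self.force:
            # Force: drop old checkpoints so that every stage reruns
            clean_checkpoints(self.checkpoint_dir)
        return self

    def __exit__(self, exc_type, exc, tb):
        if self.clean_up and exc_type is None:
            # Clean-up: remove checkpoints only after a successful run,
            # so a failed run can still be resumed from them
            clean_checkpoints(self.checkpoint_dir)
        return False  # never swallow exceptions
```

Keeping the checkpoints when an exception occurred is a deliberate choice in this sketch: a run that failed on the GPU partition could then be resumed without repeating the preprocessing.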