Experiment Management
Outline
- Core Concepts
- Checkpoint Root Directory
- Experiments
- Runs
- File Structure of Experiments
- Utilities for Experiment Management
- Get the Absolute Path of a Run Directory
- Retrieve the Latest Run ID
Core Concepts
Checkpoint Root Directory (ckpt_root_dir
)
- The main directory where all experiment outputs are housed.
Experiments (experiment_name
)
- Symbolizes a distinct training recipe or configuration.
- Alter the
experiment_name
for transparency when updating your training recipe. - Each training under the same
experiment_name
has its individualrun
directory, ensuring no overwrites.
Runs (run_id
)
- Every individual training session is termed as a
run
. - A unique
run_id
is generated for every training, regardless of identical parameters. - Different trainings under the same
experiment_name
maintain distinct logs and checkpoints, courtesy of their separate run directories.
File Structure of Experiments
<ckpt_root_dir>
│
├── <experiment_name>
│ │
│ ├─── <run_dir>
│ │ ├─ ckpt_best.pth # Best performance during validation
│ │ ├─ ckpt_latest.pth # End of the most recent epoch
│ │ ├─ average_model.pth # Averaged over specified epochs
│ │ ├─ ckpt_epoch_*.pth # Checkpoints from certain epochs (e.g., epoch 10, 15)
│ │ ├─ events.out.tfevents.* # Tensorflow run artifacts
│ │ └─ log_<timestamp>.txt # Trainer logs of that particular run
│ │
│ └─── <other_run_dir>
│ └─ ...
│
└─── <other_experiment_name>
│
├─── <run_dir>
│ └─ ...
│
└─── <another_run_dir>
└─ ...
Utilities
A. Get the absolute path of a run directory
Manually navigate using <ckpt_root_dir>/<experiment_name>/<run_dir>
or utilize the following programmatic approach:
from super_gradients.common.environment.checkpoints_dir_utils import get_checkpoints_dir_path
checkpoints_dir_path = get_checkpoints_dir_path(experiment_name="<experiment_name>", run_id="<run_id>")
B. Get the latest run id
from super_gradients.common.environment.checkpoints_dir_utils import get_latest_run_id
run_id = get_latest_run_id(experiment_name="<experiment_name>")
Next Steps: - Dive into the checkpoints tutorial to grasp the essence of checkpoints, enabling you to resume trainings or access checkpoints from prior runs. - The logs tutorial focuses on the log files stored in your run directories, offering insights into the training progression.