Experiment Management


Core Concepts

Checkpoint Root Directory (ckpt_root_dir)

  • The top-level directory where all experiment outputs are stored.

Experiments (experiment_name)

  • Represents a distinct training recipe or configuration.
  • Change the experiment_name whenever you change your training recipe, so that results from different recipes stay easy to tell apart.
  • Each training run under the same experiment_name gets its own run directory, so earlier results are never overwritten.

Runs (run_id)

  • Each individual training session is called a run.
  • A unique run_id is generated for every training session, even when the parameters are identical to those of a previous run.
  • Runs under the same experiment_name keep separate logs and checkpoints because each one writes to its own run directory.

File Structure of Experiments

├── <experiment_name>
│   │
│   ├─── <run_dir>
│   │     ├─ ckpt_best.pth                   # Best performance during validation
│   │     ├─ ckpt_latest.pth                 # End of the most recent epoch
│   │     ├─ average_model.pth               # Averaged over specified epochs
│   │     ├─ ckpt_epoch_*.pth                # Checkpoints from certain epochs (e.g., epoch 10, 15)
│   │     ├─ events.out.tfevents.*           # TensorBoard event files for this run
│   │     └─ log_<timestamp>.txt             # Trainer logs of that particular run
│   │
│   └─── <other_run_dir>
│        └─ ...
└─── <other_experiment_name>
    ├─── <run_dir>
    │     └─ ...
    └─── <another_run_dir>
          └─ ...
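
The layout above can be inspected with the standard library alone. The following is a minimal sketch, assuming the checkpoint root follows the tree shown above (experiment directories containing run directories containing `.pth` files). The helper name `list_runs` is hypothetical and is not part of SuperGradients.

```python
from pathlib import Path
from typing import Dict, List


def list_runs(ckpt_root_dir: str) -> Dict[str, Dict[str, List[str]]]:
    """Map experiment name -> run directory name -> sorted checkpoint file names.

    Hypothetical helper: it only assumes the
    <ckpt_root_dir>/<experiment_name>/<run_dir>/*.pth layout shown above.
    """
    root = Path(ckpt_root_dir)
    layout: Dict[str, Dict[str, List[str]]] = {}
    # Every sub-directory of the root is treated as an experiment.
    for experiment_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        runs: Dict[str, List[str]] = {}
        # Every sub-directory of an experiment is treated as a run.
        for run_dir in sorted(p for p in experiment_dir.iterdir() if p.is_dir()):
            # Keep only checkpoint files; logs and event files are skipped.
            runs[run_dir.name] = sorted(f.name for f in run_dir.glob("*.pth"))
        layout[experiment_dir.name] = runs
    return layout
```

Calling `list_runs("<ckpt_root_dir>")` returns a nested dictionary that can be printed or searched, for example to check which runs already contain a `ckpt_best.pth`.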


A. Get the absolute path of a run directory

Navigate manually to <ckpt_root_dir>/<experiment_name>/<run_dir>, or resolve the path programmatically:

from super_gradients.common.environment.checkpoints_dir_utils import get_checkpoints_dir_path

checkpoints_dir_path = get_checkpoints_dir_path(experiment_name="<experiment_name>", run_id="<run_id>")

B. Get the latest run id

from super_gradients.common.environment.checkpoints_dir_utils import get_latest_run_id

run_id = get_latest_run_id(experiment_name="<experiment_name>")

Pass this run_id to get_checkpoints_dir_path (section A) to get the path of the latest run directory.
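
Sections A and B can be chained into one helper. The sketch below uses only the two utilities and keyword arguments shown above. The wrapper name `latest_run_dir` is hypothetical, and the guard for a missing run is an assumption about how `get_latest_run_id` behaves when the experiment has no runs yet.

```python
def latest_run_dir(experiment_name: str) -> str:
    """Return the directory of the most recent run of an experiment.

    Hypothetical wrapper around the two utilities from sections A and B.
    """
    # Imported lazily so that defining the helper does not itself
    # require super_gradients to be importable.
    from super_gradients.common.environment.checkpoints_dir_utils import (
        get_checkpoints_dir_path,
        get_latest_run_id,
    )

    run_id = get_latest_run_id(experiment_name=experiment_name)
    # Assumption: an empty or missing run_id means no run exists yet.
    if not run_id:
        raise FileNotFoundError(f"No run found for experiment '{experiment_name}'")
    return get_checkpoints_dir_path(experiment_name=experiment_name, run_id=run_id)
```

Failing loudly when no run exists avoids silently pointing at a directory that holds no checkpoints.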

Next Steps

  • Read the checkpoints tutorial to learn how checkpoints work and how to resume training or load checkpoints from previous runs.
  • Read the logs tutorial to learn about the log files stored in your run directories, which record how training progressed.