Experiment Management

Outline

Core Concepts
Checkpoint Root Directory
Experiments
Runs
File Structure of Experiments
Utilities
Get the absolute path of a run directory
Get the latest run id

Core Concepts

Checkpoint Root Directory (`ckpt_root_dir`)

The top-level directory under which all experiment outputs are stored.

Experiments (`experiment_name`)

Represents a distinct training recipe or configuration.
Change the experiment_name whenever you update your training recipe, so that results from different recipes stay clearly separated.
Each training launched under the same experiment_name gets its own run directory, so earlier results are never overwritten.

Runs (`run_id`)

Each individual training session is called a run.
A unique run_id is generated for every training, even when its parameters are identical to those of an earlier run.
Because each run has its own run directory, trainings under the same experiment_name keep separate logs and checkpoints.
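
To illustrate why separate run directories prevent overwrites, here is a minimal sketch of allocating a fresh directory for each training. The helper name `new_run_dir` and the `RUN_<timestamp>_<suffix>` naming scheme are assumptions made for illustration only; this document does not specify how run_id values are formatted, and this is not the library's own code.

```python
import uuid
from datetime import datetime
from pathlib import Path


def new_run_dir(ckpt_root_dir: str, experiment_name: str) -> Path:
    """Create and return a fresh run directory (hypothetical helper)."""
    # Hypothetical ID format: a timestamp plus a short random suffix.
    run_id = f"RUN_{datetime.now():%Y%m%d_%H%M%S}_{uuid.uuid4().hex[:8]}"
    run_dir = Path(ckpt_root_dir) / experiment_name / run_id
    # exist_ok=False raises instead of silently reusing an existing directory.
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir
```

Calling `new_run_dir` twice with the same experiment name yields two sibling directories, so the second training cannot clobber the first one's logs or checkpoints.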

File Structure of Experiments

<ckpt_root_dir>
│
├── <experiment_name>
│   │
│   ├─── <run_dir>
│   │     ├─ ckpt_best.pth                   # Best performance during validation
│   │     ├─ ckpt_latest.pth                 # End of the most recent epoch
│   │     ├─ average_model.pth               # Averaged over specified epochs
│   │     ├─ ckpt_epoch_*.pth                # Checkpoints from certain epochs (e.g., epoch 10, 15)
│   │     ├─ events.out.tfevents.*           # TensorBoard event files for this run
│   │     └─ log_<timestamp>.txt             # Trainer logs of that particular run
│   │
│   └─── <other_run_dir>
│        └─ ...
│
└─── <other_experiment_name>
    │
    ├─── <run_dir>
    │     └─ ...
    │
    └─── <another_run_dir>
          └─ ...
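
The layout above can be explored with the standard library alone. The following is a minimal sketch, assuming only the directory structure shown in the tree; the function names `list_runs` and `best_checkpoint` are hypothetical and are not part of any library API.

```python
from pathlib import Path
from typing import Dict, List


def list_runs(ckpt_root_dir: str) -> Dict[str, List[str]]:
    """Map each experiment name to the sorted names of its run directories."""
    root = Path(ckpt_root_dir)
    runs: Dict[str, List[str]] = {}
    for experiment_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Every sub-directory of an experiment directory is treated as a run.
        runs[experiment_dir.name] = sorted(
            p.name for p in experiment_dir.iterdir() if p.is_dir()
        )
    return runs


def best_checkpoint(ckpt_root_dir: str, experiment_name: str, run_dir: str) -> Path:
    """Build the path to ckpt_best.pth for one run, following the tree above."""
    return Path(ckpt_root_dir) / experiment_name / run_dir / "ckpt_best.pth"
```

`best_checkpoint` only builds the path; check `.exists()` on the result before loading, since a run that stopped before its first validation may not have written that file.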

Utilities

A. Get the absolute path of a run directory

Navigate manually to <ckpt_root_dir>/<experiment_name>/<run_dir>, or resolve the path programmatically:

from super_gradients.common.environment.checkpoints_dir_utils import get_checkpoints_dir_path

checkpoints_dir_path = get_checkpoints_dir_path(experiment_name="<experiment_name>", run_id="<run_id>")

B. Get the latest run id

from super_gradients.common.environment.checkpoints_dir_utils import get_latest_run_id

run_id = get_latest_run_id(experiment_name="<experiment_name>")

Pass this run_id to get_checkpoints_dir_path (utility A) to obtain the path of the latest run directory.
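
The two utilities can be wrapped in one small helper. This is a sketch rather than an official API: the name `get_latest_run_dir` is hypothetical, and the handling of a missing run ID assumes that `get_latest_run_id` returns an empty value when the experiment has no runs yet, which this document does not confirm.

```python
def get_latest_run_dir(experiment_name: str) -> str:
    """Return the directory of the most recent run of an experiment."""
    # Imported inside the function so that defining it does not require
    # the library to be installed.
    from super_gradients.common.environment.checkpoints_dir_utils import (
        get_checkpoints_dir_path,
        get_latest_run_id,
    )

    run_id = get_latest_run_id(experiment_name=experiment_name)
    if not run_id:
        # Assumption: an empty value means the experiment has no runs yet.
        raise FileNotFoundError(f"No runs found for experiment '{experiment_name}'")
    return get_checkpoints_dir_path(experiment_name=experiment_name, run_id=run_id)
```

Usage: `latest_dir = get_latest_run_dir("<experiment_name>")`.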

Next Steps:
- Dive into the checkpoints tutorial to learn how checkpoints work, so you can resume trainings or access checkpoints from prior runs.
- The logs tutorial covers the log files stored in your run directories, which offer insight into training progress.