Environment
pop_arg(arg_name, default_value=None)
Get the specified args and remove them from argv
Source code in V3_1/src/super_gradients/common/environment/argparse_utils.py, lines 12-23
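A minimal usage sketch, assuming `pop_arg` is importable from the module listed above and that it strips a `--arg_name`-style flag from `sys.argv`; the flag name here is purely illustrative:

```python
from super_gradients.common.environment.argparse_utils import pop_arg

# Hypothetical flag: read "--experiment_name" (if present) and remove it from sys.argv,
# so downstream argument parsers no longer see it.
experiment_name = pop_arg("experiment_name", default_value="my_experiment")
```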
pop_local_rank()
Pop the python arg "local-rank". If it exists, inform the user with a log; otherwise return -1.
Source code in V3_1/src/super_gradients/common/environment/argparse_utils.py, lines 26-31
add_params_to_cfg(cfg, params)
Add parameters to an existing config
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| cfg | DictConfig | OmegaConf config | required |
| params | List[str] | List of parameters to add, in dotlist format (e.g. ["training_hyperparams.resume=True"]) | required |
Source code in V3_1/src/super_gradients/common/environment/cfg_utils.py, lines 88-94
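A short sketch of adding dotlist parameters to an existing config, assuming the function is importable from the cfg_utils module listed above:

```python
from omegaconf import OmegaConf
from super_gradients.common.environment.cfg_utils import add_params_to_cfg

cfg = OmegaConf.create({"training_hyperparams": {"resume": False}})
# Dotlist format, as in the docstring example; the config is modified in place.
add_params_to_cfg(cfg, params=["training_hyperparams.resume=True"])
print(cfg.training_hyperparams.resume)  # expected: True
```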
load_arch_params(config_name, recipes_dir_path=None, overrides=None)
Load a single arch_params file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config_name | str | Name of the yaml to load (e.g. "resnet18_cifar_arch_params") | required |
| recipes_dir_path | Optional[str] | Optional. Main directory where all recipes are stored (e.g. ../super_gradients/recipes). This directory should include an "arch_params" folder, which itself should include the config file named after config_name. | None |
| overrides | Optional[list] | List of hydra overrides for the config file | None |
Source code in V3_1/src/super_gradients/common/environment/cfg_utils.py, lines 129-137
load_dataset_params(config_name, recipes_dir_path=None, overrides=None)
Load a single dataset_params file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config_name | str | Name of the yaml to load (e.g. "cifar10_dataset_params") | required |
| recipes_dir_path | Optional[str] | Optional. Main directory where all recipes are stored (e.g. ../super_gradients/recipes). This directory should include a "dataset_params" folder, which itself should include the config file named after config_name. | None |
| overrides | Optional[list] | List of hydra overrides for the config file | None |
Source code in V3_1/src/super_gradients/common/environment/cfg_utils.py, lines 151-159
load_experiment_cfg(experiment_name, ckpt_root_dir=None)
Load the hydra config associated to a specific experiment.
Background information: every time an experiment is launched from a recipe, all the hydra config params are stored in a hidden ".hydra" folder. This hidden folder is used here to recreate the exact same config as the one used to launch the experiment (including hydra overrides).
The motivation is to be able to resume or evaluate an experiment with the exact same config as the one that was used when the experiment was initially started, regardless of any change that might have been introduced to the recipe, and also while using the same overrides that were used for that experiment.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| experiment_name | str | Name of the experiment to resume | required |
| ckpt_root_dir | str | Directory including the checkpoints | None |

Returns:

| Type | Description |
|---|---|
| DictConfig | The config that was used for that experiment |
Source code in V3_1/src/super_gradients/common/environment/cfg_utils.py, lines 55-85
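A hedged sketch of reloading the config of a previously launched experiment; the experiment name, checkpoint root and printed type are placeholders for whatever your run produced:

```python
from super_gradients.common.environment.cfg_utils import load_experiment_cfg

# Recreates the config stored in the experiment's hidden ".hydra" folder, overrides included.
cfg = load_experiment_cfg(experiment_name="cifar10_resnet_run", ckpt_root_dir="/data/checkpoints")
print(type(cfg))  # omegaconf DictConfig, per the Returns table above
```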
load_recipe(config_name, recipes_dir_path=None, overrides=None)
Load a single file from the recipe directory.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config_name | str | Name of the yaml to load (e.g. "cifar10_resnet") | required |
| recipes_dir_path | Optional[str] | Optional. Main directory where all recipes are stored (e.g. ../super_gradients/recipes). This directory should include a folder corresponding to the subconfig, which itself should include the config file named after config_name. | None |
| overrides | Optional[list] | List of hydra overrides for the config file | None |
Source code in V3_1/src/super_gradients/common/environment/cfg_utils.py, lines 34-52
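A minimal sketch of loading a recipe with hydra overrides, assuming the function is importable from the cfg_utils module above and that "cifar10_resnet" (the docstring example) exists in the default recipes directory; the override key is an assumption for illustration:

```python
from super_gradients.common.environment.cfg_utils import load_recipe

cfg = load_recipe(
    config_name="cifar10_resnet",
    overrides=["training_hyperparams.max_epochs=10"],  # assumed key, for illustration only
)
```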
load_recipe_from_subconfig(config_name, config_type, recipes_dir_path=None, overrides=None)
Load a single file (e.g. "resnet18_cifar_arch_params") stored in a subconfig folder (e.g. "arch_params") of the recipe directory.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config_name | str | Name of the yaml to load (e.g. "resnet18_cifar_arch_params") | required |
| config_type | str | Type of the subconfig (e.g. "arch_params") | required |
| recipes_dir_path | Optional[str] | Optional. Main directory where all recipes are stored (e.g. ../super_gradients/recipes). This directory should include a folder corresponding to the subconfig, which itself should include the config file named after config_name. | None |
| overrides | Optional[list] | List of hydra overrides for the config file | None |
Source code in V3_1/src/super_gradients/common/environment/cfg_utils.py, lines 97-126
load_training_hyperparams(config_name, recipes_dir_path=None, overrides=None)
Load a single training_hyperparams file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config_name | str | Name of the yaml to load (e.g. "cifar10_resnet_train_params") | required |
| recipes_dir_path | Optional[str] | Optional. Main directory where all recipes are stored (e.g. ../super_gradients/recipes). This directory should include a "training_hyperparams" folder, which itself should include the config file named after config_name. | None |
| overrides | Optional[list] | List of hydra overrides for the config file | None |
Source code in V3_1/src/super_gradients/common/environment/cfg_utils.py, lines 140-148
override_cfg(cfg, overrides)
Override a config in place with a dictionary-like set of overrides.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| cfg | DictConfig | OmegaConf config | required |
| overrides | Union[DictConfig, Dict[str, Any]] | Dictionary-like object that will be used to override cfg | required |
Source code in V3_1/src/super_gradients/common/environment/cfg_utils.py, lines 162-168
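A small in-place override sketch; since the function mutates cfg, nothing is returned. The expected output assumes merge-style semantics, which is an assumption, not a documented contract:

```python
from omegaconf import OmegaConf
from super_gradients.common.environment.cfg_utils import override_cfg

cfg = OmegaConf.create({"training_hyperparams": {"initial_lr": 0.1}})
override_cfg(cfg, overrides={"training_hyperparams": {"initial_lr": 0.01}})
print(cfg.training_hyperparams.initial_lr)  # expected: 0.01 (assuming merge semantics)
```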
get_checkpoints_dir_path(experiment_name, ckpt_root_dir=None)
Get the directory that includes all the checkpoints (and logs) of an experiment.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| experiment_name | str | Name of the experiment. | required |
| ckpt_root_dir | str | Path to the directory where all the experiments are organised, each sub-folder representing a specific experiment. If None, SG will first check if a package named 'checkpoints' exists. If not, SG will look for the root of the project that includes the script that was launched. If not found, raise an error. | None |

Returns:

| Type | Description |
|---|---|
| str | Path of the folder where the experiment checkpoints and logs will be stored. |
Source code in V3_1/src/super_gradients/common/environment/checkpoints_dir_utils.py, lines 45-58
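A hedged usage sketch; the experiment name and root directory are placeholders, and the printed path simply follows the convention described above (one sub-folder per experiment):

```python
from super_gradients.common.environment.checkpoints_dir_utils import get_checkpoints_dir_path

ckpt_dir = get_checkpoints_dir_path(experiment_name="my_experiment", ckpt_root_dir="/data/checkpoints")
print(ckpt_dir)  # e.g. /data/checkpoints/my_experiment
```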
get_ckpt_local_path(experiment_name, ckpt_name, external_checkpoint_path, ckpt_root_dir=None)
Get the local path to the checkpoint file, which will be:
- By default: YOUR_REPO_ROOT/super_gradients/checkpoints/experiment_name/ckpt_name.
- external_checkpoint_path, when external_checkpoint_path != None.
- ckpt_root_dir/experiment_name/ckpt_name, when ckpt_root_dir != None.
- If the checkpoint file is remotely located: when overwrite_local_checkpoint=True it will be saved to a temporary path, which will be returned; otherwise it will be downloaded to YOUR_REPO_ROOT/super_gradients/checkpoints/experiment_name and overwrite YOUR_REPO_ROOT/super_gradients/checkpoints/experiment_name/ckpt_name if such a file exists.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| experiment_name | str | Experiment name attr in trainer | required |
| ckpt_name | str | Checkpoint filename | required |
| external_checkpoint_path | str | Full path to a checkpoint file (that might be located outside of the super_gradients/checkpoints directory) | required |
| ckpt_root_dir | str | Local root directory path where all experiment logging directories will reside. When None, it is assumed that pkg_resources.resource_filename('checkpoints', "") exists and will be used. | None |

Returns: local path of the checkpoint file (str).
Source code in V3_1/src/super_gradients/common/environment/checkpoints_dir_utils.py, lines 61-86
get_project_checkpoints_dir_path()
Get the checkpoints directory at the root of the user's project. Create it if it doesn't exist. Return None if the project root is not found.
Source code in V3_1/src/super_gradients/common/environment/checkpoints_dir_utils.py, lines 32-42
find_free_port()
Find an available port on the current machine/node. Note: there is still a chance the port could be taken by another process.
Source code in V3_1/src/super_gradients/common/environment/ddp_utils.py, lines 71-79
init_trainer()
Initialize the super_gradients environment.
This function should be the first thing to be called by any code running super_gradients.
Source code in V3_1/src/super_gradients/common/environment/ddp_utils.py, lines 10-17
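Since init_trainer is documented as the first thing any super_gradients code should call, a typical entry point looks roughly like this. The import path is taken from the source file listed above (the top-level super_gradients package may also re-export it), and the body of main() is a placeholder:

```python
from super_gradients.common.environment.ddp_utils import init_trainer


def main():
    # ... build dataloaders, model, trainer, etc.
    pass


if __name__ == "__main__":
    init_trainer()  # must run before any other super_gradients logic
    main()
```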
is_distributed()
Check if current process is a DDP subprocess.
Source code in V3_1/src/super_gradients/common/environment/ddp_utils.py, lines 20-22
is_launched_using_sg()
Check if the current process is a subprocess launched using SG restart_script_with_ddp
Source code in V3_1/src/super_gradients/common/environment/ddp_utils.py, lines 25-27
is_main_process()
Check if the current process is considered the main process (i.e. is responsible for sanity checks, atexit upload, ...). The definition ensures that one and only one process satisfies this condition, regardless of how the run was started.
The rule is as follows:
- If not DDP: the main process is the current process.
- If DDP was launched using SuperGradients: the main process is the launching process (rank=-1).
- If DDP was launched with torch: the main process is rank 0.
Source code in V3_1/src/super_gradients/common/environment/ddp_utils.py, lines 30-47
multi_process_safe(func)
A decorator that ensures a function runs only in the main process. If not in DDP mode (local_rank = -1), the function runs normally. If in DDP mode, the function runs only in the main process (local_rank = 0). This works only for functions with no return value.
Source code in V3_1/src/super_gradients/common/environment/ddp_utils.py, lines 50-68
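A small sketch combining is_main_process and the multi_process_safe decorator, following the documented semantics (main-process only in DDP mode, unconditional otherwise); the function and messages are placeholders:

```python
from super_gradients.common.environment.ddp_utils import is_main_process, multi_process_safe


@multi_process_safe
def write_summary(message: str) -> None:
    # In DDP mode this body executes only on the main process; note it returns nothing.
    print(message)


write_summary("epoch finished")

if is_main_process():
    print("running sanity checks / uploads from the main process only")
```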
EnvironmentVariables
Class to dynamically get any environment variables.
Source code in V3_1/src/super_gradients/common/environment/env_variables.py, lines 4-37
get_cpu_percent()
Average CPU utilization across all cores.
Source code in V3_1/src/super_gradients/common/environment/monitoring/cpu.py, lines 4-6
GPUStatAggregatorIterator
dataclass
Iterator over multiple StatAggregators, which accumulate samples and aggregate them for each NVIDIA device.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Name of the statistic | required |
| sampling_fn | | How the statistic is sampled | required |
| aggregate_fn | | How the statistic samples are aggregated | required |
Source code in V3_1/src/super_gradients/common/environment/monitoring/data_models.py, lines 44-68
__iter__()
Iterate over the StatAggregator of each NVIDIA device.
Source code in V3_1/src/super_gradients/common/environment/monitoring/data_models.py, lines 66-68
__post_init__()
Initialize nvidia_management_lib and create a list of StatAggregator, one for each NVIDIA device.
Source code in V3_1/src/super_gradients/common/environment/monitoring/data_models.py, lines 58-64
StatAggregator
dataclass
Accumulate statistic samples and aggregate them.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Name of the statistic | required |
| sampling_fn | Callable | How the statistic is sampled | required |
| aggregate_fn | Callable[[List[Any], float], float] | How the statistic samples are aggregated; has to take "samples: List[Any]" and "time: float" as parameters | required |
| reset_callback_fn | Optional[Callable] | Optional; can be used to reset any system metric | None |
Source code in V3_1/src/super_gradients/common/environment/monitoring/data_models.py, lines 9-41
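A construction sketch using only the documented fields; the tensorboard-style tag and the choice of psutil.cpu_percent as a sampler are illustrative assumptions:

```python
import psutil

from super_gradients.common.environment.monitoring.data_models import StatAggregator
from super_gradients.common.environment.monitoring.utils import average

cpu_aggregator = StatAggregator(
    name="System/cpu_percent",        # assumed tag name
    sampling_fn=psutil.cpu_percent,   # how each sample is taken
    aggregate_fn=average,             # matches the (samples, time) signature documented above
)
```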
get_disk_usage_percent()
Disk memory used in percent.
Source code in V3_1/src/super_gradients/common/environment/monitoring/disk.py, lines 9-11
get_io_read_mb()
Number of MegaBytes read since import
Source code in V3_1/src/super_gradients/common/environment/monitoring/disk.py, lines 14-16
get_io_write_mb()
Number of MegaBytes written since import
Source code in V3_1/src/super_gradients/common/environment/monitoring/disk.py, lines 19-21
reset_io_read()
Reset the reference value of disk_io_counters
Source code in V3_1/src/super_gradients/common/environment/monitoring/disk.py, lines 24-27
reset_io_write()
Reset the reference value of disk_io_counters
Source code in V3_1/src/super_gradients/common/environment/monitoring/disk.py, lines 30-33
count_gpus()
Count how many GPUs NVIDIA detects.
Source code in V3_1/src/super_gradients/common/environment/monitoring/gpu/gpu.py, lines 18-20
get_device_memory_allocated_percent(gpu_index)
GPU memory allocated in percent of a given GPU.
Source code in V3_1/src/super_gradients/common/environment/monitoring/gpu/gpu.py, lines 34-38
get_device_memory_usage_percent(gpu_index)
GPU memory utilization in percent of a given GPU.
Source code in V3_1/src/super_gradients/common/environment/monitoring/gpu/gpu.py, lines 28-31
get_device_power_usage_percent(gpu_index)
GPU power usage in percent of a given GPU.
Source code in V3_1/src/super_gradients/common/environment/monitoring/gpu/gpu.py, lines 59-64
get_device_power_usage_w(gpu_index)
GPU power usage in Watts of a given GPU.
Source code in V3_1/src/super_gradients/common/environment/monitoring/gpu/gpu.py, lines 53-56
get_device_temperature_c(gpu_index)
GPU temperature in Celsius of a given GPU.
Source code in V3_1/src/super_gradients/common/environment/monitoring/gpu/gpu.py, lines 47-50
get_device_usage_percent(gpu_index)
GPU utilization in percent of a given GPU.
Source code in V3_1/src/super_gradients/common/environment/monitoring/gpu/gpu.py, lines 41-44
get_handle_by_index(gpu_index)
Get the device handle of a given GPU.
Source code in V3_1/src/super_gradients/common/environment/monitoring/gpu/gpu.py, lines 23-25
init_nvidia_management_lib()
Initialize nvml (NVIDIA management library), which is required to use pynvml.
Source code in V3_1/src/super_gradients/common/environment/monitoring/gpu/gpu.py, lines 13-15
safe_init_nvidia_management_lib()
Initialize nvml (NVIDIA management library), which is required to use pynvml. Return True on success.
Source code in V3_1/src/super_gradients/common/environment/monitoring/gpu/gpu.py, lines 4-10
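A hedged sketch combining the GPU helpers documented above; the import path follows the source file listed (monitoring/gpu/gpu.py), which may differ from the package's preferred public import path:

```python
from super_gradients.common.environment.monitoring.gpu.gpu import (
    safe_init_nvidia_management_lib,
    count_gpus,
    get_device_usage_percent,
    get_device_memory_allocated_percent,
)

if safe_init_nvidia_management_lib():  # True when NVML could be initialized
    for gpu_index in range(count_gpus()):
        util = get_device_usage_percent(gpu_index)
        mem = get_device_memory_allocated_percent(gpu_index)
        print(f"GPU {gpu_index}: {util:.1f}% utilization, {mem:.1f}% memory allocated")
```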
NVML_VALUE_NOT_AVAILABLE_uint = c_uint(-1)
module-attribute
Field Identifiers.
All Identifiers pertain to a device. Each ID is only used once and is guaranteed never to change.
NVMLError
Bases: Exception
Source code in V3_1/src/super_gradients/common/environment/monitoring/gpu/pynvml.py, lines 622-668
__new__(typ, value)
Maps value to a proper subclass of NVMLError. See _extractNVMLErrorsAsClasses function for more details
Source code in V3_1/src/super_gradients/common/environment/monitoring/gpu/pynvml.py, lines 648-657
SystemMonitor
Monitor system statistics, such as CPU usage and GPU utilization, and write them to tensorboard.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| tensorboard_writer | SummaryWriter | Tensorboard object that will be used to save the statistics | required |
| extra_gpu_stats | bool | Set to True to get extra GPU statistics, such as GPU temperature, power usage, ... Defaults to False, because these extra plots reduce tensorboard readability. | False |
Source code in V3_1/src/super_gradients/common/environment/monitoring/monitoring.py, lines 11-108
start(tensorboard_writer)
classmethod
Instantiate a SystemMonitor in a multiprocess safe way.
Source code in V3_1/src/super_gradients/common/environment/monitoring/monitoring.py, lines 101-105
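A minimal sketch of launching the monitor with a tensorboard writer; the log directory is a placeholder:

```python
from torch.utils.tensorboard import SummaryWriter

from super_gradients.common.environment.monitoring.monitoring import SystemMonitor

writer = SummaryWriter(log_dir="runs/system_monitoring")
SystemMonitor.start(tensorboard_writer=writer)  # multiprocess-safe instantiation, per the docstring
```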
get_network_recv_mb()
Number of MegaBytes received since import
Source code in V3_1/src/super_gradients/common/environment/monitoring/network.py, lines 13-15
get_network_sent_mb()
Number of MegaBytes sent since import
Source code in V3_1/src/super_gradients/common/environment/monitoring/network.py, lines 8-10
reset_network_recv()
Reset the value of net_io_counters
Source code in V3_1/src/super_gradients/common/environment/monitoring/network.py, lines 24-27
reset_network_sent()
Reset the value of net_io_counters
Source code in V3_1/src/super_gradients/common/environment/monitoring/network.py, lines 18-21
average(samples, time_diff)
Average a list of values, return None if empty list
Source code in V3_1/src/super_gradients/common/environment/monitoring/utils.py, lines 4-6
bytes_to_megabytes(b)
Convert bytes to megabytes
Source code in V3_1/src/super_gradients/common/environment/monitoring/utils.py, lines 14-17
delta_per_s(samples, time_diff)
Compute the difference per second (e.g. megabytes per second); return None for an empty list
Source code in V3_1/src/super_gradients/common/environment/monitoring/utils.py, lines 9-11
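A rough illustration of the three helpers above. The exact arithmetic (whether delta_per_s uses last-minus-first, and which megabyte divisor bytes_to_megabytes uses) is an assumption based on the docstrings, not a verified contract:

```python
from super_gradients.common.environment.monitoring.utils import average, bytes_to_megabytes, delta_per_s

samples = [10.0, 20.0, 30.0]  # e.g. three readings collected over a 6-second window
print(average(samples, 6.0))            # mean of the samples
print(delta_per_s(samples, 6.0))        # change per second over the window (assumed last - first, divided by time)
print(bytes_to_megabytes(10 * 1024 * 1024))  # roughly 10 MB
```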
virtual_memory_used_percent()
Virtual memory used in percent.
Source code in V3_1/src/super_gradients/common/environment/monitoring/virtual_memory.py, lines 4-6
RecipeShortcutsCallback
Bases: Callback
Interpolates the shortcuts defined in variable_set.yaml: lr, batch_size, val_batch_size, ema, epochs, resume (default: False), num_workers.
When any of the above are not set, they will be populated with the original values (for example config.lr will be set with config.training_hyperparams.initial_lr) for clarity in logs.
Source code in V3_1/src/super_gradients/common/environment/omegaconf_utils.py, lines 11-59
get_cls(cls_path)
A resolver for Hydra/OmegaConf that allows getting a class instead of an instance. Usage: class_of_optimizer: ${class:torch.optim.Adam}
Source code in V3_1/src/super_gradients/common/environment/omegaconf_utils.py, lines 62-71
register_hydra_resolvers()
Register all the hydra resolvers required for the super-gradients recipes.
Source code in V3_1/src/super_gradients/common/environment/omegaconf_utils.py, lines 78-91
normalize_path(path)
Normalize a directory or file path. Replace the Windows-style path separators (\) with Unix ones (/). This is necessary when running on Windows, since Hydra compose fails to find a configuration file if the config directory contains a backslash.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Input path string | required |

Returns:

| Type | Description |
|---|---|
| str | Output path string with all \ symbols replaced with /. |
Source code in V3_1/src/super_gradients/common/environment/path_utils.py, lines 1-9
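A one-line usage sketch, following the documented backslash-to-slash replacement:

```python
from super_gradients.common.environment.path_utils import normalize_path

print(normalize_path(r"C:\recipes\arch_params"))  # -> "C:/recipes/arch_params"
```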