Environment
pop_arg(arg_name, default_value=None)
Get the specified args and remove them from argv
Source code in latest/src/super_gradients/common/environment/argparse_utils.py (lines 12-23)
pop_local_rank()
Pop the Python arg "local-rank". If it exists, inform the user with a log message; otherwise return -1.
Source code in latest/src/super_gradients/common/environment/argparse_utils.py (lines 26-31)
add_params_to_cfg(cfg, params)
Add parameters to an existing config
Parameters:
Name | Type | Description | Default |
---|---|---|---|
cfg | DictConfig | OmegaConf config | required |
params | List[str] | List of parameters to add, in dotlist format (e.g. ["training_hyperparams.resume=True"]) | required |
Source code in latest/src/super_gradients/common/environment/cfg_utils.py (lines 89-95)
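A minimal sketch of the dotlist usage described above (the config content here is hypothetical):
```python
from omegaconf import OmegaConf
from super_gradients.common.environment.cfg_utils import add_params_to_cfg

# Hypothetical config; any DictConfig works.
cfg = OmegaConf.create({"training_hyperparams": {"resume": False}})
add_params_to_cfg(cfg, params=["training_hyperparams.resume=True"])
assert cfg.training_hyperparams.resume is True
```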
load_arch_params(config_name, recipes_dir_path=None, overrides=None)
Load a single arch_params file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
config_name | str | Name of the yaml to load (e.g. "resnet18_cifar_arch_params") | required |
recipes_dir_path | Optional[str] | Optional. Main directory where all recipes are stored (e.g. ../super_gradients/recipes). This directory should include an "arch_params" folder, which itself should include the config file named after config_name. | None |
overrides | Optional[list] | List of hydra overrides for the config file | None |
Source code in latest/src/super_gradients/common/environment/cfg_utils.py (lines 130-138)
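A minimal sketch, assuming the default recipes_dir_path resolves to the library's bundled recipes:
```python
from super_gradients.common.environment.cfg_utils import load_arch_params

arch_params = load_arch_params(config_name="resnet18_cifar_arch_params")
```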
load_dataset_params(config_name, recipes_dir_path=None, overrides=None)
Load a single dataset_params file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
config_name | str | Name of the yaml to load (e.g. "cifar10_dataset_params") | required |
recipes_dir_path | Optional[str] | Optional. Main directory where all recipes are stored (e.g. ../super_gradients/recipes). This directory should include a "dataset_params" folder, which itself should include the config file named after config_name. | None |
overrides | Optional[list] | List of hydra overrides for the config file | None |
Source code in latest/src/super_gradients/common/environment/cfg_utils.py (lines 152-160)
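Usage mirrors load_arch_params above:
```python
from super_gradients.common.environment.cfg_utils import load_dataset_params

dataset_params = load_dataset_params(config_name="cifar10_dataset_params")
```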
load_experiment_cfg(experiment_name, ckpt_root_dir=None, run_id=None)
Load the hydra config associated to a specific experiment.
Background information: every time an experiment is launched from a recipe, all the hydra config params are stored in a hidden ".hydra" folder. This hidden folder is used here to recreate the exact same config as the one that was used to launch the experiment (including hydra overrides).
The motivation is to be able to resume or evaluate an experiment with the exact same config as the one used when the experiment was initially started, regardless of any change that might have since been introduced to the recipe, and while using the same overrides that were used for that experiment.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
experiment_name | str | Name of the experiment to resume | required |
ckpt_root_dir | Optional[str] | Directory including the checkpoints | None |
run_id | Optional[str] | Optional. Run id of the experiment. If None, the most recent run will be loaded. | None |
Returns:
Type | Description |
---|---|
DictConfig | The config that was used for that experiment |
Source code in latest/src/super_gradients/common/environment/cfg_utils.py (lines 55-86)
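A minimal sketch; "my_experiment" is a hypothetical experiment name:
```python
from super_gradients.common.environment.cfg_utils import load_experiment_cfg

# With run_id=None, the most recent run of the experiment is used.
cfg = load_experiment_cfg(experiment_name="my_experiment")
```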
load_recipe(config_name, recipes_dir_path=None, overrides=None)
Load a single file from the recipe directory.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
config_name | str | Name of the yaml to load (e.g. "cifar10_resnet") | required |
recipes_dir_path | Optional[str] | Optional. Main directory where all recipes are stored (e.g. ../super_gradients/recipes). This directory should include a folder corresponding to the subconfig, which itself should include the config file named after config_name. | None |
overrides | Optional[list] | List of hydra overrides for the config file | None |
Source code in latest/src/super_gradients/common/environment/cfg_utils.py (lines 34-52)
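A minimal sketch; the override shown is hypothetical, and any valid hydra override works:
```python
from super_gradients.common.environment.cfg_utils import load_recipe

cfg = load_recipe(config_name="cifar10_resnet", overrides=["training_hyperparams.max_epochs=1"])
```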
load_recipe_from_subconfig(config_name, config_type, recipes_dir_path=None, overrides=None)
Load a single file (e.g. "resnet18_cifar_arch_params") stored in a subconfig (e.g. "arch_params") of the recipe directory.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
config_name | str | Name of the yaml to load (e.g. "resnet18_cifar_arch_params") | required |
config_type | str | Type of the subconfig (e.g. "arch_params") | required |
recipes_dir_path | Optional[str] | Optional. Main directory where all recipes are stored (e.g. ../super_gradients/recipes). This directory should include a folder corresponding to the subconfig, which itself should include the config file named after config_name. | None |
overrides | Optional[list] | List of hydra overrides for the config file | None |
Source code in latest/src/super_gradients/common/environment/cfg_utils.py (lines 98-127)
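A minimal sketch of the subconfig variant:
```python
from super_gradients.common.environment.cfg_utils import load_recipe_from_subconfig

arch_params = load_recipe_from_subconfig(
    config_name="resnet18_cifar_arch_params",
    config_type="arch_params",
)
```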
load_training_hyperparams(config_name, recipes_dir_path=None, overrides=None)
Load a single training_hyperparams file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
config_name | str | Name of the yaml to load (e.g. "cifar10_resnet_train_params") | required |
recipes_dir_path | Optional[str] | Optional. Main directory where all recipes are stored (e.g. ../super_gradients/recipes). This directory should include a "training_hyperparams" folder, which itself should include the config file named after config_name. | None |
overrides | Optional[list] | List of hydra overrides for the config file | None |
Source code in latest/src/super_gradients/common/environment/cfg_utils.py (lines 141-149)
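Usage mirrors the other loaders:
```python
from super_gradients.common.environment.cfg_utils import load_training_hyperparams

training_params = load_training_hyperparams(config_name="cifar10_resnet_train_params")
```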
override_cfg(cfg, overrides)
Override a config in place with a set of hydra overrides.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
cfg | DictConfig | OmegaConf config | required |
overrides | Union[DictConfig, Dict[str, Any]] | Dictionary-like object that will be used to override cfg | required |
Source code in latest/src/super_gradients/common/environment/cfg_utils.py (lines 163-169)
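A minimal sketch (config values hypothetical); note that cfg is mutated in place:
```python
from omegaconf import OmegaConf
from super_gradients.common.environment.cfg_utils import override_cfg

cfg = OmegaConf.create({"training_hyperparams": {"initial_lr": 0.1}})
override_cfg(cfg, overrides={"training_hyperparams": {"initial_lr": 0.01}})
```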
generate_run_id()
Generate a unique run ID based on the current timestamp.
Returns:
Type | Description |
---|---|
str | Unique run ID, in the format "RUN_<timestamp>" |
Source code in latest/src/super_gradients/common/environment/checkpoints_dir_utils.py (lines 20-26)
get_checkpoints_dir_path(experiment_name, ckpt_root_dir=None, run_id=None)
Get the directory that includes all the checkpoints (and logs) of an experiment. The expected layout is:
ckpt_root_dir
    - experiment_name
        - run_id
            - ...
        - ...
Parameters:
Name | Type | Description | Default |
---|---|---|---|
experiment_name | str | Name of the experiment. | required |
ckpt_root_dir | Optional[str] | Path to the directory where all the experiments are organised, each sub-folder representing a specific experiment. If None, SG will first check if a package named 'checkpoints' exists. If not, SG will look for the root of the project that includes the script that was launched. If not found, raise an error. | None |
run_id | Optional[str] | Optional. Run id of the experiment. If None, the most recent run will be loaded. | None |
Returns:
Type | Description |
---|---|
str | Path of the folder where the experiment checkpoints and logs will be stored. |
Source code in latest/src/super_gradients/common/environment/checkpoints_dir_utils.py (lines 96-115)
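A minimal sketch; "my_experiment" is hypothetical. Passing run_id=None already selects the most recent run, so the explicit lookup is shown only for illustration:
```python
from super_gradients.common.environment.checkpoints_dir_utils import (
    get_checkpoints_dir_path,
    get_latest_run_id,
)

run_id = get_latest_run_id(experiment_name="my_experiment")
ckpt_dir = get_checkpoints_dir_path(experiment_name="my_experiment", run_id=run_id)
```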
get_ckpt_local_path(experiment_name, ckpt_name, external_checkpoint_path, ckpt_root_dir=None, run_id=None)
Gets the local path to the checkpoint file, which will be:
- By default: YOUR_REPO_ROOT/super_gradients/checkpoints/experiment_name/ckpt_name.
- external_checkpoint_path, when external_checkpoint_path != None.
- ckpt_root_dir/experiment_name/ckpt_name, when ckpt_root_dir != None.
- If the checkpoint file is remotely located: when overwrite_local_checkpoint=True it will be saved in a temporary path which will be returned; otherwise it will be downloaded to YOUR_REPO_ROOT/super_gradients/checkpoints/experiment_name, overwriting YOUR_REPO_ROOT/super_gradients/checkpoints/experiment_name/ckpt_name if such a file exists.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
experiment_name | str | Name of the experiment. | required |
ckpt_name | str | Checkpoint filename | required |
external_checkpoint_path | str | Full path to the checkpoint file (which might be located outside the super_gradients/checkpoints directory) | required |
ckpt_root_dir | str | Path to the directory where all the experiments are organised, each sub-folder representing a specific experiment. If None, SG will first check if a package named 'checkpoints' exists. If not, SG will look for the root of the project that includes the script that was launched. If not found, raise an error. | None |
run_id | Optional[str] | Optional. Run id of the experiment. If None, the most recent run will be loaded. | None |
Returns:
Type | Description |
---|---|
str | Local path of the checkpoint file. |
Source code in latest/src/super_gradients/common/environment/checkpoints_dir_utils.py (lines 126-153)
get_latest_run_id(experiment_name, checkpoints_root_dir=None)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
experiment_name | str | Name of the experiment. | required |
checkpoints_root_dir | Optional[str] | Path to the directory where all the experiments are organised, each sub-folder representing a specific experiment. If None, SG will first check if a package named 'checkpoints' exists. If not, SG will look for the root of the project that includes the script that was launched. If not found, raise an error. | None |
Returns:
Type | Description |
---|---|
Optional[str] | Latest valid run ID, in the format "RUN_<timestamp>" |
Source code in latest/src/super_gradients/common/environment/checkpoints_dir_utils.py (lines 38-57)
get_project_checkpoints_dir_path()
Get the checkpoints directory at the root of the user's project. Create it if it doesn't exist. Return None if the project root is not found.
Source code in latest/src/super_gradients/common/environment/checkpoints_dir_utils.py (lines 83-93)
is_run_dir(dirname)
Check if a directory is a run directory.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dirname | str | Directory name. | required |
Returns:
Type | Description |
---|---|
bool | True if the directory is a run directory, False otherwise. |
Source code in latest/src/super_gradients/common/environment/checkpoints_dir_utils.py (lines 29-35)
broadcast_from_master(data)
Broadcast data from the master node to all other nodes. This may be required when you want to compute something only on the master node (e.g. a computationally heavy metric) and don't want to waste the CPU of the other nodes doing the same work simultaneously.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | Any | Data to be broadcast from the master node (rank 0) | required |
Returns:
Type | Description |
---|---|
Any | Data from the rank 0 node |
Source code in latest/src/super_gradients/common/environment/ddp_utils.py (lines 157-171)
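A minimal sketch; heavy_metric is a hypothetical placeholder for an expensive computation:
```python
from super_gradients.common.environment.ddp_utils import broadcast_from_master, get_local_rank

def heavy_metric():
    # Hypothetical expensive computation, done only on rank 0.
    return 0.42

# Every rank receives the rank-0 result.
value = broadcast_from_master(heavy_metric() if get_local_rank() == 0 else None)
```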
execute_and_distribute_from_master(func)
Decorator to execute a function on the master process and distribute the result to all other processes. Useful in parallel computing scenarios where a computational task needs to be performed only on the master node (e.g., a computationally heavy calculation) and the result must be shared with other nodes without redundant computation.
Example usage:
>>> @execute_and_distribute_from_master
>>> def some_code_to_run(param1, param2):
>>>     return param1 + param2
The wrapped function will only be executed on the master node, and the result will be propagated to all other nodes.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
func | Callable[..., Any] | The function to be executed on the master process and whose result is to be distributed. | required |
Returns:
Type | Description |
---|---|
Callable[..., Any] | A wrapper function that encapsulates the execute-and-distribute logic. |
Source code in latest/src/super_gradients/common/environment/ddp_utils.py (lines 124-154)
find_free_port()
Find an available port on the current machine/node. Note: there is still a chance the port could be taken by another process.
Source code in latest/src/super_gradients/common/environment/ddp_utils.py (lines 75-83)
get_local_rank()
Returns the local rank if running in DDP, and 0 otherwise
Returns:
Type | Description |
---|---|
| local rank |
Source code in latest/src/super_gradients/common/environment/ddp_utils.py (lines 86-91)
get_world_size()
Returns the world size if running in DDP, and 1 otherwise
Returns:
Type | Description |
---|---|
int | world size |
Source code in latest/src/super_gradients/common/environment/ddp_utils.py (lines 104-113)
init_trainer()
Initialize the super_gradients environment.
This function should be the first thing to be called by any code running super_gradients.
Source code in latest/src/super_gradients/common/environment/ddp_utils.py (lines 14-21)
is_distributed()
Check if current process is a DDP subprocess.
Source code in latest/src/super_gradients/common/environment/ddp_utils.py (lines 24-26)
is_launched_using_sg()
Check if the current process is a subprocess launched using SG restart_script_with_ddp
Source code in latest/src/super_gradients/common/environment/ddp_utils.py (lines 29-31)
is_main_process()
Check if the current process is considered the main process (i.e. responsible for the sanity check, atexit upload, ...). The definition ensures that one and only one process satisfies this condition, regardless of how the run was started.
The rule is as follows:
- If not DDP: the main process is the current process.
- If DDP was launched using SuperGradients: the main process is the launching process (rank=-1).
- If DDP was launched with torch: the main process is rank 0.
Source code in latest/src/super_gradients/common/environment/ddp_utils.py (lines 34-51)
multi_process_safe(func)
A decorator for making sure a function runs only in the main process. If not in DDP mode (local_rank = -1), the function will run. If in DDP mode, the function will run only in the main process (local_rank = 0). This works only for functions with no return value.
Source code in latest/src/super_gradients/common/environment/ddp_utils.py (lines 54-72)
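A minimal sketch of the decorator:
```python
from super_gradients.common.environment.ddp_utils import multi_process_safe

@multi_process_safe
def log_once(message: str) -> None:
    # No return value, as required by the decorator.
    print(message)

log_once("printed once, even when several DDP processes are running")
```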
EnvironmentVariables
Class to dynamically get any environment variables.
Source code in latest/src/super_gradients/common/environment/env_variables.py (lines 5-46)
get_cpu_percent()
Average utilization over all CPUs.
Source code in latest/src/super_gradients/common/environment/monitoring/cpu.py (lines 4-6)
GPUStatAggregatorIterator
dataclass
Iterator over multiple StatAggregators that accumulate samples and aggregate them, one per NVIDIA device.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | Name of the statistic | required |
sampling_fn | | How the statistic is sampled | required |
aggregate_fn | | How the statistic samples are aggregated | required |
Source code in latest/src/super_gradients/common/environment/monitoring/data_models.py (lines 44-68)
__iter__()
Iterate over the StatAggregator of each node
Source code in latest/src/super_gradients/common/environment/monitoring/data_models.py (lines 66-68)
__post_init__()
Initialize nvidia_management_lib and create a list of StatAggregator, one for each NVIDIA device.
Source code in latest/src/super_gradients/common/environment/monitoring/data_models.py (lines 58-64)
StatAggregator
dataclass
Accumulates statistic samples and aggregates them.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | Name of the statistic | required |
sampling_fn | Callable | How the statistic is sampled | required |
aggregate_fn | Callable[[List[Any], float], float] | How the statistic samples are aggregated; has to take "samples: List[Any]" and "time: float" as parameters | required |
reset_callback_fn | Optional[Callable] | Optional, can be used to reset any system metric | None |
Source code in latest/src/super_gradients/common/environment/monitoring/data_models.py (lines 9-41)
get_disk_usage_percent()
Disk memory used in percent.
Source code in latest/src/super_gradients/common/environment/monitoring/disk.py (lines 9-11)
get_io_read_mb()
Number of MegaBytes read since import
Source code in latest/src/super_gradients/common/environment/monitoring/disk.py (lines 14-16)
get_io_write_mb()
Number of MegaBytes written since import
Source code in latest/src/super_gradients/common/environment/monitoring/disk.py (lines 19-21)
reset_io_read()
Reset the reference value of the disk IO read counter.
Source code in latest/src/super_gradients/common/environment/monitoring/disk.py (lines 24-27)
reset_io_write()
Reset the reference value of the disk IO write counter.
Source code in latest/src/super_gradients/common/environment/monitoring/disk.py (lines 30-33)
count_gpus()
Count how many GPUs NVIDIA detects.
Source code in latest/src/super_gradients/common/environment/monitoring/gpu/gpu.py (lines 18-20)
get_device_memory_allocated_percent(gpu_index)
GPU memory allocated in percent of a given GPU.
Source code in latest/src/super_gradients/common/environment/monitoring/gpu/gpu.py (lines 34-38)
get_device_memory_usage_percent(gpu_index)
GPU memory utilization in percent of a given GPU.
Source code in latest/src/super_gradients/common/environment/monitoring/gpu/gpu.py (lines 28-31)
get_device_power_usage_percent(gpu_index)
GPU power usage in percent of a given GPU.
Source code in latest/src/super_gradients/common/environment/monitoring/gpu/gpu.py (lines 59-64)
get_device_power_usage_w(gpu_index)
GPU power usage in Watts of a given GPU.
Source code in latest/src/super_gradients/common/environment/monitoring/gpu/gpu.py (lines 53-56)
get_device_temperature_c(gpu_index)
GPU temperature in Celsius of a given GPU.
Source code in latest/src/super_gradients/common/environment/monitoring/gpu/gpu.py (lines 47-50)
get_device_usage_percent(gpu_index)
GPU utilization in percent of a given GPU.
Source code in latest/src/super_gradients/common/environment/monitoring/gpu/gpu.py (lines 41-44)
get_handle_by_index(gpu_index)
Get the device handle of a given GPU.
Source code in latest/src/super_gradients/common/environment/monitoring/gpu/gpu.py (lines 23-25)
init_nvidia_management_lib()
Initialize nvml (the NVIDIA management library), which is required to use pynvml.
Source code in latest/src/super_gradients/common/environment/monitoring/gpu/gpu.py (lines 13-15)
safe_init_nvidia_management_lib()
Initialize nvml (the NVIDIA management library), which is required to use pynvml. Return True on success.
Source code in latest/src/super_gradients/common/environment/monitoring/gpu/gpu.py (lines 4-10)
NVML_VALUE_NOT_AVAILABLE_uint = c_uint(-1)
module-attribute
Field Identifiers.
All Identifiers pertain to a device. Each ID is only used once and is guaranteed never to change.
NVMLError
Bases: Exception
Source code in latest/src/super_gradients/common/environment/monitoring/gpu/pynvml.py (lines 622-668)
__new__(typ, value)
Maps value to a proper subclass of NVMLError. See the _extractNVMLErrorsAsClasses function for more details.
Source code in latest/src/super_gradients/common/environment/monitoring/gpu/pynvml.py (lines 648-657)
SystemMonitor
Monitor system statistics, such as CPU usage, GPU usage, etc., and write them to tensorboard.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tensorboard_writer | SummaryWriter | Tensorboard object that will be used to save the statistics | required |
extra_gpu_stats | bool | Set to True to get extra gpu statistics, such as gpu temperature, power usage, etc. Defaults to False, because these extras reduce tensorboard readability. | False |
Source code in latest/src/super_gradients/common/environment/monitoring/monitoring.py (lines 11-108)
start(tensorboard_writer)
classmethod
Instantiate a SystemMonitor in a multiprocess safe way.
Source code in latest/src/super_gradients/common/environment/monitoring/monitoring.py (lines 101-105)
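A minimal sketch; the log directory is hypothetical:
```python
from torch.utils.tensorboard import SummaryWriter

from super_gradients.common.environment.monitoring.monitoring import SystemMonitor

writer = SummaryWriter(log_dir="runs/my_experiment")
SystemMonitor.start(tensorboard_writer=writer)
```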
get_network_recv_mb()
Number of MegaBytes received since import
Source code in latest/src/super_gradients/common/environment/monitoring/network.py (lines 13-15)
get_network_sent_mb()
Number of MegaBytes sent since import
Source code in latest/src/super_gradients/common/environment/monitoring/network.py (lines 8-10)
reset_network_recv()
Reset the value of net_io_counters
Source code in latest/src/super_gradients/common/environment/monitoring/network.py (lines 24-27)
reset_network_sent()
Reset the value of net_io_counters
Source code in latest/src/super_gradients/common/environment/monitoring/network.py (lines 18-21)
average(samples, time_diff)
Average a list of values; return None for an empty list.
Source code in latest/src/super_gradients/common/environment/monitoring/utils.py (lines 4-6)
bytes_to_megabytes(b)
Convert bytes to megabytes
Source code in latest/src/super_gradients/common/environment/monitoring/utils.py (lines 14-17)
delta_per_s(samples, time_diff)
Compute the difference per second (e.g. megabytes per second); return None for an empty list.
Source code in latest/src/super_gradients/common/environment/monitoring/utils.py (lines 9-11)
virtual_memory_used_percent()
Virtual memory used in percent.
Source code in latest/src/super_gradients/common/environment/monitoring/virtual_memory.py (lines 4-6)
RecipeShortcutsCallback
Bases: Callback
Interpolates the shortcuts defined in variable_set.yaml:
- lr
- batch_size
- val_batch_size
- ema
- epochs
- resume: False
- num_workers
When any of the above are not set, they will be populated with the original values (for example, config.lr will be set from config.training_hyperparams.initial_lr) for clarity in the logs.
Source code in latest/src/super_gradients/common/environment/omegaconf_utils.py (lines 10-58)
get_cls(cls_path)
A resolver for Hydra/OmegaConf that allows getting a class instead of an instance. Usage: class_of_optimizer: ${class:torch.optim.Adam}
Source code in latest/src/super_gradients/common/environment/omegaconf_utils.py (lines 61-70)
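A minimal sketch, assuming register_hydra_resolvers (documented below) registers the ${class:...} resolver:
```python
import torch
from omegaconf import OmegaConf

from super_gradients.common.environment.omegaconf_utils import register_hydra_resolvers

register_hydra_resolvers()
cfg = OmegaConf.create({"class_of_optimizer": "${class:torch.optim.Adam}"})
assert cfg.class_of_optimizer is torch.optim.Adam  # the class itself, not an instance
```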
register_hydra_resolvers()
Register all the hydra resolvers required for the super-gradients recipes.
Source code in latest/src/super_gradients/common/environment/omegaconf_utils.py (lines 79-92)
normalize_path(path)
Normalize the directory or file path. Replace the Windows-style path separators (\) with Unix ones (/). This is necessary when running on Windows, since Hydra compose fails to find a configuration file if the config directory contains a backslash symbol.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Input path string | required |
Returns:
Type | Description |
---|---|
str | Output path string with all \ symbols replaced with /. |
Source code in latest/src/super_gradients/common/environment/path_utils.py (lines 1-9)
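A minimal sketch:
```python
from super_gradients.common.environment.path_utils import normalize_path

print(normalize_path("C:\\projects\\my_recipes"))  # -> C:/projects/my_recipes
```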