Pose Estimation
Pose estimation is a computer vision task that involves estimating the position and orientation of objects or people in images or videos. It typically involves identifying specific keypoints or body parts, such as joints, and determining their relative positions and orientations. Pose estimation has numerous applications, including robotics, augmented reality, human-computer interaction, and sports analytics.
Top-down and bottom-up are the two most commonly used approaches in pose estimation; the main difference between them is the order in which the pose is estimated.
In a top-down approach, an object detection model is used to identify the object of interest, such as a person or a car, and a separate pose estimation model is used to estimate the keypoints of the object.
In contrast, a bottom-up approach first identifies individual body parts or joints and then connects them to form a complete pose.
In summary, the top-down approach starts by detecting an object and then estimates its pose, while the bottom-up approach first identifies the body parts and then assembles them into a complete pose.
Implemented models
Model | Model class | Target Generator | Loss Class | Decoding Callback | Visualization Callback |
---|---|---|---|---|---|
DEKR | DEKRPoseEstimationModel | DEKRTargetsGenerator | DEKRLoss | DEKRPoseEstimationDecodeCallback | DEKRVisualizationCallback |
Training
For the sake of being specific, this tutorial uses the training of the DEKR model as the running example in the explanations below.
The easiest way to start training a pose estimation model is to use a recipe from SuperGradients.
Prerequisites
- You have to install SuperGradients first. Please refer to the Installation section for more details.
- Prepare the COCO dataset as described in the Computer Vision Datasets Setup under Pose Estimation Datasets section.
After you have met the prerequisites, you can start training the model by running the following from the root of the repository:
Training from recipe
python -m super_gradients.train_from_recipe --config-name=coco2017_pose_dekr_w32 multi_gpu=Off num_gpus=1
Note that the default configuration for this recipe uses 8 GPUs in DDP mode. This hardware setup may not be available to everyone, so in the example above we override the GPU settings to use a single GPU. It is highly recommended to read through the recipe file to get a better understanding of the hyperparameters used there. If you're unfamiliar with config files, we recommend reading the Configuration Files part first.
The start of the config file looks like this:
defaults:
- training_hyperparams: coco2017_dekr_pose_train_params
- dataset_params: coco_pose_estimation_dekr_dataset_params
- arch_params: dekr_w32_arch_params
- checkpoint_params: default_checkpoint_params
- _self_
Here we define the default values for the following parameters:
* training_hyperparams
- These are our training hyperparameters. Things like the learning rate, optimizer, use of mixed precision, EMA and other training parameters are defined here.
You can refer to the default_train_params.yaml for more details.
In our example we use coco2017_dekr_pose_train_params.yaml, which sets the training parameters as in the DEKR paper.
* dataset_params
- These are the parameters for training on COCO2017. The dataset configuration defines the dataset transformations (augmentations & preprocessing) and the target generator used for training the model.
* arch_params
- These are the parameters for the model architecture. In our example we use DEKRPoseEstimationModel, which is an HRNet-based model with the DEKR decoder.
* checkpoint_params
- These are the default parameters for resuming training and using pretrained checkpoints.
You can refer to the default_checkpoint_params.yaml.
Datasets
There are several well-known datasets for pose estimation: COCO, MPII Human Pose, Hands in the Wild, CrowdPose, etc.
SuperGradients provides a ready-to-use dataset implementation for COCO, COCOKeypointsDataset, and a more general BaseKeypointsDataset base class that you can subclass for your specific dataset format.
Target generators
The target generators are responsible for producing the target tensors for the model. The implementation of a target generator is model-specific and usually includes at least a heatmap with one channel per joint. Each model may require its own target generator implementation that is compatible with the model's output.
All target generators should implement the KeypointsTargetsGenerator interface shown below. The goal of this class is to transform ground-truth annotations into a format suitable for computing a loss and training a model:
# super_gradients.training.datasets.pose_estimation_datasets.target_generators.KeypointsTargetsGenerator
import abc
import numpy as np
from torch import Tensor
from typing import Union, Tuple, Dict
class KeypointsTargetsGenerator:
@abc.abstractmethod
def __call__(self, image: Tensor, joints: np.ndarray, mask: np.ndarray) -> Union[Tensor, Tuple[Tensor, ...], Dict[str, Tensor]]:
"""
Encode input joints into target tensors
:param image: [C,H,W] Input image tensor
:param joints: [Num Instances, Num Joints, 3] Last channel represents (x, y, visibility)
:param mask: [H,W] Mask representing valid image areas. For instance, in COCO dataset crowd targets
are not used during training and corresponding instances will be zero-masked.
Your implementation may use this mask when generating targets.
:return: Encoded targets
"""
raise NotImplementedError()
SuperGradients provides an implementation, DEKRTargetsGenerator, that is compatible with the DEKR model.
If you need to implement your own target generator, please refer to the documentation of the KeypointsTargetsGenerator base class; an illustrative sketch is shown below.
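As an illustration, here is a minimal sketch of a custom target generator that renders one Gaussian heatmap per joint. The class name and its parameters are made up for this example (they are not part of SuperGradients), and a real implementation must produce whatever target format your model's loss expects:

import numpy as np
import torch
from torch import Tensor

from super_gradients.training.datasets.pose_estimation_datasets import KeypointsTargetsGenerator


class SimpleHeatmapTargetsGenerator(KeypointsTargetsGenerator):
    """Illustrative target generator: one Gaussian heatmap per joint."""

    def __init__(self, num_joints: int, output_stride: int = 4, sigma: float = 2.0):
        self.num_joints = num_joints
        self.output_stride = output_stride
        self.sigma = sigma

    def __call__(self, image: Tensor, joints: np.ndarray, mask: np.ndarray) -> Tensor:
        # Heatmap resolution is the input resolution divided by the output stride
        rows, cols = image.shape[1] // self.output_stride, image.shape[2] // self.output_stride
        heatmaps = np.zeros((self.num_joints, rows, cols), dtype=np.float32)

        ys, xs = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
        for instance in joints:
            for joint_index, (x, y, visibility) in enumerate(instance):
                if visibility > 0:
                    # Render a Gaussian blob centered at the (downscaled) joint location
                    cx, cy = x / self.output_stride, y / self.output_stride
                    gaussian = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * self.sigma ** 2))
                    heatmaps[joint_index] = np.maximum(heatmaps[joint_index], gaussian)
        return torch.from_numpy(heatmaps)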
Metrics
Typical metrics for pose estimation are average precision (AP) and average recall (AR).
SuperGradients provides an implementation, PoseEstimationMetrics, to compute AP/AR scores.
The metric is implemented as a callback that is called after each validation step. The implementation follows the official COCO API metric as closely as possible; however, it does NOT compute AP/AR scores per area range. It also natively supports evaluation in DDP mode.
It is worth noting that the AP/AR scores usually reported in papers are obtained using TTA (test-time augmentation) and additional postprocessing on top of the main model.
A horizontal flip is a common TTA technique used to increase the accuracy of the predictions at the cost of running the forward pass twice. A second common technique is a multi-scale approach, in which inference is additionally performed at 0.5x and 1.5x input resolution and the predictions are aggregated.
When training models with SuperGradients, we use neither of these techniques. If you want to measure AP/AR scores with TTA, you will need to write your own evaluation loop; a minimal sketch of horizontal-flip TTA is shown below.
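For illustration only, here is a sketch of horizontal-flip TTA for a heatmap-based model. The function name is hypothetical; `model` is assumed to return per-joint heatmaps of shape [B, J, H, W], and `flip_index` maps each joint channel to its left/right counterpart (the same kind of index list used by the KeypointsRandomHorizontalFlip transform later in this document):

import torch


@torch.no_grad()
def predict_with_flip_tta(model, images: torch.Tensor, flip_index) -> torch.Tensor:
    # Forward pass on the original images
    heatmaps = model(images)
    # Forward pass on horizontally flipped images
    flipped_heatmaps = model(torch.flip(images, dims=[3]))
    # Flip predictions back and swap left/right joint channels before averaging
    flipped_heatmaps = torch.flip(flipped_heatmaps, dims=[3])[:, flip_index, :, :]
    return (heatmaps + flipped_heatmaps) / 2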
In order to use PoseEstimationMetrics, you have to pass a so-called post_prediction_callback to the metric, which is responsible for postprocessing the model's raw output into final predictions.
Postprocessing
Postprocessing refers to the process of transforming the model's raw output into final predictions. It is also model-specific and depends on the model's output format.
For the DEKR model, the postprocessing step is implemented in the DEKRPoseEstimationDecodeCallback class.
When instantiating the metric, one has to pass a postprocessing callback as an argument:
training_hyperparams:
valid_metrics_list:
- PoseEstimationMetrics:
num_joints: ${dataset_params.num_joints}
oks_sigmas: ${dataset_params.oks_sigmas}
max_objects_per_image: 20
post_prediction_callback:
_target_: super_gradients.training.utils.pose_estimation.DEKRPoseEstimationDecodeCallback
max_num_people: 20
keypoint_threshold: 0.05
nms_threshold: 0.05
nms_num_threshold: 8
output_stride: 4
apply_sigmoid: False
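For reference, the same metric can also be constructed in Python. This is a sketch based on the parameter names shown in the YAML above; the import path of PoseEstimationMetrics is assumed here, while DEKRPoseEstimationDecodeCallback is imported from the path given in the _target_ key above:

from super_gradients.training.metrics import PoseEstimationMetrics
from super_gradients.training.utils.pose_estimation import DEKRPoseEstimationDecodeCallback

# Decode callback parameters mirror the YAML configuration above
post_prediction_callback = DEKRPoseEstimationDecodeCallback(
    max_num_people=20,
    keypoint_threshold=0.05,
    nms_threshold=0.05,
    nms_num_threshold=8,
    output_stride=4,
    apply_sigmoid=False,
)

metric = PoseEstimationMetrics(
    post_prediction_callback=post_prediction_callback,
    num_joints=17,
    # COCO OKS sigmas (same values as used in dataset_params)
    oks_sigmas=[0.026, 0.025, 0.025, 0.035, 0.035, 0.079, 0.079, 0.072, 0.072,
                0.062, 0.062, 0.107, 0.107, 0.087, 0.087, 0.089, 0.089],
    max_objects_per_image=20,
)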
Visualization
Visualization of the model predictions is a very important part of the training process for pose estimation models. By visualizing the predicted poses, developers and researchers can identify errors or inaccuracies in the model's output and adjust the model's architecture or training data accordingly.
Overall, visualization is an important tool for improving the accuracy and usability of pose estimation models, both during development and in real-world applications.
SuperGradients provides an implementation, DEKRVisualizationCallback, to visualize predictions of the DEKR model.
You can use this callback in your training pipeline to visualize predictions during training. To enable this callback, add the following lines to your training YAML recipe:
training_hyperparams:
resume: ${resume}
phase_callbacks:
- DEKRVisualizationCallback:
phase:
_target_: super_gradients.training.utils.callbacks.callbacks.Phase
value: TRAIN_BATCH_END
prefix: "train_"
mean: [ 0.485, 0.456, 0.406 ]
std: [ 0.229, 0.224, 0.225 ]
apply_sigmoid: False
- DEKRVisualizationCallback:
phase:
_target_: super_gradients.training.utils.callbacks.callbacks.Phase
value: VALIDATION_BATCH_END
prefix: "val_"
mean: [ 0.485, 0.456, 0.406 ]
std: [ 0.229, 0.224, 0.225 ]
apply_sigmoid: False
During training, the callback will generate a visualization of the model predictions and save it to TensorBoard or Weights & Biases, depending on which logger you are using (the default is TensorBoard). The result will look like this:
On the left side of the image is the input image with a ground-truth keypoints overlay, and on the right side is the channel-wise sum of the target and predicted heatmaps.
How to connect your own dataset
To add a new dataset to SuperGradients, you need to implement a few things:
- Implement a new dataset class
- Implement new dataloader factory methods
- Add a configuration file
Let's unwrap each of these steps.
Implement a new dataset class
To train an existing architecture on a new dataset, one needs to implement the dataset class first.
It is generally a good idea to subclass BaseKeypointsDataset, which gives you a skeleton of a dataset class and asks you to implement only a few methods to prepare your data for training.
A minimal implementation of a dataset class should look like this:
from super_gradients.training.datasets.pose_estimation_datasets import BaseKeypointsDataset
from super_gradients.training.datasets.pose_estimation_datasets import KeypointsTargetsGenerator
from super_gradients.training.transforms.keypoint_transforms import KeypointTransform
from typing import Tuple, Dict, Any, List
import numpy as np
import cv2
class MyNewPoseEstimationDataset(BaseKeypointsDataset):
def __init__(
self,
image_paths,
joint_paths,
target_generator: KeypointsTargetsGenerator,
transforms: List[KeypointTransform],
min_instance_area: float = 0.0,
):
super().__init__(target_generator, transforms, min_instance_area)
self.image_paths = image_paths
self.joint_paths = joint_paths
def __len__(self) -> int:
return len(self.image_paths)
def load_sample(self, index) -> Tuple[np.ndarray, np.ndarray, np.ndarray, Dict[str, Any]]:
"""
Read a sample from the disk and return (image, mask, joints, extras) tuple
:param index: Sample index
:return: Tuple of (image, mask, joints, extras)
image - Numpy array of [H,W,3] shape, which represents input RGB image
mask - Numpy array of [H,W] shape, which represents a binary mask with zero values corresponding to an
ignored region which should not be used for training (contribute to loss)
joints - Numpy array of [Num Instances, Num Joints, 3] shape, which represents the skeletons of the instances
extras - Dictionary of extra information about the sample that should be included in `extras` dictionary.
"""
        # Read image from the disk and convert from OpenCV's BGR to the RGB format expected by the pipeline
        image = cv2.imread(self.image_paths[index])
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        mask = np.ones(image.shape[:2])
        # Joints are expected to have [Num Instances, Num Joints, 3] shape
        joints = np.loadtxt(self.joint_paths[index])
        return image, mask, joints, {}
Implement new dataloader factory methods
from typing import Dict

from super_gradients.training.dataloaders import get_data_loader


def my_new_dataset_pose_train(dataset_params: Dict = None, dataloader_params: Dict = None):
    return get_data_loader(
        config_name="my_new_dataset_dataset_params",
dataset_cls=MyNewPoseEstimationDataset,
train=True,
dataset_params=dataset_params,
dataloader_params=dataloader_params,
)
def my_new_dataset_pose_val(dataset_params: Dict = None, dataloader_params: Dict = None):
return get_data_loader(
        config_name="my_new_dataset_dataset_params",
dataset_cls=MyNewPoseEstimationDataset,
train=False,
dataset_params=dataset_params,
dataloader_params=dataloader_params,
)
Add a configuration file
Create a new my_new_dataset_dataset_params.yaml file under the dataset_params folder. For the sake of simplicity, let's assume that we're going to train a DEKR model on human joints (17 keypoints, as in COCO).
Then, the full configuration file should look like this:
# my_new_dataset_dataset_params.yaml
num_joints: 17
# OKS sigma values taken from https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocotools/cocoeval.py#L523
oks_sigmas: [0.026, 0.025, 0.025, 0.035, 0.035, 0.079, 0.079, 0.072, 0.072, 0.062, 0.062, 0.107, 0.107, 0.087, 0.087, 0.089, 0.089]
train_dataset_params:
image_paths: /my_new_dataset/train/images
joint_paths: /my_new_dataset/train/annotations
min_instance_area: 128
transforms:
- KeypointsLongestMaxSize:
max_height: 640
max_width: 640
- KeypointsPadIfNeeded:
min_height: 640
min_width: 640
image_pad_value: [ 127, 127, 127 ]
mask_pad_value: 1
- KeypointsRandomHorizontalFlip:
# Note these indexes are COCO-specific. If you're using a different dataset, you'll need to change these accordingly.
flip_index: [ 0, 2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 12, 11, 14, 13, 16, 15 ]
prob: 0.5
- KeypointsRandomAffineTransform:
max_rotation: 30
min_scale: 0.75
max_scale: 1.5
max_translate: 0.2
image_pad_value: [ 127, 127, 127 ]
mask_pad_value: 1
prob: 0.5
- KeypointsImageToTensor
- KeypointsImageNormalize:
mean: [ 0.485, 0.456, 0.406 ]
std: [ 0.229, 0.224, 0.225 ]
target_generator:
DEKRTargetsGenerator:
output_stride: 4
sigma: 2
center_sigma: 4
bg_weight: 0.1
offset_radius: 4
val_dataset_params:
  image_paths: /my_new_dataset/val/images
  joint_paths: /my_new_dataset/val/annotations
min_instance_area: 128
transforms:
- KeypointsLongestMaxSize:
max_height: 640
max_width: 640
- KeypointsPadIfNeeded:
min_height: 640
min_width: 640
image_pad_value: [ 127, 127, 127 ]
mask_pad_value: 1
- KeypointsImageToTensor
- KeypointsImageNormalize:
mean: [0.485, 0.456, 0.406]
std: [0.229, 0.224, 0.225]
target_generator:
DEKRTargetsGenerator:
output_stride: 4
sigma: 2
center_sigma: 4
bg_weight: 0.1
offset_radius: 4
train_dataloader_params:
shuffle: True
batch_size: 8
num_workers: 8
drop_last: True
worker_init_fn:
_target_: super_gradients.training.utils.utils.load_func
dotpath: super_gradients.training.datasets.datasets_utils.worker_init_reset_seed
collate_fn:
_target_: super_gradients.training.datasets.pose_estimation_datasets.KeypointsCollate
val_dataloader_params:
batch_size: 24
num_workers: 8
drop_last: False
collate_fn:
_target_: super_gradients.training.datasets.pose_estimation_datasets.KeypointsCollate
_convert_: all
In your training recipe, add or change the following lines:
# my_new_dataset_train_recipe.yaml
defaults:
- training_hyperparams: ...
- dataset_params: my_new_dataset_dataset_params
- arch_params: ...
- checkpoint_params: ...
- _self_
train_dataloader: my_new_dataset_pose_train
val_dataloader: my_new_dataset_pose_val
...
And you should be good to go!
How to add a new model
To implement a new model, you need to add the following parts:
- Model architecture itself
- Target Generator
- Postprocessing Callback
- (Optional) Visualization Callback
A custom target generator class should inherit from the KeypointsTargetsGenerator base class, which provides a protocol for generating target tensors from the ground-truth keypoints.
See DEKRTargetsGenerator for more details.
A custom postprocessing callback class should have a forward method that takes the raw model predictions and decodes them into final pose predictions.
See DEKRPoseEstimationDecodeCallback for more details.
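For illustration, below is a minimal sketch of what such a callback could look like. The class is hypothetical and only shows the general structure (an nn.Module whose forward decodes raw outputs into per-image poses and scores); the exact input/output formats expected by PoseEstimationMetrics should be taken from DEKRPoseEstimationDecodeCallback:

import torch
from torch import nn


class MyPoseDecodeCallback(nn.Module):
    """Hypothetical decode callback: converts raw model outputs into final pose predictions."""

    def __init__(self, keypoint_threshold: float = 0.05):
        super().__init__()
        self.keypoint_threshold = keypoint_threshold

    @torch.no_grad()
    def forward(self, predictions):
        # `predictions` is whatever your model returns (e.g. heatmaps & offsets for DEKR-like models).
        # For each image, produce poses of [Num Instances, Num Joints, 3] shape (x, y, confidence)
        # and a per-instance confidence score.
        all_poses, all_scores = [], []
        for image_predictions in predictions:
            poses, scores = self.decode_single_image(image_predictions)
            all_poses.append(poses)
            all_scores.append(scores)
        return all_poses, all_scores

    def decode_single_image(self, image_predictions):
        # Model-specific decoding (peak finding, grouping, NMS, thresholding) goes here
        raise NotImplementedError()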
A custom visualization callback class can inherit from the PhaseCallback or Callback base class to generate a visualization of the model predictions.
See DEKRVisualizationCallback for more details.
Rescoring
Rescoring is a third stage of pose estimation (after the model forward pass and NMS) aimed at improving the confidence scores of the predicted poses. In a nutshell, rescoring multiplies the final confidence score predicted by the model by a scalar value computed by a separate rescoring model. By incorporating learned prior knowledge about body structure (in the form of joint linkage information), the rescoring model can adjust the final pose confidence, down-weighting inaccurate or unlikely poses and increasing the confidence of poses that are more likely to be correct.
A rescoring model is a simple MLP that takes the model predictions, a tensor of [B, J, 3] shape, as input and outputs a single score for each pose prediction as a tensor of [B, 1] shape.
Here B is the batch dimension, J is the number of joints and 3 is the dimension of the joint coordinates (x, y, confidence).
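For illustration, here is a minimal sketch of such an MLP. It is a generic example, not the rescoring network shipped with SuperGradients: it flattens the [B, J, 3] pose tensor and predicts one score per pose.

import torch
from torch import nn


class PoseRescoringMLP(nn.Module):
    """Illustrative rescoring MLP: maps a [B, J, 3] pose tensor to a [B, 1] score tensor."""

    def __init__(self, num_joints: int = 17, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_joints * 3, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, poses: torch.Tensor) -> torch.Tensor:
        # poses: [B, J, 3] -> scores: [B, 1]
        return self.net(poses.flatten(start_dim=1))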
To train a rescoring model, you need to have a pretrained pose estimation model first.
Training a rescoring model differs from regular training in the following ways:
1. Generate the training data.
To train a rescoring model you first need to generate the training data. This assumes that you have a pretrained pose estimation model. To generate the dataset for the rescoring model, we run inference on the original dataset (COCO in this example) using the pretrained pose estimation model and save its predictions to Pickle files.
The rescoring model inputs are poses of [B, J, 3] shape and the outputs are the rescoring scores of [B, 1] shape. The targets are the object-keypoint similarity (OKS) scores computed between each predicted pose and its matching ground-truth pose (see the OKS sketch at the end of this section).
Currently, rescoring is only supported for the DEKR architecture.
python -m super_gradients.script.generate_rescoring_training_data --config-name=script_generate_rescoring_data_dekr_coco2017 rescoring_data_dir=OUTPUT_DATA_DIR checkpoint=PATH_TO_TRAINED_MODEL_CHECKPOINT
2. Train the rescoring model.
The training data will be stored in the output folder (in the example we use the OUTPUT_DATA_DIR placeholder). Once it has been generated, you can use these files to train the rescoring model:
python -m super_gradients.train_from_recipe --config-name coco2017_pose_dekr_rescoring \
dataset_params.train_dataset_params.pkl_file=OUTPUT_DATA_DIR/rescoring_data_train.pkl \
dataset_params.val_dataset_params.pkl_file=OUTPUT_DATA_DIR/rescoring_data_valid.pkl
This recipe uses a custom callback to compute pose estimation metrics on the validation dataset, using the pose coordinates from step 1 and the confidence values after rescoring.
See the integration test case test_dekr_model_with_rescoring for more details and an end-to-end usage example.
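For reference, here is a sketch of how the OKS target for a single predicted pose could be computed against its matching ground-truth pose, following the formula in COCO's cocoeval.py. The helper name and the matching logic (which predicted pose corresponds to which ground-truth instance) are illustrative only:

import numpy as np


def compute_oks(pred_joints: np.ndarray, gt_joints: np.ndarray, gt_area: float, sigmas: np.ndarray) -> float:
    """
    Illustrative OKS computation between one predicted and one ground-truth pose.
    :param pred_joints: [Num Joints, 3] predicted (x, y, confidence)
    :param gt_joints:   [Num Joints, 3] ground-truth (x, y, visibility)
    :param gt_area:     area of the ground-truth instance
    :param sigmas:      [Num Joints] per-joint OKS sigmas
    """
    visible = gt_joints[:, 2] > 0
    if not visible.any():
        return 0.0
    # Squared distances between predicted and ground-truth keypoints
    d2 = (pred_joints[:, 0] - gt_joints[:, 0]) ** 2 + (pred_joints[:, 1] - gt_joints[:, 1]) ** 2
    variances = (2 * sigmas) ** 2
    e = d2 / (2 * variances * (gt_area + np.spacing(1)))
    # Average the per-keypoint similarity over labeled keypoints only
    return float(np.mean(np.exp(-e[visible])))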