Openai/691c33ba-8898-800c-b30f-1383bae461b1
This conversation was exported from ChatGPT[1]
gpt-5-1-thinking
User: inference speed up possibilities: We have rfdetr for detection in video, rtmpose from mmdetection for pose, some dino b…
inference speed up possibilities: We have rfdetr for detection in video, rtmpose from mmdetection for pose, some dino based action recognition
we are using batched SAHI like detection
we are running inference with this decorator

import torch
from functools import wraps

def torch_inference_mode(func):
    """Decorator that runs a function under torch inference mode for faster runtime.

    IMPORTANT: The function cannot be decorated if model initialization is done
    inside the wrapped function call.
    """
    @wraps(func)
    def wrapper(*args, **kwargs):
        with torch_inference_mode_block():
            return func(*args, **kwargs)
    return wrapper

def torch_inference_mode_block():
    """Context manager that runs code under torch inference mode for faster runtime.
    This can be used as a general block for inference-related operations.

    Example:
        with torch_inference_mode_block():
            # Your inference code here
    """
    class TorchInferenceModeBlock:
        def __enter__(self):
            # Enter inference mode (no autograd tracking) and CUDA autocast together.
            self.inference_mode = torch.inference_mode()
            self.autocast = torch.amp.autocast('cuda')
            self.inference_mode.__enter__()
            self.autocast.__enter__()
            return self

        def __exit__(self, exc_type, exc_value, traceback):
            # Exit in reverse order of entry.
            self.autocast.__exit__(exc_type, exc_value, traceback)
            self.inference_mode.__exit__(exc_type, exc_value, traceback)

    return TorchInferenceModeBlock()
we are not using onnx, no quantized models
How to speed up the inference mainly detection and pose. Sort by complexity / effect ratio.
Assistant: Cool, this is exactly the level of detail I like 😄 Let’s go from highest effect / lowest complexity → to more invasive…
Cool, this is exactly the level of detail I like 😄 Let’s go from highest effect / lowest complexity → to more invasive stuff, focusing on detection + pose.
0. Quick baseline (15–30 min)
Before changing anything, make sure you know where time goes:
- Run with torch.profiler around the whole pipeline (preproc → detection → SAHI tiling → pose → postproc).
- Watch nvidia-smi dmon while running:
  - If GPU util is low (<60%), you’re CPU / I/O / Python-bound.
  - If GPU is pegged, it’s model / kernel-bound.
This will tell you whether to focus on batching/pipeline vs model-level tricks.
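For example, a minimal profiling sketch; run_pipeline and the dummy frames here are placeholders for your real batch loop:
import torch
from torch.profiler import profile, ProfilerActivity

def run_pipeline(frames):
    # Placeholder for your real preproc -> detection -> SAHI -> pose -> postproc chain.
    return [f.mean() for f in frames]

frames = [torch.rand(3, 720, 1280) for _ in range(8)]  # dummy frames for illustration

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True) as prof:
    run_pipeline(frames)

# Top ops by CUDA time: shows whether detection, pose or pre/post dominates.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))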
1. “Cheap” wins that often give 20–50% (or more)
1.1 Batch smarter: frames + tiles together
Impact: high • Complexity: low–medium
You said “batched SAHI-like detection” – good. A few extra tweaks:
- Batch across frames as well as tiles where possible (see the sketch at the end of this subsection):
  - Instead of: for frame in frames → SAHI tiles → model per frame.
  - Do: collect tiles from several frames → run one big batch through RFDetr.
- For pose: batch all crops from a frame (or several frames) and run RTMPose once per batch.
Rule of thumb: try to push batch size until:
- GPU mem ~80–90% full
- No OOM, and GPU util is high.
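A rough sketch of cross-frame tile batching; the make_tiles(frame) helper (yielding tile tensor + offset) is hypothetical, and the detector is assumed to take a plain batched tensor:
import torch

@torch.inference_mode()
def detect_tiles_batched(detector, frames, make_tiles, max_batch=64, device="cuda"):
    # Collect SAHI-style tiles from several frames, then run a few big forwards
    # instead of one forward per frame.
    tiles, meta = [], []
    for frame_idx, frame in enumerate(frames):
        for tile, offset in make_tiles(frame):      # hypothetical tiling helper
            tiles.append(tile)
            meta.append((frame_idx, offset))        # needed to map boxes back to frames

    outputs = []
    for start in range(0, len(tiles), max_batch):
        batch = torch.stack(tiles[start:start + max_batch]).to(device, non_blocking=True)
        outputs.extend(detector(batch))             # one forward per large batch
    return outputs, meta                            # remap boxes to frame coords via meta
The same idea applies to pose: stack all crops from one frame (or several) and call RTMPose once per stacked batch.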
1.2 Use half precision end-to-end (not just autocast)
Impact: medium–high • Complexity: low
Right now your decorator does:
self.autocast = torch.amp.autocast('cuda')
Consider:
- Put models permanently in FP16:
model = model.half().eval().to("cuda")
- Make sure inputs are float16 too, or wrap autocast only around the actual forward.
For many transformer-ish models (RFDetr, DINO) this is a solid speed + memory win.
Watch out for:
- Any ops that don’t support FP16 (you might need a small to(torch.float32) around them).
===== 1.3 Make your decorator more focused =====
Impact: small–medium • Complexity: low
Right now torch_inference_mode_block() wraps everything under inference_mode + autocast (including Python control logic). Nice and clean, but you can:
- Narrow the scope so it only wraps the model forward passes, not the whole big function that does pre/post-processing, SAHI tiling, etc.
- That reduces some context-manager overhead and makes it easier to ensure that:
  - SAHI / CPU side runs in normal mode
  - Only GPU forwards are in autocast/inference_mode.
Not huge, but free.
===== 1.4 Turn on channels_last + pinned memory =====
Impact: medium • Complexity: low
For conv-heavy parts (RTMPose especially):
model = model.to(memory_format=torch.channels_last)
input = input.to(device, memory_format=torch.channels_last, non_blocking=True)
And for DataLoader / frame feeding:
* Use pin_memory=True and .to(device, non_blocking=True) to overlap H2D copies with compute.
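A minimal sketch of the pinned-memory + non-blocking pattern; the TensorDataset is just a stand-in for your decoded frames or pose crops:
import torch
from torch.utils.data import DataLoader, TensorDataset

crops = TensorDataset(torch.rand(256, 3, 256, 192))     # stand-in for pose crops / frames
loader = DataLoader(crops, batch_size=32, num_workers=4, pin_memory=True)

for (batch,) in loader:
    # Pinned source + non_blocking=True lets the H2D copy overlap with earlier GPU work.
    batch = batch.to("cuda", non_blocking=True).to(memory_format=torch.channels_last)
    # ... model forward here ...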
===== 1.5 Reduce resolution / SAHI aggressiveness =====
Impact: huge if you can afford it • Complexity: low
* Check if you can:
  - Run RFDetr on downscaled frames (e.g. 1080p → 720p), then rescale boxes back (see the sketch below).
  - Make SAHI tiles larger and with less overlap: fewer tiles → fewer forwards.
* Often, a small drop in detection quality = massive speedup.
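A minimal sketch of detecting on a downscaled frame and mapping boxes back; the scale factor and the (boxes, scores, labels) output format are assumptions:
import torch
import torch.nn.functional as F

def detect_downscaled(detector, frame, scale=2 / 3):
    # frame: (3, H, W) tensor; run the detector at reduced resolution.
    small = F.interpolate(frame.unsqueeze(0), scale_factor=scale,
                          mode="bilinear", align_corners=False)
    boxes, scores, labels = detector(small)   # assumed output format
    boxes = boxes / scale                     # rescale xyxy boxes to original resolution
    return boxes, scores, labels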
===== 1.6 cuDNN / PyTorch 2 compiler =====
Impact: small–medium per model, sometimes big • Complexity: low
* Set:
torch.backends.cudnn.benchmark = True  # most useful when input shapes are fixed or repeat often
- If you’re on PyTorch 2:
model = torch.compile(model, mode="max-autotune")
Especially for RTMPose and DINO; watch for weird dynamic shapes (try to keep SAHI tile sizes fixed).
2. Pipeline-level tricks (bigger wins, more thinking)
2.1 Run detection less frequently + track in between
Impact: often 2–10× on detection cost • Complexity: medium
For video you do not need full RFDetr every frame.
Typical pattern:
- Run RFDetr every N frames (e.g. every 3–5 frames).
- Between those, use a tracker (ByteTrack / StrongSORT / even simple IoU + velocity) to propagate boxes.
- Only rerun detection when:
  - Track confidence drops
  - The scene changes a lot (background difference, many lost tracks, etc.)
That alone can cut detector calls by ~3–5×.
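A schematic version of this pattern; the tracker interface (update_with_detections, predict, needs_redetect) is hypothetical, stand in whatever ByteTrack / IoU tracker you use:
def detect_or_track(detector, tracker, frames, detect_every=4):
    # Full RFDetr (+ SAHI) only on keyframes; cheap box propagation in between.
    results = []
    for i, frame in enumerate(frames):
        if i % detect_every == 0 or tracker.needs_redetect():   # e.g. low track confidence
            boxes = detector(frame)                             # expensive path
            tracker.update_with_detections(frame, boxes)
        else:
            boxes = tracker.predict(frame)                      # cheap propagation
        results.append(boxes)
    return results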
2.2 Subsample pose frames & entities
Impact: large if pose is heavy • Complexity: low–medium
For RTMPose:
- If action recognition doesn’t need per-frame pose, run pose:
  - every K frames only, or
  - only on keyframes used in DINO’s temporal window.
- Only run pose on:
  - Top-K detections by score
  - Objects that are relevant (e.g. horse + rider classes; skip background people, etc.); a filtering sketch follows.
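A small filtering sketch, assuming detections come as tensors of boxes, scores, and integer labels; the names and thresholds are illustrative:
import torch

def select_pose_candidates(boxes, scores, labels, keep_labels, top_k=10, min_score=0.5):
    # Keep only relevant classes and the top-K highest-scoring boxes for RTMPose.
    mask = (scores >= min_score) & torch.isin(labels, keep_labels)
    boxes, scores = boxes[mask], scores[mask]
    keep = scores.topk(min(top_k, scores.numel())).indices
    return boxes[keep]

# Example: only run pose on up to 10 confident "horse" (0) / "rider" (1) detections.
# selected = select_pose_candidates(boxes, scores, labels, keep_labels=torch.tensor([0, 1]))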
2.3 Fuse pose + action recognition logic
Impact: medium–high • Complexity: medium
DINO-based action recognition likely uses:
- RGB clip alone, or
- RGB + pose (joint coordinates).
You can:
- Align the pose frame sampling with the clip sampling of DINO: if DINO uses e.g. 16 frames sampled over 2 seconds, only compute pose on those 16 frames, not on every single frame (see the sketch below).
- Precompute pose batch for whole clip and feed it in one go to DINO (or the head that uses pose).
This avoids redundant pose calls.
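A small sketch of aligning pose with the clip sampling; uniform 16-frame sampling is an assumption here, mirror whatever your DINO pipeline actually uses:
import numpy as np

def clip_frame_indices(clip_start, clip_len_frames, num_samples=16):
    # Frame indices the action model will see for one clip; run pose only on these.
    return np.linspace(clip_start, clip_start + clip_len_frames - 1,
                       num=num_samples, dtype=int)

# Example: a 2 s clip at 30 fps starting at frame 900 -> 16 pose calls instead of 60.
pose_frames = clip_frame_indices(clip_start=900, clip_len_frames=60)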
2.4 Asynchronous pipeline: pre/post on CPU, models on GPU
Impact: medium • Complexity: medium
- Use producer/consumer queues (a threaded sketch follows below):
  - Thread/process 1: read video, decode, pre-process, enqueue batches.
  - Thread/process 2 (GPU worker): picks batches, runs RFDetr/RTMPose/DINO, enqueues results.
  - Thread/process 3: post-process + write outputs.
For Python, even multiprocessing + torch.multiprocessing can give nice speed-ups if you were CPU-bound in decoding/SAHI/cropping.
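A minimal threaded sketch of the three-stage pattern; the stage functions are stand-ins, and with heavy Python-side preprocessing you would likely swap threads for processes:
import queue
import threading

# Stand-ins for the real stages: decode/preproc/SAHI, model forwards, postproc/writing.
def decode_batches(video_path):
    yield from range(3)

def run_models(batch):
    return f"results for batch {batch}"

def write_results(result):
    print(result)

batches, results = queue.Queue(maxsize=8), queue.Queue(maxsize=8)

def producer(path):
    for b in decode_batches(path):      # CPU: read video, decode, pre-process, tile
        batches.put(b)
    batches.put(None)                   # sentinel: no more work

def gpu_worker():
    while (b := batches.get()) is not None:
        results.put(run_models(b))      # GPU: RFDetr / RTMPose / DINO forwards
    results.put(None)

def consumer():
    while (r := results.get()) is not None:
        write_results(r)                # CPU: post-process + write outputs

threads = [threading.Thread(target=producer, args=("video.mp4",)),
           threading.Thread(target=gpu_worker),
           threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()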
3. Heavier but powerful options
3.1 ONNX / TensorRT / Torch-TensorRT
Impact: often 1.5–3× per model • Complexity: medium–high
You said you’re not using ONNX / quantization – fair. But for production video pipelines these are worth planning:
- Export RFDetr & RTMPose to ONNX.
- Build TensorRT engines with:
  - FP16 kernels
  - Fixed input shapes (helps especially with SAHI if you standardize tile size).
- Wrap them behind the same API so the upper pipeline doesn’t care.
This often beats raw PyTorch, especially on repeated inference with same shapes.
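A minimal ONNX export sketch; the toy module, input size and opset are placeholders, and it is worth checking whether the model repos provide dedicated export tooling (e.g. MMDeploy for RTMPose), which is usually the safer route:
import torch
from torch import nn

# Stand-in module; in practice export the actual detector / pose model.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU()).eval()
dummy = torch.randn(1, 3, 640, 640)          # fixed tile size helps TensorRT later

torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["images"], output_names=["outputs"],
    opset_version=17,
    dynamic_axes={"images": {0: "batch"}},   # keep H/W fixed, allow variable batch
)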
3.2 INT8 / hybrid quantization
Impact: up to 2–4× vs FP32 • Complexity: high
If you can tolerate some accuracy loss:
- Calibrate on a handful of “typical” videos.
- Use INT8 for detector backbone + neck, maybe keep heads in FP16 if they’re sensitive.
I’d only go here once you’ve exhausted FP16 + pipeline optimizations.
3.3 Model-level simplifications / distillation
Impact: huge long-term • Complexity: high
- Distill RFDetr to a smaller variant or even a YOLO-like model if that’s “good enough” for your task.
- Distill RTMPose to a smaller backbone.
- If DINO-based action recognition is the bottleneck, distill it to a lighter temporal model (e.g. a small 3D CNN or transformer head working on pose only).
This is real research/engineering work, but it gives you structured control over the speed/accuracy trade-off.
4. Very concrete checklist you can try next
If I had your codebase in front of me, I’d do this in roughly this order:
- ✅ Confirm all models are .eval() and on GPU, with torch.inference_mode or no_grad around only their forwards.
- ✅ Set:
  - model.half() + FP16 inputs (or at least clean autocast scopes).
  - channels_last and pinned memory.
- ✅ Max out useful batch size across:
  - SAHI tiles
  - Multiple frames
  - Pose crops
- ✅ Turn on torch.backends.cudnn.benchmark and, if on torch 2, try torch.compile on RFDetr & RTMPose.
- ✅ Reduce detection frequency using a simple tracker between keyframes.
- ✅ Only compute pose on:
  - keyframes
  - key entities
  - and align with DINO clip sampling.
- ✅ If still too slow → prototype TensorRT/ONNX for one model (probably RFDetr) and measure.
If you want, you can paste a simplified version of your detection+pose loop (just the control flow, no proprietary bits), and I can annotate it with concrete “put autocast here”, “batch these tensors here”, “move this to another process” comments.
References
- ↑ "Inference speed up tips". ChatGPT. Retrieved 2025-12-05.