Total execution time of `model.predict()` is way higher than inference time

Given this simple benchmark of inference speed

import os
import time
import psutil
from PIL import Image
from ultralytics import YOLO

detector = YOLO("yolo12m.pt", "cpu")

image_path = "debug/images/face_4096.jpg"
image = Image.open(image_path).convert("RGB")

process = psutil.Process(os.getpid())

while True:
    start_time = time.time()
    results = detector.predict(image)
    end_time = time.time()

    print(f"Detection took: {(end_time - start_time) * 1000:.4f} milliseconds")

I get sample output like this:

...
0: 640x640 1 person, 2 chairs, 52.4ms
Speed: 3.4ms preprocess, 52.4ms inference, 1.2ms postprocess per image at shape (1, 3, 640, 640)
Detection took: 226.1436 milliseconds

0: 640x640 1 person, 2 chairs, 52.4ms
Speed: 3.3ms preprocess, 52.4ms inference, 1.1ms postprocess per image at shape (1, 3, 640, 640)
Detection took: 227.5164 milliseconds

0: 640x640 1 person, 2 chairs, 52.5ms
Speed: 3.7ms preprocess, 52.5ms inference, 1.3ms postprocess per image at shape (1, 3, 640, 640)
Detection took: 226.4342 milliseconds

0: 640x640 1 person, 2 chairs, 52.5ms
Speed: 3.4ms preprocess, 52.5ms inference, 1.2ms postprocess per image at shape (1, 3, 640, 640)
Detection took: 231.9701 milliseconds

0: 640x640 1 person, 2 chairs, 52.4ms
Speed: 3.7ms preprocess, 52.4ms inference, 1.3ms postprocess per image at shape (1, 3, 640, 640)
Detection took: 235.8549 milliseconds

0: 640x640 1 person, 2 chairs, 52.4ms
Speed: 3.8ms preprocess, 52.4ms inference, 1.3ms postprocess per image at shape (1, 3, 640, 640)
Detection took: 234.9603 milliseconds

0: 640x640 1 person, 2 chairs, 52.5ms
Speed: 3.6ms preprocess, 52.5ms inference, 1.2ms postprocess per image at shape (1, 3, 640, 640)
Detection took: 232.1908 milliseconds

0: 640x640 1 person, 2 chairs, 52.4ms
Speed: 3.8ms preprocess, 52.4ms inference, 1.4ms postprocess per image at shape (1, 3, 640, 640)
Detection took: 237.7448 milliseconds
...

Here we can see that the total execution time (~226-238 ms) is much longer than the sum of the reported preprocess + inference + postprocess times (~57 ms).

Note: I have tested with the stream=True argument, and I get the same behavior.

The major difference is that you're measuring at the application level, not at the level of the code being executed. That measurement includes a lot of additional overhead, which can easily increase the observed time.
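
As a rough illustration (plain Python, not the actual Ultralytics code): the library only reports the stages it explicitly wraps, while your time.time() around predict() also captures everything else that happens inside the call:

import time

def other_work(image):
    # stand-in for source checks, conversions, results construction,
    # callbacks, etc. (hypothetical sleep, just to make the gap visible)
    time.sleep(0.17)

def timed_stages(image):
    # stand-in for the preprocess/inference/postprocess the library reports
    time.sleep(0.057)

image = object()

t0 = time.perf_counter()
timed_stages(image)
print(f"reported stages: {(time.perf_counter() - t0) * 1000:.1f} ms")

t0 = time.perf_counter()
other_work(image)
timed_stages(image)
print(f"whole call:      {(time.perf_counter() - t0) * 1000:.1f} ms")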

The times reported by the ultralytics library are measured here:

and the Profile class used to measure can be found here:

After reading predictor.stream_inference, it is not clear to me what takes all those milliseconds outside of the preprocess, inference, and postprocess functions.

Regardless, is there a way to speed up that surrounding code, or is that overhead just expected?

Note: I have found a lot of information on speeding up inference by exporting to other formats such as TensorRT and ONNX, quantizing, etc., but no information on speeding up the surrounding code.

You can try bypassing the intermediate steps and performing inference directly:
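
For example, a minimal sketch of what that can look like (my own simplified version, not necessarily what was linked here): call the underlying torch module on a tensor you build yourself and run NMS on the raw output, skipping the predictor's source handling and Results construction. The preprocessing below is simplified (plain resize, no letterboxing), and it assumes non_max_suppression from ultralytics.utils.ops:

import cv2
import torch
from ultralytics import YOLO
from ultralytics.utils.ops import non_max_suppression

model = YOLO("yolo12m.pt")
net = model.model.eval()  # underlying torch nn.Module

im_bgr = cv2.imread("debug/images/face_4096.jpg")

# Manual preprocess: resize, BGR -> RGB, HWC -> BCHW, float32 in [0, 1]
im = cv2.resize(im_bgr, (640, 640))[:, :, ::-1].copy()
t = torch.from_numpy(im).permute(2, 0, 1)[None].float() / 255

with torch.no_grad():
    preds = net(t)

# Manual postprocess: NMS on the raw predictions
dets = non_max_suppression(preds)
print(dets[0].shape)  # (num_detections, 6): x1, y1, x2, y2, conf, cls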

Thanks, your response led me down the path of testing the source type.

It turns out a lot of the time is spent converting the PIL.Image to an np.ndarray every time predict() is called, so I now convert it once up front:

image = np.array(image)  # convert PIL -> NumPy once, before the predict loop

I originally assumed the data type conversion would be handled by the preprocess function.
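
For anyone landing here later, this is roughly the adjusted benchmark (same script as above, with the conversion hoisted out of the loop):

import time
import numpy as np
from PIL import Image
from ultralytics import YOLO

detector = YOLO("yolo12m.pt")
image = Image.open("debug/images/face_4096.jpg").convert("RGB")
image_np = np.array(image)  # convert PIL -> NumPy once, before the loop
# note: Ultralytics treats NumPy input as BGR (OpenCV order), so detections
# may differ slightly from the PIL path unless you flip the channels

while True:
    start_time = time.time()
    results = detector.predict(image_np)
    end_time = time.time()
    print(f"Detection took: {(end_time - start_time) * 1000:.4f} milliseconds")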

Great find — source type matters a lot.

Ultralytics does handle conversions in preprocess, but giving it a NumPy array or Torch tensor avoids repeated PIL→NumPy work and path/loader setup each call. For tight loops, pass a preconverted np.ndarray (HWC, BGR uint8) or a preallocated torch.Tensor (BCHW, float32 in [0,1]) and reuse the same YOLO instance so predictor/dataset aren’t rebuilt. Also disable extras like save/show/verbose.

Minimal example:
from ultralytics import YOLO
import cv2, torch

model = YOLO("yolo11n.pt")  # recommend YOLO11 for best performance
im_bgr = cv2.imread("debug/images/face_4096.jpg")  # np.ndarray (HWC, BGR, uint8)

# or a preallocated tensor (note the .copy() to fix the negative stride from the BGR -> RGB flip):
# im_t = torch.from_numpy(im_bgr[..., ::-1].copy()).permute(2, 0, 1)[None].float() / 255

for _ in range(1000):
    _ = model.predict(im_bgr, stream=False, verbose=False, save=False, imgsz=640)

If you still see large gaps versus the internal speed breakdown, try stream=True generators for videos/directories, and profile the code outside of preprocess/inference/postprocess; some overhead can come from dataset/source setup, callbacks, plotting/logging, and the Python loop itself. The predictor's preprocess and call flow are documented in the Ultralytics engine predictor reference.
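
If you want to see exactly where the remaining milliseconds go, one option is to wrap a few predict() calls in the standard library profiler (a generic sketch, independent of Ultralytics internals):

import cProfile
import pstats

import numpy as np
from ultralytics import YOLO

model = YOLO("yolo12m.pt")
im = np.zeros((640, 640, 3), dtype=np.uint8)  # dummy frame, just for profiling

model.predict(im, verbose=False)  # warm-up so one-time setup isn't counted

profiler = cProfile.Profile()
profiler.enable()
for _ in range(20):
    model.predict(im, verbose=False)
profiler.disable()

# Sort by cumulative time to spot overhead outside preprocess/inference/postprocess
pstats.Stats(profiler).sort_stats("cumulative").print_stats(25)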
