Total execution time of `model.predict()` is way higher than inference time

Given this simple benchmark of inference speed

import os
import time
import psutil
from PIL import Image
from ultralytics import YOLO

detector = YOLO("yolo12m.pt", "cpu")

image_path = "debug/images/face_4096.jpg"
image = Image.open(image_path).convert("RGB")

process = psutil.Process(os.getpid())

while True:
    start_time = time.time()
    results = detector.predict(image)
    end_time = time.time()

    print(f"Detection took: {(end_time - start_time) * 1000:.4f} milliseconds")

I get sample output like this:

...
0: 640x640 1 person, 2 chairs, 52.4ms
Speed: 3.4ms preprocess, 52.4ms inference, 1.2ms postprocess per image at shape (1, 3, 640, 640)
Detection took: 226.1436 milliseconds

0: 640x640 1 person, 2 chairs, 52.4ms
Speed: 3.3ms preprocess, 52.4ms inference, 1.1ms postprocess per image at shape (1, 3, 640, 640)
Detection took: 227.5164 milliseconds

0: 640x640 1 person, 2 chairs, 52.5ms
Speed: 3.7ms preprocess, 52.5ms inference, 1.3ms postprocess per image at shape (1, 3, 640, 640)
Detection took: 226.4342 milliseconds

0: 640x640 1 person, 2 chairs, 52.5ms
Speed: 3.4ms preprocess, 52.5ms inference, 1.2ms postprocess per image at shape (1, 3, 640, 640)
Detection took: 231.9701 milliseconds

0: 640x640 1 person, 2 chairs, 52.4ms
Speed: 3.7ms preprocess, 52.4ms inference, 1.3ms postprocess per image at shape (1, 3, 640, 640)
Detection took: 235.8549 milliseconds

0: 640x640 1 person, 2 chairs, 52.4ms
Speed: 3.8ms preprocess, 52.4ms inference, 1.3ms postprocess per image at shape (1, 3, 640, 640)
Detection took: 234.9603 milliseconds

0: 640x640 1 person, 2 chairs, 52.5ms
Speed: 3.6ms preprocess, 52.5ms inference, 1.2ms postprocess per image at shape (1, 3, 640, 640)
Detection took: 232.1908 milliseconds

0: 640x640 1 person, 2 chairs, 52.4ms
Speed: 3.8ms preprocess, 52.4ms inference, 1.4ms postprocess per image at shape (1, 3, 640, 640)
Detection took: 237.7448 milliseconds
...

Here we can see that the total execution time (~226-238 ms) is much longer than the sum of the reported preprocess + inference + postprocess times (~57 ms).

Note: I have tested with the stream=True argument, and I get the same behavior.

The major difference is that you're measuring at the application level, not at the level of the code being executed. That measurement includes a lot of additional overhead, which can easily increase the observed time.
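
As a rough illustration (plain Python, not the actual Ultralytics code): the library only reports the stages it explicitly wraps, while your time.time() around predict() also captures everything else that happens inside the call:

import time

def other_work(image):
    # stand-in for source checks, conversions, results construction,
    # callbacks, etc. (hypothetical sleep, just to make the gap visible)
    time.sleep(0.17)

def timed_stages(image):
    # stand-in for the preprocess/inference/postprocess the library reports
    time.sleep(0.057)

image = object()

t0 = time.perf_counter()
timed_stages(image)
print(f"reported stages: {(time.perf_counter() - t0) * 1000:.1f} ms")

t0 = time.perf_counter()
other_work(image)
timed_stages(image)
print(f"whole call:      {(time.perf_counter() - t0) * 1000:.1f} ms")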

The times reported by the ultralytics library are measured here:

and the Profile class used to measure can be found here:

After reading predictor.stream_inference, it is not clear to me what takes all those milliseconds outside of the preprocess, inference, and postprocess functions.

Regardless, is there a way to speed up that surrounding code, or is that overhead just expected?

Note: I have found a lot of information on speeding up inference by exporting to other formats such as TensorRT and ONNX, quantizing, etc., but no information on speeding up the surrounding code.

You can try bypassing the intermediate steps and performing inference directly:
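
For example, a minimal sketch of what that can look like (my own simplified version, not necessarily what was linked here): call the underlying torch module on a tensor you build yourself and run NMS on the raw output, skipping the predictor's source handling and Results construction. The preprocessing below is simplified (plain resize, no letterboxing), and it assumes non_max_suppression from ultralytics.utils.ops:

import cv2
import torch
from ultralytics import YOLO
from ultralytics.utils.ops import non_max_suppression

model = YOLO("yolo12m.pt")
net = model.model.eval()  # underlying torch nn.Module

im_bgr = cv2.imread("debug/images/face_4096.jpg")

# Manual preprocess: resize, BGR -> RGB, HWC -> BCHW, float32 in [0, 1]
im = cv2.resize(im_bgr, (640, 640))[:, :, ::-1].copy()
t = torch.from_numpy(im).permute(2, 0, 1)[None].float() / 255

with torch.no_grad():
    preds = net(t)

# Manual postprocess: NMS on the raw predictions
dets = non_max_suppression(preds)
print(dets[0].shape)  # (num_detections, 6): x1, y1, x2, y2, conf, cls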

Thanks, your response led me down the path of testing the source type.

It turns out a lot of the time is spent converting the PIL.Image to an np.ndarray every time predict() is called, so I now convert it once up front:

image = np.array(image)  # convert PIL -> NumPy once, before the predict loop

I originally assumed the data type conversion would be handled by the preprocess function.
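
For anyone landing here later, this is roughly the adjusted benchmark (same script as above, with the conversion hoisted out of the loop):

import time
import numpy as np
from PIL import Image
from ultralytics import YOLO

detector = YOLO("yolo12m.pt")
image = Image.open("debug/images/face_4096.jpg").convert("RGB")
image_np = np.array(image)  # convert PIL -> NumPy once, before the loop
# note: Ultralytics treats NumPy input as BGR (OpenCV order), so detections
# may differ slightly from the PIL path unless you flip the channels

while True:
    start_time = time.time()
    results = detector.predict(image_np)
    end_time = time.time()
    print(f"Detection took: {(end_time - start_time) * 1000:.4f} milliseconds")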

Great find — source type matters a lot.

Ultralytics does handle conversions in preprocess, but giving it a NumPy array or Torch tensor avoids repeated PIL→NumPy work and path/loader setup each call. For tight loops, pass a preconverted np.ndarray (HWC, BGR uint8) or a preallocated torch.Tensor (BCHW, float32 in [0,1]) and reuse the same YOLO instance so predictor/dataset aren’t rebuilt. Also disable extras like save/show/verbose.

Minimal example:
from ultralytics import YOLO
import cv2, torch

model = YOLO("yolo11n.pt")  # recommend YOLO11 for best performance
im_bgr = cv2.imread("debug/images/face_4096.jpg")  # np.ndarray (HWC, BGR, uint8)

# or a preallocated tensor (note the .copy() to fix the negative stride from the BGR -> RGB flip):
# im_t = torch.from_numpy(im_bgr[..., ::-1].copy()).permute(2, 0, 1)[None].float() / 255

for _ in range(1000):
    _ = model.predict(im_bgr, stream=False, verbose=False, save=False, imgsz=640)

If you still see large gaps versus the internal speed breakdown, try stream=True generators for videos/directories, and profile the code outside of preprocess/inference/postprocess; some overhead can come from dataset/source setup, callbacks, plotting/logging, and the Python loop itself. The predictor's preprocess and call flow are documented in the Ultralytics engine predictor reference.
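
If you want to see exactly where the remaining milliseconds go, one option is to wrap a few predict() calls in the standard library profiler (a generic sketch, independent of Ultralytics internals):

import cProfile
import pstats

import numpy as np
from ultralytics import YOLO

model = YOLO("yolo12m.pt")
im = np.zeros((640, 640, 3), dtype=np.uint8)  # dummy frame, just for profiling

model.predict(im, verbose=False)  # warm-up so one-time setup isn't counted

profiler = cProfile.Profile()
profiler.enable()
for _ in range(20):
    model.predict(im, verbose=False)
profiler.disable()

# Sort by cumulative time to spot overhead outside preprocess/inference/postprocess
pstats.Stats(profiler).sort_stats("cumulative").print_stats(25)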
