I have exported a custom model trained with YOLO12x to the OpenVINO format and am using it to detect objects in videos. I want the inference time per frame to be as fast as possible, so I have two questions:
Which type of quantization should I use (FP32, FP16, or INT8)?
Which type of device should I use (Intel CPU or Intel GPU)?
INT8 will generally provide the fastest inference times, but you need to provide sufficient data for calibration or the accuracy of the model will decrease. If you don’t want the hassle of calibration, you can use FP16 (half) to get much faster speeds than the full FP32 model.
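For example, with the Ultralytics export API the two options look roughly like this (the weights path and calibration dataset YAML are placeholders for your own files); `int8=True` needs a representative `data` argument for calibration, while `half=True` does not:

```python
from ultralytics import YOLO

# Load the custom-trained YOLO12x weights (placeholder path)
model = YOLO("path/to/best.pt")

# FP16 (half) export: no calibration data needed
model.export(format="openvino", half=True)

# INT8 export: requires a representative calibration dataset
# (placeholder YAML; ideally the same data the model was trained on)
model.export(format="openvino", int8=True, data="path/to/data.yaml")
```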
In general, inference on a GPU will run faster. If you have a choice of which GPU to use, OpenVINO might not necessarily be the right export format.
Are you trying to optimize inference specifically for Intel hardware or for any hardware?
@BurhanQ At the moment I have a PC with an Intel Core i5-10300H CPU and Intel UHD Graphics, so I want to optimize inference specifically for this hardware.
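For the CPU-vs-iGPU question, the simplest check on that machine is to compile the exported model for both devices with the OpenVINO Runtime and time a few dummy inferences. A rough sketch (the model path and the 640x640 input shape are assumptions based on typical YOLO exports):

```python
import time
import numpy as np
import openvino as ov

core = ov.Core()
print("Available devices:", core.available_devices)  # e.g. ['CPU', 'GPU']

# Placeholder path to the exported OpenVINO IR
model = core.read_model("path/to/best_openvino_model/best.xml")
dummy = np.random.rand(1, 3, 640, 640).astype(np.float32)  # assumed input shape

for device in ("CPU", "GPU"):
    if device not in core.available_devices:
        continue
    compiled = core.compile_model(model, device_name=device)
    compiled(dummy)  # warm-up run
    start = time.perf_counter()
    for _ in range(50):
        compiled(dummy)
    avg_ms = (time.perf_counter() - start) / 50 * 1000
    print(f"{device}: {avg_ms:.1f} ms per frame")
```

Whichever device comes out faster for your exported precision is the one to use; on many thin/light Intel systems the iGPU wins for FP16 while the CPU can be competitive for INT8.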
The ONNX export workflow for INT8 quantization is not straightforward, and if we add support, we will have to make sure it works for most, if not all, users. It also doesn’t offer anything better than what we already support, because we provide INT8 through other backends that are more optimized than ONNX. If you’re looking for performance, you ought to use the backend that’s best optimized for your hardware: TensorRT for NVIDIA, OpenVINO for Intel CPUs, TFLite for ARM devices. They all support INT8.
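For reference, the `int8=True` flag is used the same way across those export backends in the Ultralytics API; only the `format` changes (the weights path and calibration YAML below are placeholders):

```python
from ultralytics import YOLO

model = YOLO("path/to/best.pt")  # placeholder weights

# Same INT8 flag, different backend depending on target hardware
model.export(format="engine", int8=True, data="path/to/data.yaml")    # TensorRT (NVIDIA GPUs)
model.export(format="openvino", int8=True, data="path/to/data.yaml")  # OpenVINO (Intel CPU/iGPU)
model.export(format="tflite", int8=True, data="path/to/data.yaml")    # TFLite (ARM / edge devices)
```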