I have exported a custom model trained with YOLO12x to the OpenVINO format and am using it to detect objects in videos. I want the inference time per frame to be as fast as possible, so I have two questions:
Which type of quantization should I use (FP32, FP16, or INT8)?
Which type of device should I use (Intel CPU or Intel GPU)?
INT8 will generally provide the fastest inference times, but you need to provide sufficient data for calibration or the accuracy of the model will decrease. If you don’t want the hassle of calibration, you can use FP16 (half) to get much faster speeds than the full FP32 model.
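For example, with the Ultralytics export API the two options look roughly like this (the weights path and calibration dataset YAML are placeholders for your own files); `int8=True` needs a representative `data` argument for calibration, while `half=True` does not:

```python
from ultralytics import YOLO

# Load the custom-trained YOLO12x weights (placeholder path)
model = YOLO("path/to/best.pt")

# FP16 (half) export: no calibration data needed
model.export(format="openvino", half=True)

# INT8 export: requires a representative calibration dataset
# (placeholder YAML; ideally the same data the model was trained on)
model.export(format="openvino", int8=True, data="path/to/data.yaml")
```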
In general, inference on a GPU will run faster. If you have a choice of which GPU to use, OpenVINO might not necessarily be the right export format.
Are you trying to optimize inference specifically for Intel hardware or for any hardware?
@BurhanQ At the moment I have a PC with an Intel Core i5-10300H CPU and Intel UHD Graphics, so I want to optimize inference specifically for this hardware.
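For the CPU-vs-iGPU question, the simplest check on that machine is to compile the exported model for both devices with the OpenVINO Runtime and time a few dummy inferences. A rough sketch (the model path and the 640x640 input shape are assumptions based on typical YOLO exports):

```python
import time
import numpy as np
import openvino as ov

core = ov.Core()
print("Available devices:", core.available_devices)  # e.g. ['CPU', 'GPU']

# Placeholder path to the exported OpenVINO IR
model = core.read_model("path/to/best_openvino_model/best.xml")
dummy = np.random.rand(1, 3, 640, 640).astype(np.float32)  # assumed input shape

for device in ("CPU", "GPU"):
    if device not in core.available_devices:
        continue
    compiled = core.compile_model(model, device_name=device)
    compiled(dummy)  # warm-up run
    start = time.perf_counter()
    for _ in range(50):
        compiled(dummy)
    avg_ms = (time.perf_counter() - start) / 50 * 1000
    print(f"{device}: {avg_ms:.1f} ms per frame")
```

Whichever device comes out faster for your exported precision is the one to use; on many thin/light Intel systems the iGPU wins for FP16 while the CPU can be competitive for INT8.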
The ONNX export workflow for INT8 quantization is not straightforward, and if we add support, we will have to make sure it works for most, if not all, users. It also doesn’t offer anything better than what we already support, because we provide INT8 through other backends that are more optimized than ONNX. If you’re looking for performance, you ought to use the backend that’s best optimized for your hardware: TensorRT for NVIDIA, OpenVINO for Intel CPUs, TFLite for ARM devices. They all support INT8.
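For reference, the `int8=True` flag is used the same way across those export backends in the Ultralytics API; only the `format` changes (the weights path and calibration YAML below are placeholders):

```python
from ultralytics import YOLO

model = YOLO("path/to/best.pt")  # placeholder weights

# Same INT8 flag, different backend depending on target hardware
model.export(format="engine", int8=True, data="path/to/data.yaml")    # TensorRT (NVIDIA GPUs)
model.export(format="openvino", int8=True, data="path/to/data.yaml")  # OpenVINO (Intel CPU/iGPU)
model.export(format="tflite", int8=True, data="path/to/data.yaml")    # TFLite (ARM / edge devices)
```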