I’m looking for a low-level explanation of how post-training quantization (PTQ) and, specifically, the INT8 calibration process work. My goal is to understand the underlying mechanics and best practices.
Here are my specific questions:
How do INT8 calibration algorithms work?
What are the best practices for the calibration dataset?
Should it be the same as the training dataset?
What are the consequences of using a different dataset?
How does Post-Training Quantization compare to Quantization-Aware Training (QAT)?
I honestly don’t know the low-level details very well, but this is a great article on the topic.
A quick overview: during calibration, the conversion tool runs your sample data through the model and records the range of values that the weights and, more importantly, the activations take in each layer. From those ranges it picks scale factors so that the INT8 model ‘responds’ as closely as possible to the original FP32 model. That’s why you need the calibration data: it gives the calibration process a ‘test’ measurement to find the best conversion.
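To make that concrete, here’s a rough numpy sketch of the simplest calibration scheme, min/max (sometimes called absmax): observe the largest activation magnitude over the calibration batches, turn it into a scale factor, then use that scale to round values to INT8. Real toolkits often use smarter calibrators (entropy- or percentile-based ones that clip outliers), but the basic mechanics are the same; none of the names here come from any particular library.

```python
import numpy as np

# Illustrative sketch of min/max ("absmax") calibration for one tensor.
# Not any specific toolkit's API, just the core idea: observe the value
# range on calibration data, derive a scale, then map values to INT8.

def calibrate_scale(calibration_batches):
    """Track the largest absolute value seen across calibration batches."""
    amax = 0.0
    for batch in calibration_batches:
        amax = max(amax, float(np.abs(batch).max()))
    # Symmetric quantization: map [-amax, +amax] onto [-127, 127].
    return amax / 127.0

def quantize(x, scale):
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Example: calibrate on a few random "activation" batches, then check error.
rng = np.random.default_rng(0)
calib = [rng.normal(0, 1, size=(32, 256)).astype(np.float32) for _ in range(8)]
scale = calibrate_scale(calib)

x = rng.normal(0, 1, size=(32, 256)).astype(np.float32)
x_hat = dequantize(quantize(x, scale), scale)
print("scale:", scale, "mean abs error:", np.abs(x - x_hat).mean())
```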
Calibrate with as much data as you can. Provide samples that are diverse and representative of your use case; you definitely want to include at least a few examples of every class. You can use your training, validation, and/or test set images, and you can also use new data, as long as it reflects what the model will see in deployment. If the calibration data doesn’t match your real inference data, the measured ranges (and therefore the scale factors) will be off and accuracy can suffer. The article linked above might help with additional understanding.
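If it helps, here’s one (made-up) way to put that advice into practice: sample a fixed number of images per class from a labeled split so every class is represented in the calibration set. The helper name and its arguments are just for illustration, not any library’s API.

```python
import random
from collections import defaultdict

# Hypothetical helper: build a small, class-balanced calibration set from
# labeled (image_path, class_id) pairs, e.g. drawn from your val split.
def build_calibration_set(samples, per_class=16, seed=0):
    by_class = defaultdict(list)
    for path, cls in samples:
        by_class[cls].append(path)
    rng = random.Random(seed)
    calib = []
    for cls, paths in by_class.items():
        rng.shuffle(paths)
        calib.extend(paths[:per_class])  # every class gets represented
    rng.shuffle(calib)
    return calib
```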
PTQ is quick and easy compared to QAT, but you’ll generally lose some accuracy with PTQ. QAT simulates quantization during training, so it helps the model stay as accurate as possible, while PTQ from FP32 → INT8 will usually have some accuracy loss (going FP32 → FP16 generally has little to no accuracy loss).
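For intuition on the difference, here’s a toy numpy sketch of the QAT idea on a one-layer linear model: the forward pass uses fake-quantized (quantize-then-dequantize) weights so training ‘sees’ the rounding error, and the gradient is passed straight through as if quantization were the identity. PTQ, by contrast, would just quantize the finished weights once after training. This is only a schematic illustration, not how any specific framework implements QAT.

```python
import numpy as np

# Tiny sketch of quantization-aware training on a 1-layer linear model,
# using a straight-through estimator: the forward pass uses fake-quantized
# weights, the gradient is computed as if quantization were the identity.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8)).astype(np.float32)
true_w = rng.normal(size=(8, 1)).astype(np.float32)
y = X @ true_w

def fake_quant(w):
    scale = np.abs(w).max() / 127.0 + 1e-12
    return np.clip(np.round(w / scale), -127, 127) * scale

w = np.zeros((8, 1), dtype=np.float32)
lr = 0.05
for _ in range(200):
    wq = fake_quant(w)               # forward pass sees INT8-rounded weights
    err = X @ wq - y
    grad = X.T @ err / len(X)        # straight-through: ignore d(fake_quant)/dw
    w -= lr * grad

print("final loss with quantized weights:",
      float(np.mean((X @ fake_quant(w) - y) ** 2)))
```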