Hello,
I have images of size 1980x1080, and I put them into yolo11-cls with imgsz = 1024.
I was wondering how the resize to 1024 happens.
From what I read on the internet, the images get resized to 1024x768 to keep the aspect ratio, and then the image gets padded on the right and left to reach 1024x1024.
The problem is that when I run inference, resizing the image directly to 1024x1024 gives better results than resizing to 1024x768 and then padding to 1024x1024. See the sketch below for what I mean by the resize-and-pad version.
I expected that reproducing the same process YOLO uses would give the best results, so I wonder if my understanding of the resize process is correct?
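To be concrete, here is roughly how I do the resize-and-pad preprocessing (a minimal Pillow sketch; the file name and grey fill value are just examples):

```python
from PIL import Image, ImageOps

img = Image.open("frame.jpg")  # placeholder for one of my 1980x1080 images

# Resize to fit inside 1024x1024 while keeping the aspect ratio,
# then pad the borders to reach a square 1024x1024 image
letterboxed = ImageOps.pad(img, (1024, 1024), color=(114, 114, 114))
print(letterboxed.size)  # (1024, 1024)
```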
Thank you for the fast answer!
Where does scale_size come from? Is it the value we put in imgsz when calling predict on YOLO?
If that's the case, doesn't YOLO classification always force scale_size[0] == scale_size[1]?
Does that mean the preprocess step is only resizing?
thanks!
I believe it’s derived from imgsz, but it’s calculated just above those lines:
where size is an argument to the classify_transforms function, and I presume it would use the imgsz argument when called.
It does resize, but if you look at L2598, T.CenterCrop(size) is included as part of the transformations, so there is some center cropping, which can cut off parts of the image when it isn’t square. I’m not sure why this is used instead of padding (which is used for object detection).
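For anyone who wants to see what that means in practice, here is a minimal torchvision sketch (not the Ultralytics source, just the same two operations) showing how a non-square image ends up cropped; the file name is a placeholder:

```python
import torchvision.transforms as T
from PIL import Image

size = 1024
pipeline = T.Compose([
    T.Resize(size),      # int argument: shortest edge is scaled to 1024, aspect ratio kept
    T.CenterCrop(size),  # square crop from the centre; excess width/height is discarded
])

img = Image.open("frame.jpg")  # placeholder, e.g. a 1980x1080 image
out = pipeline(img)
print(out.size)  # (1024, 1024) -- the left/right edges of a wide image are cropped away
```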
I looked into this while testing yolo11-cls on my own dataset. It seems like the resize step maintains aspect ratio by default and adds padding if needed—kind of like letterboxing. I had to tweak it a bit because my input images were all square and didn’t need padding.
I’m using 224x224 as input size with grayscale images, and it handled them fine.
Check the albumentations section in the data config—resize logic is applied there before batching.
You can override the default transforms in the val or predict functions if needed.
Thanks for your detailed question and for looking into the preprocessing steps.
For YOLO11-cls during prediction/validation with a single integer imgsz (e.g., imgsz=1024), the image preprocessing is typically handled by the classify_transforms function. This involves two main steps:
The image is first resized such that its shortest edge becomes equal to imgsz (1024 in your case), while maintaining the aspect ratio. For your 1980x1080 image, the 1080 dimension (height) would be scaled to 1024, and the width (1980) would be scaled proportionally to approximately 1877. So the image becomes roughly 1877x1024.
A center crop of (imgsz, imgsz) (i.e., 1024x1024) is then taken from this resized image.
This process differs from the letterboxing approach you described (resizing to fit within 1024x1024 and then padding), which explains why you’re observing different results. The scale_size is indeed derived from the imgsz you provide, and if imgsz is an integer, scale_size will effectively be (imgsz, imgsz). The preprocessing isn’t just resizing; it’s a resize followed by a center crop to achieve the final square input.
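If you want to double-check this on your side, you can build the same transforms and inspect the output shape. This sketch assumes classify_transforms is importable from ultralytics.data.augment in your installed version:

```python
from PIL import Image
from ultralytics.data.augment import classify_transforms

tfm = classify_transforms(size=1024)  # the transforms used for classification val/predict
img = Image.new("RGB", (1980, 1080))  # blank stand-in for one of your images

# Step 1: shortest edge 1080 -> 1024, width scales to ~1877 (aspect ratio kept)
# Step 2: 1024x1024 centre crop, trimming roughly 853 px of width in total
out = tfm(img)
print(out.shape)  # torch.Size([3, 1024, 1024])
```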