Does SAHI with YOLO11 make sense?

Is there a reason to use SAHI with YOLO11, even though YOLO11's runtime performance scales linearly with input resolution?

Is it perhaps that the multiscale feature extractor does not work well at large resolutions?

Measurements

Here is a summary of the performance metrics for each YOLO model at different input resolutions on an NVIDIA GeForce GTX 1650 Ti with TensorRT.

YOLOv10 Series Performance

| Model | Input Size | Avg. Preprocess Time (ms) | Avg. Inference Time (ms) | Avg. Postprocess Time (ms) | Avg. VRAM Used (MB) |
|---|---|---|---|---|---|
| yolov10n | (320, 320) | 1.2481 | 2.1734 | 0.8606 | 882.44 |
| | (480, 480) | 1.8137 | 2.3863 | 0.5758 | 882.44 |
| | (640, 640) | 2.8106 | 3.1079 | 0.5843 | 882.44 |
| | (960, 960) | 5.7605 | 6.3184 | 0.5973 | 882.44 |
| | (1280, 1280) | 9.3013 | 11.0719 | 0.5505 | 882.44 |
| yolov10s | (320, 320) | 1.2377 | 3.1524 | 0.5724 | 1030.44 |
| | (480, 480) | 1.8538 | 4.2261 | 0.5993 | 1030.44 |
| | (640, 640) | 2.6764 | 6.3744 | 0.5907 | 1030.44 |
| | (960, 960) | 5.7693 | 13.8813 | 0.5528 | 1030.44 |
| | (1280, 1280) | 9.3562 | 25.7975 | 0.5460 | 1030.44 |
| yolov10m | (320, 320) | 1.2352 | 5.6571 | 0.5924 | 1232.44 |
| | (480, 480) | 1.8136 | 9.5817 | 0.5973 | 1232.44 |
| | (640, 640) | 2.7824 | 15.3578 | 0.6080 | 1232.44 |
| | (960, 960) | 5.6341 | 34.8607 | 0.6256 | 1232.44 |
| | (1280, 1280) | 9.2966 | 61.3020 | 0.5545 | 1232.44 |
| yolov10b | (320, 320) | 1.3440 | 7.3629 | 0.6265 | 1366.44 |
| | (480, 480) | 1.8390 | 13.1377 | 0.6610 | 1366.44 |
| | (640, 640) | 2.7249 | 21.8776 | 0.6012 | 1366.44 |
| | (960, 960) | 5.6994 | 48.9557 | 0.6164 | 1366.44 |
| | (1280, 1280) | 9.3457 | 86.7367 | 0.5631 | 1366.44 |
| yolov10l | (320, 320) | 1.1543 | 9.3652 | 0.5938 | 1410.44 |
| | (480, 480) | 1.8108 | 16.9321 | 0.6194 | 1410.44 |
| | (640, 640) | 2.8090 | 28.5152 | 0.6220 | 1410.44 |
| | (960, 960) | 5.5664 | 63.7005 | 0.6184 | 1410.44 |
| | (1280, 1280) | 9.4435 | 112.4555 | 0.5680 | 1410.44 |
| yolov10x | (320, 320) | 1.1555 | 13.1908 | 0.6079 | 1690.44 |
| | (480, 480) | 1.7787 | 24.4779 | 0.6228 | 1690.44 |
| | (640, 640) | 2.7059 | 41.6001 | 0.6006 | 1690.44 |
| | (960, 960) | 5.7067 | 90.4771 | 0.5575 | 1690.44 |
| | (1280, 1280) | 9.3776 | 157.9123 | 0.5594 | 1690.44 |

YOLO11 Series Performance

| Model | Input Size | Avg. Preprocess Time (ms) | Avg. Inference Time (ms) | Avg. Postprocess Time (ms) | Avg. VRAM Used (MB) |
|---|---|---|---|---|---|
| yolo11n | (320, 320) | 1.2347 | 1.8521 | 1.8750 | 1014.44 |
| | (480, 480) | 1.8643 | 2.4851 | 1.5385 | 1014.44 |
| | (640, 640) | 2.7956 | 3.1183 | 1.5210 | 1014.44 |
| | (960, 960) | 5.6465 | 6.0820 | 0.9226 | 1014.44 |
| | (1280, 1280) | 9.2960 | 10.7876 | 0.9377 | 1014.44 |
| yolo11s | (320, 320) | 1.2414 | 3.4239 | 1.5306 | 1080.44 |
| | (480, 480) | 1.8326 | 4.2235 | 1.5452 | 1080.44 |
| | (640, 640) | 2.8127 | 6.1560 | 1.5421 | 1080.44 |
| | (960, 960) | 5.7622 | 13.2459 | 0.9688 | 1080.44 |
| | (1280, 1280) | 9.2885 | 24.6474 | 1.1847 | 1080.44 |
| yolo11m | (320, 320) | 1.1947 | 6.1232 | 1.5600 | 1376.44 |
| | (480, 480) | 1.7759 | 9.9287 | 1.5427 | 1376.44 |
| | (640, 640) | 2.7940 | 15.6612 | 1.5372 | 1376.44 |
| | (960, 960) | 5.6145 | 35.5620 | 1.0836 | 1376.44 |
| | (1280, 1280) | 9.2957 | 63.2755 | 1.3753 | 1376.44 |
| yolo11l | (320, 320) | 1.2316 | 7.6474 | 1.5272 | 1424.44 |
| | (480, 480) | 1.7991 | 12.9173 | 1.5550 | 1424.44 |
| | (640, 640) | 2.7510 | 20.7198 | 1.5397 | 1424.44 |
| | (960, 960) | 5.5351 | 46.7643 | 1.1085 | 1424.44 |
| | (1280, 1280) | 9.2214 | 83.8281 | 1.3782 | 1424.44 |
| yolo11x | (320, 320) | 1.2655 | 14.1565 | 1.5343 | 2056.44 |
| | (480, 480) | 1.7940 | 26.1876 | 1.5846 | 2056.44 |
| | (640, 640) | 2.7864 | 43.4137 | 1.5644 | 2056.44 |
| | (960, 960) | 5.5922 | 98.3690 | 0.9486 | 2056.44 |
| | (1280, 1280) | 9.3528 | 173.5436 | 1.2526 | 2056.44 |
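
For reference, here is roughly how timings like the ones above can be collected with the Ultralytics Python API after exporting each model to a TensorRT engine. This is a minimal sketch rather than the exact script behind the tables: the model name, test image, run count, and the VRAM readout are placeholder assumptions. `Results.speed` reports per-stage times in milliseconds.

```python
import torch
from ultralytics import YOLO

SIZES = [320, 480, 640, 960, 1280]  # input resolutions to benchmark

for size in SIZES:
    # Export the PyTorch weights to a TensorRT engine built for this input size.
    model = YOLO("yolo11n.pt")  # placeholder model; repeat for each variant
    engine_path = model.export(format="engine", imgsz=size)
    trt_model = YOLO(engine_path)

    # Warm-up runs, not timed.
    for _ in range(10):
        trt_model.predict("sample.jpg", imgsz=size, verbose=False)

    # Average the per-stage times reported by Ultralytics (milliseconds per image).
    times = {"preprocess": 0.0, "inference": 0.0, "postprocess": 0.0}
    n_runs = 100
    for _ in range(n_runs):
        result = trt_model.predict("sample.jpg", imgsz=size, verbose=False)[0]
        for stage in times:
            times[stage] += result.speed[stage]

    # Rough VRAM proxy via the PyTorch allocator; the tables above may have
    # been measured differently (e.g. via nvidia-smi process memory).
    vram_mb = torch.cuda.memory_reserved() / 1024**2
    print(size, {k: round(v / n_runs, 4) for k, v in times.items()}, f"{vram_mb:.2f} MB")
```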

### Note
YOLO12 does have sublinear performance scaling with respect to input resolution, but due to a software issue I cannot run it with TensorRT.

Sources

The primary use case for SAHI is when the objects to be detected are extremely small relative to the image dimensions. If the objects are at least 20 × 20 pixels in a 640 × 640 image (roughly 3% of the image size), then it's unlikely that there is a need for SAHI. The original publication for the SAHI method describes using it for images where objects are < 1% of the image width.

There will always be a trade-off when increasing image resolution for inference, and even more so when using sliced inference. You'll need to decide what the threshold is for your use case and test whether it meets your intended goal. That said, I would use SAHI only in cases where inference speed is not highly critical.
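
For anyone who wants to try it, sliced inference with the `sahi` package might look like the sketch below. The model path, slice size, and thresholds are placeholders, and depending on your SAHI version the `model_type` string may be `"ultralytics"` or `"yolov8"`:

```python
from sahi import AutoDetectionModel
from sahi.predict import get_sliced_prediction

# Wrap an Ultralytics YOLO11 checkpoint for SAHI
# (model_type may be "yolov8" on older SAHI releases).
detection_model = AutoDetectionModel.from_pretrained(
    model_type="ultralytics",
    model_path="yolo11n.pt",
    confidence_threshold=0.3,
    device="cuda:0",
)

# Slice a large frame into 640x640 tiles with 20% overlap, run the detector on
# each tile, then merge the per-tile predictions back into full-image coordinates.
result = get_sliced_prediction(
    "large_frame.jpg",  # placeholder image path
    detection_model,
    slice_height=640,
    slice_width=640,
    overlap_height_ratio=0.2,
    overlap_width_ratio=0.2,
)

for pred in result.object_prediction_list:
    print(pred.category.name, round(pred.score.value, 3), pred.bbox.to_xyxy())
```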

I did not realize SAHI had a research paper associated with it; thanks for pointing it out.

The benefit, as I understand it:
SAHI allows detecting small objects without fine-tuning a model specifically for small objects. For example, running a 640x640 detector over slices of a 1280x1280 image effectively increases the size of the objects relative to the input the detector receives.
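
To put some illustrative numbers on it (my own example, not from the thread): a 32-pixel-wide object in a 1280x1280 frame is 2.5% of the image width, and it would shrink to 16 pixels if the whole frame were downscaled to a 640x640 input. Inside a 640x640 slice the same object stays 32 pixels wide, i.e. 5% of the detector's input, so it appears twice as large in each dimension to a fixed-resolution model.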

My follow-up question is:
Is it possible to train or fine-tune a model such as YOLO11 to natively detect small objects while running at high resolutions (such as 1280x1280, 1920x1080, or even 3840x2160) by training it with high-resolution images? Or are the training algorithm and model architecture not capable of scaling effectively to such high input resolutions?

Thank you very much!

Yes, you can! That said, you may run into hardware issues, as larger images require more GPU and system memory. Even if you have sufficient hardware, training will take longer. You can train at practically any resolution you'd like; however, it's quite likely there will be a point of diminishing returns. Unfortunately, the only way to find that threshold is to conduct testing.
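
As a concrete starting point, training at a larger input size is mainly a matter of raising `imgsz`. A minimal sketch with placeholder dataset and hyperparameters (batch size usually has to drop as resolution grows to fit in GPU memory):

```python
from ultralytics import YOLO

# Fine-tune YOLO11 at a higher input resolution; the dataset YAML and
# hyperparameters below are placeholders for your own setup.
model = YOLO("yolo11s.pt")
model.train(
    data="my_dataset.yaml",  # hypothetical dataset config
    imgsz=1280,              # train at 1280x1280 instead of the default 640
    epochs=100,
    batch=4,                 # smaller batch to compensate for the larger images
)

# Validate at the same resolution you intend to use at inference time.
metrics = model.val(imgsz=1280)
print(metrics.box.map50)
```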
