Is there a reason to use SAHI with YOLO11, even though YOLO11 inference time scales roughly linearly with input resolution (pixel count)?
Is it perhaps that the multiscale feature extractor does not work well at large resolutions?
Measurements
Here is a summary of the performance metrics for each YOLO model at different input resolutions on an NVIDIA GeForce GTX 1650 Ti with TensorRT.
YOLOv10 Series Performance
| Model | Input Size | Avg. Preprocess Time (ms) | Avg. Inference Time (ms) | Avg. Postprocess Time (ms) | Avg. VRAM Used (MB) |
|---|---|---|---|---|---|
| yolov10n | (320, 320) | 1.2481 | 2.1734 | 0.8606 | 882.44 |
| | (480, 480) | 1.8137 | 2.3863 | 0.5758 | 882.44 |
| | (640, 640) | 2.8106 | 3.1079 | 0.5843 | 882.44 |
| | (960, 960) | 5.7605 | 6.3184 | 0.5973 | 882.44 |
| | (1280, 1280) | 9.3013 | 11.0719 | 0.5505 | 882.44 |
| yolov10s | (320, 320) | 1.2377 | 3.1524 | 0.5724 | 1030.44 |
| | (480, 480) | 1.8538 | 4.2261 | 0.5993 | 1030.44 |
| | (640, 640) | 2.6764 | 6.3744 | 0.5907 | 1030.44 |
| | (960, 960) | 5.7693 | 13.8813 | 0.5528 | 1030.44 |
| | (1280, 1280) | 9.3562 | 25.7975 | 0.5460 | 1030.44 |
| yolov10m | (320, 320) | 1.2352 | 5.6571 | 0.5924 | 1232.44 |
| | (480, 480) | 1.8136 | 9.5817 | 0.5973 | 1232.44 |
| | (640, 640) | 2.7824 | 15.3578 | 0.6080 | 1232.44 |
| | (960, 960) | 5.6341 | 34.8607 | 0.6256 | 1232.44 |
| | (1280, 1280) | 9.2966 | 61.3020 | 0.5545 | 1232.44 |
| yolov10b | (320, 320) | 1.3440 | 7.3629 | 0.6265 | 1366.44 |
| | (480, 480) | 1.8390 | 13.1377 | 0.6610 | 1366.44 |
| | (640, 640) | 2.7249 | 21.8776 | 0.6012 | 1366.44 |
| | (960, 960) | 5.6994 | 48.9557 | 0.6164 | 1366.44 |
| | (1280, 1280) | 9.3457 | 86.7367 | 0.5631 | 1366.44 |
| yolov10l | (320, 320) | 1.1543 | 9.3652 | 0.5938 | 1410.44 |
| | (480, 480) | 1.8108 | 16.9321 | 0.6194 | 1410.44 |
| | (640, 640) | 2.8090 | 28.5152 | 0.6220 | 1410.44 |
| | (960, 960) | 5.5664 | 63.7005 | 0.6184 | 1410.44 |
| | (1280, 1280) | 9.4435 | 112.4555 | 0.5680 | 1410.44 |
| yolov10x | (320, 320) | 1.1555 | 13.1908 | 0.6079 | 1690.44 |
| | (480, 480) | 1.7787 | 24.4779 | 0.6228 | 1690.44 |
| | (640, 640) | 2.7059 | 41.6001 | 0.6006 | 1690.44 |
| | (960, 960) | 5.7067 | 90.4771 | 0.5575 | 1690.44 |
| | (1280, 1280) | 9.3776 | 157.9123 | 0.5594 | 1690.44 |
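As a rough check of what "linear" means here, the sketch below fits a power law to the yolov10x inference times from the table above. The helper `loglog_slope` is a hypothetical name introduced for illustration, and the fit is a plain least-squares line in log-log space, so treat the exponent as an estimate rather than a precise measurement. An exponent near 2 with respect to side length corresponds to an exponent near 1 with respect to pixel count.

```python
import math

# Avg. inference time (ms) for yolov10x, copied from the table above.
# Keys are the square input side length in pixels.
inference_ms = {
    320: 13.1908,
    480: 24.4779,
    640: 41.6001,
    960: 90.4771,
    1280: 157.9123,
}


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x), i.e. the power-law exponent."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den


sides = sorted(inference_ms)
times = [inference_ms[s] for s in sides]
pixels = [s * s for s in sides]

slope_vs_side = loglog_slope(sides, times)      # exponent w.r.t. side length
slope_vs_pixels = loglog_slope(pixels, times)   # exponent w.r.t. pixel count

print(f"exponent vs side length: {slope_vs_side:.2f}")
print(f"exponent vs pixel count: {slope_vs_pixels:.2f}")
```

For these numbers the exponent with respect to pixel count should come out somewhat below 1, which is consistent with a fixed per-call overhead that matters more at small input sizes.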
YOLO11 Series Performance
| Model | Input Size | Avg. Preprocess Time (ms) | Avg. Inference Time (ms) | Avg. Postprocess Time (ms) | Avg. VRAM Used (MB) |
|---|---|---|---|---|---|
| yolo11n | (320, 320) | 1.2347 | 1.8521 | 1.8750 | 1014.44 |
| | (480, 480) | 1.8643 | 2.4851 | 1.5385 | 1014.44 |
| | (640, 640) | 2.7956 | 3.1183 | 1.5210 | 1014.44 |
| | (960, 960) | 5.6465 | 6.0820 | 0.9226 | 1014.44 |
| | (1280, 1280) | 9.2960 | 10.7876 | 0.9377 | 1014.44 |
| yolo11s | (320, 320) | 1.2414 | 3.4239 | 1.5306 | 1080.44 |
| | (480, 480) | 1.8326 | 4.2235 | 1.5452 | 1080.44 |
| | (640, 640) | 2.8127 | 6.1560 | 1.5421 | 1080.44 |
| | (960, 960) | 5.7622 | 13.2459 | 0.9688 | 1080.44 |
| | (1280, 1280) | 9.2885 | 24.6474 | 1.1847 | 1080.44 |
| yolo11m | (320, 320) | 1.1947 | 6.1232 | 1.5600 | 1376.44 |
| | (480, 480) | 1.7759 | 9.9287 | 1.5427 | 1376.44 |
| | (640, 640) | 2.7940 | 15.6612 | 1.5372 | 1376.44 |
| | (960, 960) | 5.6145 | 35.5620 | 1.0836 | 1376.44 |
| | (1280, 1280) | 9.2957 | 63.2755 | 1.3753 | 1376.44 |
| yolo11l | (320, 320) | 1.2316 | 7.6474 | 1.5272 | 1424.44 |
| | (480, 480) | 1.7991 | 12.9173 | 1.5550 | 1424.44 |
| | (640, 640) | 2.7510 | 20.7198 | 1.5397 | 1424.44 |
| | (960, 960) | 5.5351 | 46.7643 | 1.1085 | 1424.44 |
| | (1280, 1280) | 9.2214 | 83.8281 | 1.3782 | 1424.44 |
| yolo11x | (320, 320) | 1.2655 | 14.1565 | 1.5343 | 2056.44 |
| | (480, 480) | 1.7940 | 26.1876 | 1.5846 | 2056.44 |
| | (640, 640) | 2.7864 | 43.4137 | 1.5644 | 2056.44 |
| | (960, 960) | 5.5922 | 98.3690 | 0.9486 | 2056.44 |
| | (1280, 1280) | 9.3528 | 173.5436 | 1.2526 | 2056.44 |
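To make the runtime side of the question concrete, here is a minimal sketch comparing one full-frame pass at 1280 with sliced inference over 640 tiles, using the yolo11x inference times from the table above. The function `slice_count` is a hypothetical helper that models a simple sliding window with a fractional overlap; it is an assumption for illustration and not SAHI's actual slicing code. The estimate also ignores preprocessing, postprocessing, and the cost of merging predictions.

```python
import math

# Avg. inference time (ms) for yolo11x, copied from the table above.
infer_ms = {640: 43.4137, 1280: 173.5436}


def slice_count(image_side, slice_side, overlap_ratio):
    """Number of square slices needed to cover a square image.

    Simplified sliding-window model (an assumption, not SAHI's exact logic):
    the window advances by slice_side minus the overlap in pixels.
    """
    if slice_side >= image_side:
        return 1
    overlap_px = int(overlap_ratio * slice_side)
    stride = slice_side - overlap_px
    per_axis = math.ceil((image_side - slice_side) / stride) + 1
    return per_axis ** 2


full_frame_ms = infer_ms[1280]

for overlap in (0.0, 0.2):
    n = slice_count(1280, 640, overlap)
    sliced_ms = n * infer_ms[640]
    print(f"overlap={overlap}: {n} slices, "
          f"sliced {sliced_ms:.1f} ms vs full frame {full_frame_ms:.1f} ms")
```

Under this model, four non-overlapping 640 slices cost about the same as one 1280 pass, and any overlap makes slicing slower. If that holds, the runtime argument for slicing largely disappears on this setup, and the remaining question is whether accuracy on small objects differs between the two approaches.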
### Note
YOLO12 does have sublinear performance scaling with respect to input resolution, but due to a software issue I cannot run it with TensorRT.
