Does SAHI with YOLO11 make sense?

Is there a reason to use SAHI with YOLO11, even though YOLO11's runtime performance scales linearly with input resolution?

Is it perhaps that the multiscale feature extractor does not work well at large resolutions?

Measurements

Here is a summary of the performance metrics for each YOLO model at different input resolutions on an NVIDIA GeForce GTX 1650 Ti with TensorRT.

YOLOv10 Series Performance

| Model | Input Size | Avg. Preprocess Time (ms) | Avg. Inference Time (ms) | Avg. Postprocess Time (ms) | Avg. VRAM Used (MB) |
|---|---|---|---|---|---|
| yolov10n | (320, 320) | 1.2481 | 2.1734 | 0.8606 | 882.44 |
| | (480, 480) | 1.8137 | 2.3863 | 0.5758 | 882.44 |
| | (640, 640) | 2.8106 | 3.1079 | 0.5843 | 882.44 |
| | (960, 960) | 5.7605 | 6.3184 | 0.5973 | 882.44 |
| | (1280, 1280) | 9.3013 | 11.0719 | 0.5505 | 882.44 |
| yolov10s | (320, 320) | 1.2377 | 3.1524 | 0.5724 | 1030.44 |
| | (480, 480) | 1.8538 | 4.2261 | 0.5993 | 1030.44 |
| | (640, 640) | 2.6764 | 6.3744 | 0.5907 | 1030.44 |
| | (960, 960) | 5.7693 | 13.8813 | 0.5528 | 1030.44 |
| | (1280, 1280) | 9.3562 | 25.7975 | 0.5460 | 1030.44 |
| yolov10m | (320, 320) | 1.2352 | 5.6571 | 0.5924 | 1232.44 |
| | (480, 480) | 1.8136 | 9.5817 | 0.5973 | 1232.44 |
| | (640, 640) | 2.7824 | 15.3578 | 0.6080 | 1232.44 |
| | (960, 960) | 5.6341 | 34.8607 | 0.6256 | 1232.44 |
| | (1280, 1280) | 9.2966 | 61.3020 | 0.5545 | 1232.44 |
| yolov10b | (320, 320) | 1.3440 | 7.3629 | 0.6265 | 1366.44 |
| | (480, 480) | 1.8390 | 13.1377 | 0.6610 | 1366.44 |
| | (640, 640) | 2.7249 | 21.8776 | 0.6012 | 1366.44 |
| | (960, 960) | 5.6994 | 48.9557 | 0.6164 | 1366.44 |
| | (1280, 1280) | 9.3457 | 86.7367 | 0.5631 | 1366.44 |
| yolov10l | (320, 320) | 1.1543 | 9.3652 | 0.5938 | 1410.44 |
| | (480, 480) | 1.8108 | 16.9321 | 0.6194 | 1410.44 |
| | (640, 640) | 2.8090 | 28.5152 | 0.6220 | 1410.44 |
| | (960, 960) | 5.5664 | 63.7005 | 0.6184 | 1410.44 |
| | (1280, 1280) | 9.4435 | 112.4555 | 0.5680 | 1410.44 |
| yolov10x | (320, 320) | 1.1555 | 13.1908 | 0.6079 | 1690.44 |
| | (480, 480) | 1.7787 | 24.4779 | 0.6228 | 1690.44 |
| | (640, 640) | 2.7059 | 41.6001 | 0.6006 | 1690.44 |
| | (960, 960) | 5.7067 | 90.4771 | 0.5575 | 1690.44 |
| | (1280, 1280) | 9.3776 | 157.9123 | 0.5594 | 1690.44 |

YOLO11 Series Performance

| Model | Input Size | Avg. Preprocess Time (ms) | Avg. Inference Time (ms) | Avg. Postprocess Time (ms) | Avg. VRAM Used (MB) |
|---|---|---|---|---|---|
| yolo11n | (320, 320) | 1.2347 | 1.8521 | 1.8750 | 1014.44 |
| | (480, 480) | 1.8643 | 2.4851 | 1.5385 | 1014.44 |
| | (640, 640) | 2.7956 | 3.1183 | 1.5210 | 1014.44 |
| | (960, 960) | 5.6465 | 6.0820 | 0.9226 | 1014.44 |
| | (1280, 1280) | 9.2960 | 10.7876 | 0.9377 | 1014.44 |
| yolo11s | (320, 320) | 1.2414 | 3.4239 | 1.5306 | 1080.44 |
| | (480, 480) | 1.8326 | 4.2235 | 1.5452 | 1080.44 |
| | (640, 640) | 2.8127 | 6.1560 | 1.5421 | 1080.44 |
| | (960, 960) | 5.7622 | 13.2459 | 0.9688 | 1080.44 |
| | (1280, 1280) | 9.2885 | 24.6474 | 1.1847 | 1080.44 |
| yolo11m | (320, 320) | 1.1947 | 6.1232 | 1.5600 | 1376.44 |
| | (480, 480) | 1.7759 | 9.9287 | 1.5427 | 1376.44 |
| | (640, 640) | 2.7940 | 15.6612 | 1.5372 | 1376.44 |
| | (960, 960) | 5.6145 | 35.5620 | 1.0836 | 1376.44 |
| | (1280, 1280) | 9.2957 | 63.2755 | 1.3753 | 1376.44 |
| yolo11l | (320, 320) | 1.2316 | 7.6474 | 1.5272 | 1424.44 |
| | (480, 480) | 1.7991 | 12.9173 | 1.5550 | 1424.44 |
| | (640, 640) | 2.7510 | 20.7198 | 1.5397 | 1424.44 |
| | (960, 960) | 5.5351 | 46.7643 | 1.1085 | 1424.44 |
| | (1280, 1280) | 9.2214 | 83.8281 | 1.3782 | 1424.44 |
| yolo11x | (320, 320) | 1.2655 | 14.1565 | 1.5343 | 2056.44 |
| | (480, 480) | 1.7940 | 26.1876 | 1.5846 | 2056.44 |
| | (640, 640) | 2.7864 | 43.4137 | 1.5644 | 2056.44 |
| | (960, 960) | 5.5922 | 98.3690 | 0.9486 | 2056.44 |
| | (1280, 1280) | 9.3528 | 173.5436 | 1.2526 | 2056.44 |
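
For reference, here is roughly how timings like the ones above can be collected with the Ultralytics Python API after exporting each model to a TensorRT engine. This is a minimal sketch rather than the exact script behind the tables: the model name, test image, run count, and the VRAM readout are placeholder assumptions. `Results.speed` reports per-stage times in milliseconds.

```python
import torch
from ultralytics import YOLO

SIZES = [320, 480, 640, 960, 1280]  # input resolutions to benchmark

for size in SIZES:
    # Export the PyTorch weights to a TensorRT engine built for this input size.
    model = YOLO("yolo11n.pt")  # placeholder model; repeat for each variant
    engine_path = model.export(format="engine", imgsz=size)
    trt_model = YOLO(engine_path)

    # Warm-up runs, not timed.
    for _ in range(10):
        trt_model.predict("sample.jpg", imgsz=size, verbose=False)

    # Average the per-stage times reported by Ultralytics (milliseconds per image).
    times = {"preprocess": 0.0, "inference": 0.0, "postprocess": 0.0}
    n_runs = 100
    for _ in range(n_runs):
        result = trt_model.predict("sample.jpg", imgsz=size, verbose=False)[0]
        for stage in times:
            times[stage] += result.speed[stage]

    # Rough VRAM proxy via the PyTorch allocator; the tables above may have
    # been measured differently (e.g. via nvidia-smi process memory).
    vram_mb = torch.cuda.memory_reserved() / 1024**2
    print(size, {k: round(v / n_runs, 4) for k, v in times.items()}, f"{vram_mb:.2f} MB")
```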

### Note
YOLO12 does have sublinear performance scaling with respect to input resolution, but due to a software issue I cannot run it with TensorRT.

Sources

The primary use case for SAHI is when the objects to be detected are extremely small relative to the image dimensions. If the objects are at least 20 × 20 pixels in a 640 × 640 image (roughly 3% of the image size), then it's unlikely that there is a need for SAHI. The original publication for the SAHI method describes using it for images where objects are < 1% of the image width.

There will always be a trade-off when increasing image resolution for inference, and even more so when using sliced inference. You'll need to decide what the threshold is for your use case and test whether it meets your intended goal. That said, I would use SAHI only in cases where inference speed is not highly critical.
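
For anyone who wants to try it, sliced inference with the `sahi` package might look like the sketch below. The model path, slice size, and thresholds are placeholders, and depending on your SAHI version the `model_type` string may be `"ultralytics"` or `"yolov8"`:

```python
from sahi import AutoDetectionModel
from sahi.predict import get_sliced_prediction

# Wrap an Ultralytics YOLO11 checkpoint for SAHI
# (model_type may be "yolov8" on older SAHI releases).
detection_model = AutoDetectionModel.from_pretrained(
    model_type="ultralytics",
    model_path="yolo11n.pt",
    confidence_threshold=0.3,
    device="cuda:0",
)

# Slice a large frame into 640x640 tiles with 20% overlap, run the detector on
# each tile, then merge the per-tile predictions back into full-image coordinates.
result = get_sliced_prediction(
    "large_frame.jpg",  # placeholder image path
    detection_model,
    slice_height=640,
    slice_width=640,
    overlap_height_ratio=0.2,
    overlap_width_ratio=0.2,
)

for pred in result.object_prediction_list:
    print(pred.category.name, round(pred.score.value, 3), pred.bbox.to_xyxy())
```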

I did not realize SAHI had a research paper associated with it; thanks for pointing it out.

The benefit, as I understand it:
SAHI allows detecting small objects without fine-tuning a model specifically for small objects. For example, running a 640x640 detector over slices of a 1280x1280 image effectively increases the size of the objects relative to the input the detector receives.
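
To put some illustrative numbers on it (my own example, not from the thread): a 32-pixel-wide object in a 1280x1280 frame is 2.5% of the image width, and it would shrink to 16 pixels if the whole frame were downscaled to a 640x640 input. Inside a 640x640 slice the same object stays 32 pixels wide, i.e. 5% of the detector's input, so it appears twice as large in each dimension to a fixed-resolution model.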

My follow-up question is:
Is it possible to train or fine-tune a model such as YOLO11 to natively detect small objects while running at high resolutions (such as 1280x1280, 1920x1080, or even 3840x2160) by training it with high-resolution images? Or are the training algorithm and model architecture not capable of scaling effectively to such high input resolutions?

Thank you very much!

Yes, you can! That said, you may run into hardware issues, as larger images require more GPU and system memory. Even if you have sufficient hardware, training will take longer. You can train at practically any resolution you'd like; however, it's quite likely there will be a point of diminishing returns. Unfortunately, the only way to find that threshold is to conduct testing.
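
As a concrete starting point, training at a larger input size is mainly a matter of raising `imgsz`. A minimal sketch with placeholder dataset and hyperparameters (batch size usually has to drop as resolution grows to fit in GPU memory):

```python
from ultralytics import YOLO

# Fine-tune YOLO11 at a higher input resolution; the dataset YAML and
# hyperparameters below are placeholders for your own setup.
model = YOLO("yolo11s.pt")
model.train(
    data="my_dataset.yaml",  # hypothetical dataset config
    imgsz=1280,              # train at 1280x1280 instead of the default 640
    epochs=100,
    batch=4,                 # smaller batch to compensate for the larger images
)

# Validate at the same resolution you intend to use at inference time.
metrics = model.val(imgsz=1280)
print(metrics.box.map50)
```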
