AFAIK, there aren’t any “standards” for training on the VisDrone dataset. There’s no pretrained model either, so you’ll have to take your best guess. You could check the original publication https://arxiv.org/pdf/2001.06303
however, from a quick glance I didn’t see any mention of the training arguments. If you’re using a larger model (large or extra-large), you’ll probably need to train for a long time. You could start with 600–800 epochs and set an early-stopping patience of 50–100 epochs.

For resolution, it will depend on several factors, but given you have an A100 to train on, maybe start with 1280 and then adjust for the optimal batch size. It might be worthwhile to try a few small training runs (20–30 epochs) as a quick check for batch sizing. I’d also consider testing various image sizes: 1280 could work, but you might see just as good performance at 960 or 800, so a few quick tests at different sizes could be useful. These checks may not give an exact answer, but they should give you somewhere reasonable to start from.
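As a rough starting point, here’s a minimal sketch using the Ultralytics Python API. The yolo11l.pt checkpoint and the exact epochs/patience values are assumptions to adjust, not recommendations:

```python
from ultralytics import YOLO

# Assumed checkpoint: a large YOLO11 model; swap in whichever model you actually use
model = YOLO("yolo11l.pt")

# Rough initial settings for VisDrone on an A100; treat the numbers as guesses to tune
model.train(
    data="VisDrone.yaml",  # dataset config bundled with Ultralytics
    epochs=700,            # somewhere in the suggested 600-800 range
    patience=75,           # early stop after 50-100 epochs with no improvement
    imgsz=1280,            # start high for small objects, then experiment lower
    batch=-1,              # AutoBatch: picks the largest batch that fits in memory
)
```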
Thank you @BurhanQ for sharing this research paper and the detailed information!
This is very helpful for my VisDrone training setup.
I’ll review the paper and implement the recommendations for the imgsz configuration and epoch settings.
If I run into any questions or need further clarification while implementing these recommendations, I’ll ask. Thank you for the guidance!
There aren’t official “standards,” but on an A100, imgsz 960–1280 with early stopping (patience 80–120) works well; do a 20–30 epoch sanity run to pick imgsz/batch (a quick sketch below). Prefer YOLO11 for stability and speed; YOLO12 is not recommended due to training instability and heavy attention layers.
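A quick sketch of what that sanity run could look like; the checkpoint and the candidate image sizes are assumptions:

```python
from ultralytics import YOLO

# Short sanity runs across candidate image sizes, starting from fresh weights each time
for size in (960, 1120, 1280):
    model = YOLO("yolo11l.pt")  # assumed checkpoint
    model.train(data="VisDrone.yaml", epochs=25, imgsz=size, batch=-1, name=f"sanity_{size}")
```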
If helpful, see the Ultralytics docs for the VisDrone dataset guide and the train settings reference. Share your results.png if you want tuning suggestions.
Thank you @pderrenger for the detailed guidance! Yes, I conducted a sanity run and confirmed 1280 resolution performs best on my A100 setup for VisDrone small object detection.
Regarding the image size for paper reporting: I’ve noticed an inconsistency in research papers: some use 640×640, others use higher resolutions (960–1280), and many don’t explicitly report training resolution at all. This creates a challenge for fair comparison.
The inconsistency is unfortunate, but for practical purposes you’ll generally have to determine the proper resolution for your own use case. Smaller objects tend to call for higher resolutions, although there is a limit. If you can successfully detect objects at 1280, you can stay with that, or you can test lowering the resolution to find the drop-off point where the model can no longer detect objects. The reason to do this is that lower resolutions generally mean faster inference. If inference speed is not a concern, you don’t have to test lower resolutions, but it could be worth trying as an experiment (in my experience, speed becomes a concern eventually).
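One way to run that kind of sweep is to validate the same trained weights at decreasing resolutions and log both accuracy and inference speed; the weights path here is hypothetical:

```python
from ultralytics import YOLO

# Hypothetical path to the weights from your 1280 training run
model = YOLO("runs/detect/train/weights/best.pt")

# Validate at decreasing resolutions to find the accuracy drop-off point
for size in (1280, 960, 800, 640):
    metrics = model.val(data="VisDrone.yaml", imgsz=size)
    print(f"imgsz={size}: mAP50={metrics.box.map50:.3f}, "
          f"inference={metrics.speed['inference']:.1f} ms/img")
```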
@BurhanQ Thank you for that clarification on the speed-accuracy tradeoff! I’ve now had a chance to run the resolution sweep you suggested, and the results are significant: when I reduced training resolution from 1280 to 640, I observed a 9 percentage point drop in mAP@50.
Is a 9-point mAP drop typical for VisDrone when going from 1280 → 640?
I’d also appreciate suggestions for customizing the model for better mAP!
Given that the objects in the VisDrone dataset are quite small, that’s not terribly surprising. You might try imgsz=800 or imgsz=960 (really, any multiple of 32) to see if the mAP drop is small enough to be acceptable.
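For example, a quick sketch that fine-tunes your existing 1280 weights at 960; the weights path and epoch count are assumptions:

```python
from ultralytics import YOLO

# Hypothetical path to your 1280-trained weights; fine-tune briefly at a lower size
model = YOLO("runs/detect/train/weights/best.pt")
model.train(data="VisDrone.yaml", epochs=100, imgsz=960, batch=-1, name="finetune_960")
```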