Hi everyone!
I’m working on a real-world counting problem and would love to get some advice or ideas from the community.
I need to count the number of cardboard boxes in a warehouse — similar to this example image:
My setup:
- Only one CCTV camera (2D RGB) available — no depth or stereo sensors.
- Boxes are stacked tightly and often partially occluded.
- The camera is fixed, so the viewing angle doesn’t change.
What I’ve tried / considered:
- Object detection (YOLOv11 / OBB): struggles with overlapping boxes.
- Instance segmentation (YOLOv11-seg or SAM): works better, but still produces many false positives and under-segmentation (some clusters of boxes are merged into one mask).
- Counting by area or volume estimation: not accurate due to perspective distortion.
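For reference, the area-based idea from the last bullet was essentially the naive estimate below (everything here is illustrative: `mask` would be the union of the YOLOv11-seg masks, and `px_per_box` is a per-box pixel area I measured by hand near the image center, which is exactly where perspective distortion breaks the math):

```python
import numpy as np

# Naive area-based count (illustrative sketch, not validated).
# mask: binary "box pixels" mask, e.g. the union of instance masks.
# px_per_box: pixel area of one box, measured at one spot in the frame.
def count_by_area(mask: np.ndarray, px_per_box: float) -> int:
    total = int(mask.sum())           # total box-covered pixels
    return round(total / px_per_box)  # wrong when far boxes cover fewer pixels

# Toy frame: three "boxes" of 100 px each at a uniform scale.
mask = np.zeros((20, 20), dtype=bool)
mask[0:10, 0:10] = True    # 100 px
mask[0:10, 10:20] = True   # 100 px
mask[10:20, 0:10] = True   # 100 px
print(count_by_area(mask, px_per_box=100.0))  # -> 3
```

In the toy frame the count is exact; on the real camera, boxes near the back of the stack cover far fewer pixels than `px_per_box`, so the estimate undercounts.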
My questions:
- Is instance segmentation still the best approach for this case, or is there a more robust method to handle heavy occlusion?
- Are there any recommended post-processing steps (e.g., edge-based mask refinement or geometric heuristics) to split merged boxes?
- Would perspective correction or homography calibration help improve segmentation accuracy?
- Any best practices for training YOLOv11-seg specifically for stacked box scenarios?
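To make the "geometric heuristics" part of my second question concrete: the kind of thing I have in mind is erode-until-split counting on a merged mask. Below is a pure-NumPy toy sketch (nothing here is validated; a real version would presumably use `cv2.erode` and `cv2.connectedComponents`, and pick the erosion depth from the expected box size):

```python
import numpy as np

def erode(mask: np.ndarray, iters: int = 1) -> np.ndarray:
    """4-neighbour binary erosion; thin bridges between blobs vanish first."""
    m = mask.copy()
    for _ in range(iters):
        p = np.pad(m, 1)  # pad with False so borders erode too
        m = (p[1:-1, 1:-1] & p[:-2, 1:-1] & p[2:, 1:-1]
             & p[1:-1, :-2] & p[1:-1, 2:])
    return m

def count_components(mask: np.ndarray) -> int:
    """Count 4-connected components with an explicit-stack flood fill."""
    seen = np.zeros_like(mask, dtype=bool)
    h, w = mask.shape
    n = 0
    for i in range(h):
        for j in range(w):
            if mask[i, j] and not seen[i, j]:
                n += 1
                stack = [(i, j)]
                seen[i, j] = True
                while stack:
                    y, x = stack.pop()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w \
                                and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            stack.append((ny, nx))
    return n

# Two 10x10 "boxes" merged by a 1-pixel bridge, as under-segmentation gives me.
mask = np.zeros((10, 21), dtype=bool)
mask[0:10, 0:10] = True
mask[0:10, 11:21] = True
mask[5, 10] = True  # thin bridge merging the two blobs

print(count_components(mask))            # -> 1 (merged)
print(count_components(erode(mask, 1)))  # -> 2 (bridge eroded away)
```

My worry is that real merged masks share long contact edges rather than thin bridges, which is why I'm asking whether edge-based refinement or something smarter is the usual answer.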
I’m open to any suggestions — pipeline design, dataset tips, or even loss function tweaks that could improve instance separation.
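For the homography question, what I imagine is estimating a single plane-to-plane mapping from four hand-clicked reference points, then normalizing footprints in that rectified plane. A minimal DLT sketch with made-up correspondences (in practice I'd use `cv2.findHomography` instead of solving it by hand):

```python
import numpy as np

def homography(src, dst):
    """Exact 4-point homography via DLT, with h33 fixed to 1."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h11*x + h12*y + h13) / (h31*x + h32*y + 1), and same for v
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(h, 1.0).reshape(3, 3)

def warp_point(H, p):
    """Apply the homography to one (x, y) point."""
    x, y, w = H @ np.array([p[0], p[1], 1.0])
    return x / w, y / w

# Made-up correspondences: image trapezoid -> rectified unit square.
src = [(0.0, 0.0), (2.0, 0.0), (1.5, 1.0), (0.5, 1.0)]  # clicked in image
dst = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]  # floor-plan coords
H = homography(src, dst)
print(warp_point(H, src[2]))  # maps onto the corresponding dst corner
```

The hope is that after this rectification, per-box area becomes roughly constant across the frame, which might also help the area-counting idea above.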
Thanks in advance!

