Best approach to count stacked cardboard boxes using CCTV (2D RGB only)

Hi everyone 👋

I’m working on a real-world counting problem and would love to get some advice or ideas from the community.

I need to count the number of cardboard boxes in a warehouse — similar to this example image:

My setup:

  • Only one CCTV camera (2D RGB) available — no depth or stereo sensors.

  • Boxes are stacked tightly and often partially occluded.

  • The camera is fixed, so the viewing angle doesn’t change.

What I’ve tried / considered:

  • Object detection (YOLOv11 / OBB): struggles with overlapping boxes.

  • Instance segmentation (YOLOv11-seg or SAM): works better, but still has many false positives and under-segmentation (some clusters are merged).

  • Counting by area or volume estimation: not accurate due to perspective distortion.

My questions:

  1. Is instance segmentation still the best approach for this case, or is there a more robust method to handle heavy occlusion?

  2. Are there any recommended post-processing steps (e.g., edge-based mask refinement or geometric heuristics) to split merged boxes? One rough idea I'm considering is sketched after this list.

  3. Would perspective correction or homography calibration help improve segmentation accuracy? (A possible one-time calibration sketch is also below.)

  4. Any best practices for training YOLOv11-seg specifically for stacked box scenarios?
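
For question 2, this is the kind of geometric heuristic I have in mind: an untested distance-transform + watershed sketch (OpenCV), where "mask" is a single merged binary cluster taken from the segmentation output.

import cv2
import numpy as np

def split_merged_mask(mask: np.ndarray, peak_ratio: float = 0.5) -> int:
    """Estimate how many boxes a merged binary mask (uint8, 0/255) contains."""
    # Distance to the nearest background pixel; box centres show up as peaks.
    dist = cv2.distanceTransform(mask, cv2.DIST_L2, 5)
    # Seed regions: pixels "deep inside" a box.
    _, sure_fg = cv2.threshold(dist, peak_ratio * dist.max(), 255, 0)
    sure_fg = sure_fg.astype(np.uint8)
    _, markers = cv2.connectedComponents(sure_fg)
    # Watershed convention: background = 1, unknown = 0, seeds >= 2.
    markers = markers + 1
    unknown = cv2.subtract(mask, sure_fg)
    markers[unknown == 255] = 0
    markers = cv2.watershed(cv2.cvtColor(mask, cv2.COLOR_GRAY2BGR), markers)
    # Every label above 1 is one separated segment.
    return len(np.unique(markers[markers > 1]))

The peak_ratio of 0.5 would need tuning to the box size, and this assumes boxes show up as roughly convex blobs in the mask, so I'm not sure how well it holds up under heavy occlusion.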
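
For question 3, since the camera is fixed, I assume the calibration could be done once by clicking the corners of a known rectangle (e.g. a standard pallet) in the frame. A minimal sketch; the pixel coordinates and pallet dimensions below are made-up placeholders:

import cv2
import numpy as np

# Corners of a known rectangle in the CCTV frame (hypothetical values,
# picked once by hand since the camera never moves).
src_pts = np.float32([[412, 310], [885, 298], [960, 702], [350, 715]])
# The same corners in a fronto-parallel metric plane: a 120 cm x 100 cm
# pallet at 10 px/cm.
dst_pts = np.float32([[0, 0], [1200, 0], [1200, 1000], [0, 1000]])

H = cv2.getPerspectiveTransform(src_pts, dst_pts)

frame = cv2.imread("frame.jpg")  # hypothetical CCTV still
rectified = cv2.warpPerspective(frame, H, (1200, 1000))
# Segmenting "rectified" instead of "frame" gives box faces a roughly
# uniform scale, so area-based checks (mask area / single-box area)
# stop being skewed by perspective.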

I’m open to any suggestions — pipeline design, dataset tips, or even loss function tweaks that could improve instance separation.

Thanks in advance 🙏

Questions to help get to an answer:

  1. Are the boxes generally all the same size like shown in the image?
  2. I understand the aim is to count the boxes, but what’s the overall goal? Where does the box count data get sent to?
  3. Will there be multiple pallets (like your example image) or a single pallet in the frame? If it’s multiple, can it be changed to be single?

Also, FWIW, using a YOLOE model I was able to get this result:

Here’s the exact code I used:

from pathlib import Path

from ultralytics import YOLOE

p = Path.home() / "Downloads"
f = p / "boxes.jpg"

model = YOLOE("yoloe-11l-seg.pt")  # Large segmentation model
names = [
    "box",
    "bin",
    "handtruck",
    "person",
    "garage door",
    "forklift",
    "pallet",
    "",
]  # other classes that might be in the image to help separate detections
model.set_classes(names, model.get_text_pe(names))

# Low conf keeps faint detections of partially occluded boxes; a low iou
# threshold makes NMS aggressive about suppressing duplicate detections
# of the same tightly packed box.
results = model.predict(f, iou=0.11, conf=0.06)
results[0].show(masks=True, labels=False)  # masks only; labels would clutter the stack
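
Since the end goal is a count, you can read the number of detections straight off the results object. A minimal sketch, filtering to the "box" class from the names list above:

# Count only instances whose class name is "box"; the other prompt
# classes are there to soak up non-box objects.
count = sum(1 for c in results[0].boxes.cls if results[0].names[int(c)] == "box")
print(f"Detected {count} boxes")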
