Periodic confidence oscillations under 1px circular vertical shifts in YOLOv8 and RT-DETR (~32px fundamental at imgsz=640)

Context

I am analysing positional sensitivity in object detectors (YOLOv8 and RT-DETR via Ultralytics).

For a fixed image containing a single defect (tested on defects spanning roughly 1–4% of the image area), I generate a circular vertical shift sweep:

  • Shift image by dy = 0..H-1 (1 px increments, circular with np.roll).

  • Run inference for each shift.

  • Track the matched detection (IoU-based matching to the same GT object).

  • Record confidence conf(dy).

Recall in these sweeps is ~1.0 (detection almost never drops).
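In code, the sweep is roughly the following (a simplified sketch: the IoU helper and the matching rule are minimal stand-ins for my actual pipeline, and it assumes the defect box never wraps across the top/bottom border):

```python
import numpy as np

def iou(a, b):
    """IoU of two xyxy boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def conf_sweep(img, gt_box, weights="yolov8n.pt", iou_thr=0.5):
    """conf(dy) of the GT-matched detection over a circular vertical sweep."""
    from ultralytics import YOLO  # deferred so the helper above stays standalone
    model = YOLO(weights)
    H = img.shape[0]
    confs = np.full(H, np.nan)
    for dy in range(H):
        shifted = np.roll(img, dy, axis=0)        # circular vertical shift
        gt = np.asarray(gt_box, dtype=float)
        gt[[1, 3]] = (gt[[1, 3]] + dy) % H        # move the GT box with the shift
        res = model.predict(shifted, verbose=False)[0]
        ious = [iou(b.xyxy[0].tolist(), gt) for b in res.boxes]
        if ious and max(ious) >= iou_thr:
            confs[dy] = float(res.boxes[int(np.argmax(ious))].conf)
    return confs
```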


What I observe

The confidence signal conf(dy) is approximately periodic.

I compute periodicity using:

  • Sliding windows (STEP=1)

  • Z-normalization per window to focus on patterns instead of absolute conf

  • Mean window-correlation by lag

Clear peaks appear in the lag spectrum.
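A compact sketch of that lag analysis (window length and maximum lag are illustrative choices, not my exact settings):

```python
import numpy as np

def znorm(w):
    """Z-normalize a window so only the pattern matters, not absolute conf."""
    return (w - w.mean()) / (w.std() + 1e-9)

def lag_spectrum(conf, win=64, max_lag=128, step=1):
    """Mean correlation between z-normalized sliding windows a given lag apart."""
    conf = np.asarray(conf, dtype=float)
    spec = np.full(max_lag + 1, np.nan)
    for lag in range(1, max_lag + 1):
        cors = [
            float(np.mean(znorm(conf[s:s + win]) * znorm(conf[s + lag:s + lag + win])))
            for s in range(0, len(conf) - win - lag + 1, step)
        ]
        if cors:
            spec[lag] = np.mean(cors)
    return spec
```

On a synthetic 32 px-periodic signal this peaks at lags 32, 64, 96, … exactly as described above.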

Clean result (warp 640×640, imgsz=640)

Both YOLOv8 and RT-DETR show:

  • Dominant lag ≈ 32 px

  • Extremely strong harmonics at 64, 96, 128, … (their correlation peaks are often essentially as high as the 32 px peak, and occasionally higher by a tiny margin)

Correlation values are high (≈0.95–0.99).

This is extremely stable across multiple images.


Resolution experiments

I warped the same base image to different square resolutions before inference:

  • 320×320

  • 640×640

  • 1280×1280

Inference always with imgsz=640.

Observed dominant lags (some sweeps peak at multiples of the fundamental, but the trend is clear):

  • 320 → ~16

  • 640 → ~32

  • 1280 → ~64

After correcting for scale all collapse to ~32 px in input space.

So the fundamental periodicity appears invariant in input coordinates.


Rectangular image case (1700×1200)

Using original aspect ratio:

  • YOLOv8 (default inference uses LetterBox, aspect ratio + padding)
    → dominant period ≈ 85 px. 1700/20 = 85

  • RT-DETR (default inference uses scale_fill=True, warp without padding)
    → dominant period ≈ 60 px. 1200/20 = 60

These differences are explained by different vertical scaling factors in preprocessing.

After correcting for scale_y, both are again consistent with ~32 px in model input space.
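The scale correction can be made explicit with a small helper (assuming, as above, that the fundamental is one stride-32 cell in model input space; the function name is my own):

```python
def expected_vertical_period(h, w, imgsz=640, stride=32, letterbox=True):
    """Predicted conf-oscillation period in original-image pixels, assuming
    the fundamental is one stride-`stride` cell in model input space."""
    if letterbox:                  # YOLOv8 default: uniform scale by long side + pad
        return stride * max(h, w) / imgsz
    return stride * h / imgsz      # scale_fill (RT-DETR default): axes warped independently

print(expected_vertical_period(1200, 1700, letterbox=True))   # 85.0, matches YOLOv8
print(expected_vertical_period(1200, 1700, letterbox=False))  # 60.0, matches RT-DETR
```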


What I am trying to understand

  1. What architectural mechanism would generate a ~32 px periodic confidence oscillation in input space?

  2. Is this most likely tied to:

    • FPN stride structure (e.g., stride 8/16/32 heads)?

    • Effective quantization grid in the detection head?

    • Positional encoding discretization?

    • Assignment / decode / NMS behaviour?

  3. Why does the fundamental appear so clean and stable across images?

  4. Why does RT-DETR (transformer-based) exhibit the same periodic behaviour?


Key observation

The phenomenon:

  • Is not random

  • Is not resolution-dependent (after normalization)

  • Persists across architectures (YOLOv8 at various model sizes, and even YOLOv11)

  • Has strong harmonics

  • Is tied to input pixel space

It behaves like a spatial sampling clock inside the detector.


I will attach:

  • YOLOv8 pattern on the original 1700×1200 image, for one defect I tested (85 px pattern, first image)

  • YOLOv8 640×640 warp pattern on that same defect (32 px pattern and multiples, second image)

Please ask any questions you need; all help is welcome. Thank you so much for the help.

What you’re seeing is a classic stride/phase (aliasing) effect from strided downsampling + multi-scale heads, and ~32 px is the smoking gun: it matches the largest feature-map stride (P5, stride 32) used by these detectors.

In Ultralytics YOLO, the backbone/neck produces feature maps at strides like 8/16/32 (for a 640 input), and the head makes decisions on those grids. A 1‑px input shift changes the phase of the object relative to those grids, so the same object is “sampled” slightly differently by the conv stack; confidence can oscillate even when the box stays matched and recall stays ~1.0. The strong harmonics (64/96/128…) are also consistent with mixing multiple strides plus repeated stride‑2 stages (and nonlinearities) in the backbone/neck.
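You can see the mechanism in a toy 1-D model (just plain subsampling as a stand-in for the stride-32 feature lattice, no actual network involved):

```python
import numpy as np

def strided_sample(x, stride=32):
    """Toy stand-in for a stride-32 feature lattice: keep every stride-th pixel."""
    return x[::stride]

rng = np.random.default_rng(0)
x = rng.standard_normal(640)             # a 1-D "image" row

base = strided_sample(x)                 # 20 samples, like a 20x20 P5 grid at 640
one_px = strided_sample(np.roll(x, 1))   # 1 px shift: a different lattice phase
period = strided_sample(np.roll(x, 32))  # 32 px shift: same phase, cells rolled by one

print(np.allclose(one_px, base))               # False: 1 px changes what P5 "sees"
print(np.allclose(period, np.roll(base, 1)))   # True: shifting by the stride realigns it
```

The real network interpolates rather than point-samples, so confidence oscillates smoothly instead of jumping, but the 32 px phase structure is the same.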

This is much more likely to be driven by FPN stride structure / effective sampling grid than by NMS. NMS can add jitter if you sometimes match a different nearby candidate, but a clean 32‑px fundamental across images usually points to the network’s internal sampling lattice.

RT-DETR showing the same behavior is also expected: despite the transformer decoder, RT-DETR still relies on CNN/FPN-like multi-scale feature maps coming from a strided backbone/encoder, so it inherits the same translation variance. The YOLOv8 vs. RT-DETRv2 deep dive touches on these architectural differences, but both ultimately consume multi-scale grids.

A quick sanity check you can run is to print the model strides and confirm the “32” is present:

from ultralytics import YOLO

model = YOLO("yolov8n.pt")
print(model.model.stride)  # typically tensor([ 8., 16., 32.])

If you want to isolate whether post-processing contributes at all, try the same experiment with Ultralytics YOLO26 (native end-to-end, NMS-free) from the Ultralytics YOLO26 model docs. If the periodicity remains (often it will), that’s strong evidence it’s coming from the feature sampling/stride, not NMS.

If you share your exact predict() args (especially imgsz, conf, iou, and whether you’re using letterbox vs scale_fill) and one sample image, I can suggest a couple targeted ablations to confirm which stride level is dominating your matched detection.

Hello, thank you so much for answering!

I have already done some experiments such as:

  • Using YOLOv11, YOLOv8 at several model sizes, and YOLO26; they all exhibit the same periodicity pattern.

  • Switching the stride from 2 to 1 in the first conv layer in yolov8.yaml: essentially the same pattern emerged, which makes me think the FPN-like multi-scale feature maps are the main cause of this pattern. Even if the backbone stride decreases, I concluded the FPN still creates a 20×20 P5 feature map, and that causes the pattern. With this change, print(model.model.stride) prints tensor([ 4.,  8., 16.]) instead of tensor([ 8., 16., 32.]) like all the others.

  • I also used different image sizes to confirm the findings. For example, when I build these 1 px-shift datasets for 640×640 images (about 640 − H images per defect, with H being the defect height plus a small boundary so the defect is never partially cut off at the top and bottom borders), the pattern is 32 px; for 960×960 it becomes 48 px; for 1280×1280 it becomes 64 px; and for 320×320 it becomes 16 px. This aligns with the largest feature-map stride (P5), which creates 20×20 feature maps, confirming the formula image_size / 20 = pattern.
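In numbers, assuming (as in my conclusion) that the P5 grid (imgsz / 32 = 20 cells) sets the fundamental:

```python
STRIDE, IMGSZ = 32, 640

for size in (320, 640, 960, 1280):
    # period in original-image pixels: one stride cell scaled back from input space
    period = STRIDE * size / IMGSZ   # equivalently size / (IMGSZ / STRIDE) = size / 20
    print(f"{size}x{size} -> ~{period:.0f} px")
```

This reproduces the observed 16 / 32 / 48 / 64 px progression.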

For the model.predict(): since YOLO and RT-DETR by default preprocess differently (YOLO adds padding/letterbox, RT-DETR uses scale_fill=True), the differences in patterns make sense when the image is not square. As I showed, these pattern differences basically disappear when the images are square (so the inference pre-processing itself is not the cause). The fact that you say this is expected makes me more confident in my conclusions: although architecturally different, both models rely on FPN-like multi-scale feature maps.

Lastly, the model.predict() call I am using is shown in the annexed image, with the default imgsz=640. I also tested with imgsz=1280 and did not see any significant differences in the patterns.

Thank you so much for all your help. I would really appreciate any further suggestions or knowledge you can share on this topic. Particularly relevant, though I am sure technically very complicated, is whether this pattern could be mitigated, or whether, architecturally, it is simply not possible in YOLO.

Hello, I am back again to add some more info.

I tested the two following YOLOv8 architectural modification strategies:

A.

Change made

  • Kept the original backbone and neck, including the stride-32 (P5/32) feature map computation.

  • Modified only the detection head so that Detect uses 2 inputs instead of 3:

    • Detect on P3/8 and P4/16

    • No Detect on P5/32

Result

  • The confidence signal still showed a strong ~32 px fundamental across datasets.

  • Harmonics (64, 96, 128, …) remained strong.

  • Conclusion from this experiment alone:

    • Removing the P5 prediction branch does not remove the ~32 px periodicity.

Interpretation

This indicates the ~32 px effect is not caused by the P5 detection output grid specifically. The stride-32 structure can still influence confidence because stride-32 features still exist and can leak into P4/P3 via FPN/PAN fusion.

Yolov8.yaml used (backbone kept the same):


head:

  # Top-down

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]

  - [[-1, 6], 1, Concat, [1]]         # cat backbone P4

  - [-1, 3, C2f, [512]]               # 12



  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]

  - [[-1, 4], 1, Concat, [1]]         # cat backbone P3

  - [-1, 3, C2f, [256]]               # 15 (P3/8)


  # Bottom-up (PAN) to get P4 again

  - [-1, 1, Conv, [256, 3, 2]]

  - [[-1, 12], 1, Concat, [1]]        # cat head P4

  - [-1, 3, C2f, [512]]               # 18 (P4/16)


  # Detect only on P3 and P4

  - [[15, 18], 1, Detect, [nc]]       # Detect(P3, P4)


B

Change made

  • Removed the stride-32 stage in the backbone so the network never produces a 20×20 map at 640.

  • Removed all neck/head paths that would consume or reconstruct stride-32.

  • Built a consistent 2-level FPN/PAN using only:

    • P3/8

    • P4/16

Result

  • The dominant periodicity shifted from ~32 px to ~16 px in every dataset I tested.

  • 32 px remained present mainly as a harmonic (multiple of 16), but it was no longer the fundamental.

Interpretation

This is the causal “smoking gun”:

  • When the largest stride in the feature hierarchy is 32, the fundamental is ~32 px.

  • When the largest stride is reduced to 16, the fundamental becomes ~16 px.

So the periodicity is driven by the coarsest stride lattice present in the backbone/neck feature sampling, not by the P5 detection head itself.

Yolov8.yaml used:

backbone:

  # [from, repeats, module, args]

  - [-1, 1, Conv, [64, 3, 2]]        # 0 P1/2

  - [-1, 1, Conv, [128, 3, 2]]       # 1 P2/4

  - [-1, 3, C2f, [128, True]]        # 2

  - [-1, 1, Conv, [256, 3, 2]]       # 3 P3/8

  - [-1, 6, C2f, [256, True]]        # 4  (P3)

  - [-1, 1, Conv, [512, 3, 2]]       # 5 P4/16

  - [-1, 6, C2f, [512, True]]        # 6  (P4)

  - [-1, 1, SPPF, [512, 5]]          # 7  (still P4/16)




head:

  # Top-down: P4 -> P3

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]  # 8  (P4->P3)

  - [[-1, 4], 1, Concat, [1]]                   # 9  cat backbone P3 (both 80x80)

  - [-1, 3, C2f, [256]]                         # 10 (P3/8)




  # Bottom-up: P3 -> P4

  - [-1, 1, Conv, [256, 3, 2]]                  # 11 (P3->P4)

  - [[-1, 6], 1, Concat, [1]]                   # 12 cat backbone P4 (both 40x40)

  - [-1, 3, C2f, [512]]                         # 13 (P4/16)




  # Detect on P3 and P4

  - [[10, 13], 1, Detect, [nc]]                 # 14 Detect(P3, P4)

This makes me wonder why simply changing the initial Conv stride from 2 to 1 on the default YOLO did not change anything in the previous experiment, where I kept the head. The next experiment I will run is to change the first conv layer from stride=2 to stride=1 again, but in variation A, where the P5 head doesn't exist.

Edit: Just re-tested on 640×640 images, and simply changing the first conv stride from 2 to 1 does make the 16 px pattern the prominent one, over 32 px.

Any feedback or knowledge on my experiments is greatly appreciated.